Improve RT performance & latency w/ local_lock_t()

[RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t()

Posted by Leonardo Bras 2 years, 6 months ago

The problem:
We have a few scenarios in mm that make use of local_locks() + 
schedule_work_on() due to reduced overhead on the most common local 
references. This scenario is not ideal for RT workloads though: even on 
isolated cpus, those tasks will schedule-out the sensitive RT workloads to 
perform those jobs, and usually cause missing deadlines with this.

The idea:
In PREEMPT_RT, local_locks() end up becoming spinlocks(), so there should 
be no harm in just getting another cpu's spinlock to perform the work 
on the per-cpu structure: the cacheline invalidation will already happen 
due to the usage by schedule_work_on(), and on local functions the locking 
already happens anyway.

This will avoid schedule_work_on(), and thus avoid scheduling-out an 
RT workload. Given the usually brief nature of those scheduled tasks, their 
execution end up being faster than doing their scheduling.

======

While I really belive the solution, there are problems with this patchset, 
which I need your suggestions for improvement:

1) Naming is horrible: I could not think on a good name for the new lock 
functions, so I lazely named it local_lock_n(). 
The naming should have been making clear that we are in fact dealing 
with a local_lock but it can in some scenarios be addressing another cpu's 
local_lock, and thus we need the extra cpu parameter. 

Dealing with this local & remote duality, all I thought was:
mostly_local_lock(), (or local_mostly_lock())
local_maybe_remote_lock()  <- LOL
remote_local_lock()
per_cpu_local_lock()
local_lock_cpu()

Please help !


2) Maybe naming is this hard because the interface is not the best.
My idea was to create a simple interface to easily replace functions 
already in use, but maybe there is something better I could not see.

Please contribute!

3) I am lazely setting work->data without atomic operations, which may 
be bad in some scenario. If so, even thought it can be costly, since it 
happens outside of the hotpath (local per-cpu areas) it should be no 
problem adding the atomic operation for non-RT kernels.

For RT kernels, since the whole operation hapens locally, there should be 
no need of using the atomic set.

Please let me know of any idea, or suggestion that can improve this RFC.

Thanks a lot!
Leo

Leonardo Bras (4):
  Introducing local_lock_n() and local queue & flush
  swap: apply new local_schedule_work_on() interface
  memcontrol: apply new local_schedule_work_on() interface
  slub: apply new local_schedule_work_on() interface

 include/linux/local_lock.h          | 18 ++++++++++
 include/linux/local_lock_internal.h | 52 +++++++++++++++++++++++++++++
 mm/memcontrol.c                     | 17 ++++++----
 mm/slub.c                           | 17 ++++++----
 mm/swap.c                           | 18 +++++-----
 5 files changed, 100 insertions(+), 22 deletions(-)

-- 
2.41.0

Re: [RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t()

Posted by Leonardo Bras Soares Passos 2 years, 6 months ago

CC: Peter Xu

On Sat, Jul 29, 2023 at 5:38 AM Leonardo Bras <leobras@redhat.com> wrote:
>
> The problem:
> We have a few scenarios in mm that make use of local_locks() +
> schedule_work_on() due to reduced overhead on the most common local
> references. This scenario is not ideal for RT workloads though: even on
> isolated cpus, those tasks will schedule-out the sensitive RT workloads to
> perform those jobs, and usually cause missing deadlines with this.
>
> The idea:
> In PREEMPT_RT, local_locks() end up becoming spinlocks(), so there should
> be no harm in just getting another cpu's spinlock to perform the work
> on the per-cpu structure: the cacheline invalidation will already happen
> due to the usage by schedule_work_on(), and on local functions the locking
> already happens anyway.
>
> This will avoid schedule_work_on(), and thus avoid scheduling-out an
> RT workload. Given the usually brief nature of those scheduled tasks, their
> execution end up being faster than doing their scheduling.
>
> ======
>
> While I really belive the solution, there are problems with this patchset,
> which I need your suggestions for improvement:
>
> 1) Naming is horrible: I could not think on a good name for the new lock
> functions, so I lazely named it local_lock_n().
> The naming should have been making clear that we are in fact dealing
> with a local_lock but it can in some scenarios be addressing another cpu's
> local_lock, and thus we need the extra cpu parameter.
>
> Dealing with this local & remote duality, all I thought was:
> mostly_local_lock(), (or local_mostly_lock())
> local_maybe_remote_lock()  <- LOL
> remote_local_lock()
> per_cpu_local_lock()
> local_lock_cpu()
>
> Please help !
>
>
> 2) Maybe naming is this hard because the interface is not the best.
> My idea was to create a simple interface to easily replace functions
> already in use, but maybe there is something better I could not see.
>
> Please contribute!
>
> 3) I am lazely setting work->data without atomic operations, which may
> be bad in some scenario. If so, even thought it can be costly, since it
> happens outside of the hotpath (local per-cpu areas) it should be no
> problem adding the atomic operation for non-RT kernels.
>
> For RT kernels, since the whole operation hapens locally, there should be
> no need of using the atomic set.
>
> Please let me know of any idea, or suggestion that can improve this RFC.
>
> Thanks a lot!
> Leo
>
> Leonardo Bras (4):
>   Introducing local_lock_n() and local queue & flush
>   swap: apply new local_schedule_work_on() interface
>   memcontrol: apply new local_schedule_work_on() interface
>   slub: apply new local_schedule_work_on() interface
>
>  include/linux/local_lock.h          | 18 ++++++++++
>  include/linux/local_lock_internal.h | 52 +++++++++++++++++++++++++++++
>  mm/memcontrol.c                     | 17 ++++++----
>  mm/slub.c                           | 17 ++++++----
>  mm/swap.c                           | 18 +++++-----
>  5 files changed, 100 insertions(+), 22 deletions(-)
>
> --
> 2.41.0
>