[RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface
Posted by Leonardo Bras 2 years, 6 months ago
Make use of the new local_*lock_n*() and local_schedule_work_on() interfaces
to improve performance and latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different CPU, replace
local_*lock*() with local_*lock_n*(), and replace schedule_work_on() with
local_schedule_work_on(). Likewise, replace flush_work() with
local_flush_work().

This should have no relevant performance impact on non-RT kernels:
for functions that may be scheduled on a different CPU, the local_*lock's
this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).
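
For reference, a rough sketch of the semantics this patch relies on. The
real helpers are introduced in patch 1/4; the bodies below are only meant
to illustrate the idea and may not match that patch:

	/* Illustrative only -- see patch 1/4 for the real definitions. */
	#ifdef CONFIG_PREEMPT_RT
	/* On RT, local_lock_t is backed by a spinlock, so the instance of
	 * any CPU can be taken through per_cpu_ptr(): */
	#define local_lock_n(l, cpu)	spin_lock(per_cpu_ptr(l, cpu))
	#define local_unlock_n(l, cpu)	spin_unlock(per_cpu_ptr(l, cpu))
	#else
	/* On !RT the work always runs on @cpu itself, so this is just the
	 * usual local_lock() path: */
	#define local_lock_n(l, cpu)	local_lock(l)
	#define local_unlock_n(l, cpu)	local_unlock(l)
	#endif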

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 mm/swap.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..a79f2091eae5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -760,11 +760,11 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+static void lru_add_and_bh_lrus_drain(int cpu)
 {
-	local_lock(&cpu_fbatches.lock);
-	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&cpu_fbatches.lock);
+	local_lock_n(&cpu_fbatches.lock, cpu);
+	lru_add_drain_cpu(cpu);
+	local_unlock_n(&cpu_fbatches.lock, cpu);
 	invalidate_bh_lrus_cpu();
 	mlock_drain_local();
 }
@@ -782,9 +782,9 @@ void lru_add_drain_cpu_zone(struct zone *zone)
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
-static void lru_add_drain_per_cpu(struct work_struct *dummy)
+static void lru_add_drain_per_cpu(struct work_struct *w)
 {
-	lru_add_and_bh_lrus_drain();
+	lru_add_and_bh_lrus_drain(w->data.counter);
 }
 
 static bool cpu_needs_drain(unsigned int cpu)
@@ -888,13 +888,13 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 
 		if (cpu_needs_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
-			queue_work_on(cpu, mm_percpu_wq, work);
+			local_queue_work_on(cpu, mm_percpu_wq, work);
 			__cpumask_set_cpu(cpu, &has_work);
 		}
 	}
 
 	for_each_cpu(cpu, &has_work)
-		flush_work(&per_cpu(lru_add_drain_work, cpu));
+		local_flush_work(&per_cpu(lru_add_drain_work, cpu));
 
 done:
 	mutex_unlock(&lock);
@@ -941,7 +941,7 @@ void lru_cache_disable(void)
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
-	lru_add_and_bh_lrus_drain();
+	lru_add_and_bh_lrus_drain(smp_processor_id());
 #endif
 }
 
-- 
2.41.0
Re: [RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface
Posted by Marcelo Tosatti 2 years, 6 months ago
On Sat, Jul 29, 2023 at 05:37:33AM -0300, Leonardo Bras wrote:
> Make use of the new local_*lock_n*() and local_schedule_work_on() interfaces
> to improve performance and latency on PREEMPT_RT kernels.
> 
> For functions that may be scheduled on a different CPU, replace
> local_*lock*() with local_*lock_n*(), and replace schedule_work_on() with
> local_schedule_work_on(). Likewise, replace flush_work() with
> local_flush_work().
> 
> This should have no relevant performance impact on non-RT kernels:
> for functions that may be scheduled on a different CPU, the local_*lock's
> this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  mm/swap.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)

Leo,

I think the interruptions should rather be removed for both
CONFIG_PREEMPT_RT AND !CONFIG_PREEMPT_RT.

The impact of grabbing locks must be properly analyzed and not
"rejected blindly".

Example:

commit 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jun 24 13:54:23 2022 +0100

    mm/page_alloc: replace local_lock with normal spinlock
    
    struct per_cpu_pages is no longer strictly local as PCP lists can be
    drained remotely using a lock for protection.  While the use of local_lock
    works, it goes against the intent of local_lock which is for "pure CPU
    local concurrency control mechanisms and not suited for inter-CPU
    concurrency control" (Documentation/locking/locktypes.rst)
    
    local_lock protects against migration between when the percpu pointer is
    accessed and the pcp->lock acquired.  The lock acquisition is a preemption
    point so in the worst case, a task could migrate to another NUMA node and
    accidentally allocate remote memory.  The main requirement is to pin the
    task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.
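
(For illustration, the pattern that commit established, heavily simplified --
the names below are made up and this is not the actual mm/page_alloc code:)

	struct pcp_sketch {
		spinlock_t lock;	/* plain spinlock, not a local_lock */
		struct list_head lists;
	};
	static DEFINE_PER_CPU(struct pcp_sketch, pcp_sketch);

	/* Any CPU can drain another CPU's lists without queueing work there: */
	static void drain_remote(int cpu)
	{
		struct pcp_sketch *pcp = per_cpu_ptr(&pcp_sketch, cpu);

		spin_lock(&pcp->lock);
		/* ... free whatever sits on pcp->lists ... */
		spin_unlock(&pcp->lock);
	}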
Re: [RFC PATCH 2/4] swap: apply new local_schedule_work_on() interface
Posted by Leonardo Brás 2 years, 5 months ago
On Tue, 2023-08-08 at 16:39 -0300, Marcelo Tosatti wrote:
> On Sat, Jul 29, 2023 at 05:37:33AM -0300, Leonardo Bras wrote:
> > Make use of the new local_*lock_n*() and local_schedule_work_on() interfaces
> > to improve performance and latency on PREEMPT_RT kernels.
> > 
> > For functions that may be scheduled on a different CPU, replace
> > local_*lock*() with local_*lock_n*(), and replace schedule_work_on() with
> > local_schedule_work_on(). Likewise, replace flush_work() with
> > local_flush_work().
> > 
> > This should have no relevant performance impact on non-RT kernels:
> > for functions that may be scheduled on a different CPU, the local_*lock's
> > this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  mm/swap.c | 18 +++++++++---------
> >  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> Leo,
> 
> I think the interruptions should rather be removed for both
> CONFIG_PREEMPT_RT AND !CONFIG_PREEMPT_RT.
> 
> The impact of grabbing locks must be properly analyzed and not
> "rejected blindly".

Yes, I agree with the idea, but we have been seeing a lot of pushback on
these ideas lately, as maintainers perceive this as a big change.

My idea here is to provide a general way to improve CPU isolation in
PREEMPT_RT scenarios, even though there is resistance for the !PREEMPT_RT case.

As I commented, PREEMPT_RT's local_locks already take spinlocks in these hot
paths, and yet we still use schedule_work_on() to get remote work done. This
patch tries to solve that by taking the remote CPU's spinlock instead, saving
a lot of cycles and avoiding an IPI to the remote CPU.
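
Roughly, the intended effect on PREEMPT_RT is the following (simplified
sketch only; the real mechanics of local_queue_work_on()/local_flush_work()
are in patch 1/4):

	/* Today: IPI the remote CPU and wait for its workqueue to run: */
	queue_work_on(cpu, mm_percpu_wq, work);
	flush_work(work);

	/* With this series on PREEMPT_RT, the drain can effectively be done
	 * from the current CPU, under the remote CPU's (spin)lock: */
	local_lock_n(&cpu_fbatches.lock, cpu);
	lru_add_drain_cpu(cpu);
	local_unlock_n(&cpu_fbatches.lock, cpu);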

It looks like a simple solution, improving isolation and performance on
PREEMPT_RT with no visible drawbacks. I agree the interface is not ideal,
and for that I really need your help.

I understand that once this change is merged, we will have more of a
precedent for discussing performance in the !PREEMPT_RT case.

What do you think?

Thanks!
Leo

> 
> Example:
> 
> commit 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8
> Author: Mel Gorman <mgorman@techsingularity.net>
> Date:   Fri Jun 24 13:54:23 2022 +0100
> 
>     mm/page_alloc: replace local_lock with normal spinlock
>     
>     struct per_cpu_pages is no longer strictly local as PCP lists can be
>     drained remotely using a lock for protection.  While the use of local_lock
>     works, it goes against the intent of local_lock which is for "pure CPU
>     local concurrency control mechanisms and not suited for inter-CPU
>     concurrency control" (Documentation/locking/locktypes.rst)
>     
>     local_lock protects against migration between when the percpu pointer is
>     accessed and the pcp->lock acquired.  The lock acquisition is a preemption
>     point so in the worst case, a task could migrate to another NUMA node and
>     accidentally allocate remote memory.  The main requirement is to pin the
>     task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.
>