[RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs

Posted by Frederic Weisbecker 1 year, 5 months ago
LRUs can be drained in several ways. One of them may disturb isolated
workloads by queueing a work at any time on any target CPU, whether it
runs in nohz_full mode or not.

Prevent that on isolated tasks by draining LRUs upon resuming to
userspace, using the isolated task work framework.

It's worth noting that this is inherently racy against
lru_add_drain_all() remotely queueing the per-CPU drain work, and
therefore it prevents the undesired disturbance only
*most of the time*.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/swap.h     | 1 +
 kernel/sched/isolation.c | 1 +
 mm/swap.c                | 5 ++++-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index bd450023b9a4..bd6169c9cc14 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -393,6 +393,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
+extern void lru_add_and_bh_lrus_drain(void);
 void folio_deactivate(struct folio *folio);
 void folio_mark_lazyfree(struct folio *folio);
 extern void swap_setup(void);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 410df1fedc9d..68c70bea99e7 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -257,6 +257,7 @@ __setup("isolcpus=", housekeeping_isolcpus_setup);
 #if defined(CONFIG_NO_HZ_FULL)
 static void isolated_task_work(struct callback_head *head)
 {
+	lru_add_and_bh_lrus_drain();
 }
 
 int __isolated_task_work_queue(void)
diff --git a/mm/swap.c b/mm/swap.c
index 67786cb77130..a4d7e3dc2a66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
 #include <linux/page_idle.h>
 #include <linux/local_lock.h>
 #include <linux/buffer_head.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -521,6 +522,8 @@ void folio_add_lru(struct folio *folio)
 	fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
 	folio_batch_add_and_move(fbatch, folio, lru_add_fn);
 	local_unlock(&cpu_fbatches.lock);
+
+	isolated_task_work_queue();
 }
 EXPORT_SYMBOL(folio_add_lru);
 
@@ -765,7 +768,7 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+void lru_add_and_bh_lrus_drain(void)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
-- 
2.45.2
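
For context: the isolated task work framework used above comes from the
earlier patches in this series and is what makes isolated_task_work()
run on the next return to userspace. A simplified, hypothetical sketch
of the queueing side (the isolated_work field name is invented here;
see patches 1-5 for the real code):

int __isolated_task_work_queue(void)
{
	/* Kernel threads never resume to userspace. */
	if (current->flags & PF_KTHREAD)
		return -EINVAL;

	/*
	 * Attach the work to the current *task* (this assumes the
	 * callback_head was set up with init_task_work() elsewhere),
	 * so that isolated_task_work() runs when this task next
	 * returns to userspace.
	 */
	return task_work_add(current, &current->isolated_work, TWA_RESUME);
}

Since the work is attached to a task rather than to a CPU, a task that
blocks in the kernel before resuming leaves the drain pending; this is
the limitation discussed further down the thread.
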
Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
Posted by Michal Hocko 1 year, 5 months ago
On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> LRUs can be drained in several ways. One of them may disturb isolated
> workloads by queueing a work at any time on any target CPU, whether it
> runs in nohz_full mode or not.
> 
> Prevent that on isolated tasks by draining LRUs upon resuming to
> userspace, using the isolated task work framework.
> 
> It's worth noting that this is inherently racy against
> lru_add_drain_all() remotely queueing the per-CPU drain work, and
> therefore it prevents the undesired disturbance only
> *most of the time*.

Can we simply not schedule flushing on remote CPUs and leave that to the
"return to the userspace" path?

I do not think we rely on LRU cache flushing for correctness purposes anywhere.
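
An untested sketch of that idea against mm/swap.c (assuming
cpu_needs_drain(), the existing helper that gates the remote drain
work, and cpu_is_isolated() from linux/cpuset.h are the right hooks;
context lines are approximate):

--- a/mm/swap.c
+++ b/mm/swap.c
@@ static bool cpu_needs_drain(unsigned int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 
+	/*
+	 * nohz_full CPUs drain on return to userspace, so avoid
+	 * queueing the remote drain work on them altogether.
+	 */
+	if (cpu_is_isolated(cpu))
+		return false;
+
 	/* Check these in order of likelihood that they're not zero */
 	return folio_batch_count(&fbatches->lru_add) ||

lru_add_drain_all() would then never kick isolated CPUs, and their
caches would be left to the return-to-userspace drain.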

Please also CC linux MM ML once the core infrastructure is agreed on.
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
Posted by Frederic Weisbecker 1 year, 5 months ago
On Tue, Jun 25, 2024 at 04:20:01PM +0200, Michal Hocko wrote:
> On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> > LRUs can be drained in several ways. One of them may disturb isolated
> > workloads by queueing a work at any time on any target CPU, whether it
> > runs in nohz_full mode or not.
> > 
> > Prevent that on isolated tasks by draining LRUs upon resuming to
> > userspace, using the isolated task work framework.
> > 
> > It's worth noting that this is inherently racy against
> > lru_add_drain_all() remotely queueing the per-CPU drain work, and
> > therefore it prevents the undesired disturbance only
> > *most of the time*.
> 
> Can we simply not schedule flushing on remote CPUs and leave that to the
> "return to the userspace" path?

Do you mean I should add a call to the return-to-userspace path, or can
I expect it to be drained at some point already?

The other limitation with that task work thing is that if the task
queueing the work actually goes to sleep and another task runs on the CPU
and does isolated work in userspace, the drain doesn't happen. Now whether
that is a real problem or not, I have no idea.

> 
> I do not think we rely on LRU cache flushing for correctness purposes anywhere.
> 
> Please also CC linux MM ML once the core infrastructure is agreed on.

Ok, thanks.

> -- 
> Michal Hocko
> SUSE Labs
Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
Posted by Michal Hocko 1 year, 5 months ago
On Wed 03-07-24 14:52:21, Frederic Weisbecker wrote:
> On Tue, Jun 25, 2024 at 04:20:01PM +0200, Michal Hocko wrote:
> > On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> > > LRUs can be drained in several ways. One of them may disturb isolated
> > > workloads by queueing a work at any time on any target CPU, whether it
> > > runs in nohz_full mode or not.
> > > 
> > > Prevent that on isolated tasks by draining LRUs upon resuming to
> > > userspace, using the isolated task work framework.
> > > 
> > > It's worth noting that this is inherently racy against
> > > lru_add_drain_all() remotely queueing the per-CPU drain work, and
> > > therefore it prevents the undesired disturbance only
> > > *most of the time*.
> > 
> > Can we simply not schedule flushing on remote CPUs and leave that to the
> > "return to the userspace" path?
> 
> > Do you mean I should add a call to the return-to-userspace path, or can
> > I expect it to be drained at some point already?

I would make the particular per-CPU cache be drained on return to
userspace.

> The other limitation with that task work thing is that if the task
> queueing the work actually goes to sleep and another task runs on the CPU
> and does isolated work in userspace, the drain doesn't happen. Now whether
> that is a real problem or not, I have no idea.

Theoretically there is a problem because pages sitting on pcp LRU caches
cannot be migrated and some other operations will fail as well. But
practically speaking, those pages should mostly be of interest only to
the process that allocated them. Page sharing between isolated
workloads sounds like a terrible idea to me. Maybe reality hits us in
this regard but we can deal with that when we learn about those
workloads.

So I wouldn't lose too much sleep over that. We are dealing with those
isolated workloads being broken by simple things like fork now because
that apparently adds pages to the pcp LRU cache and draining will happen
sooner or later (very often when the task is already running in
userspace).

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
Posted by Frederic Weisbecker 1 year, 5 months ago
On Thu, Jul 04, 2024 at 03:11:24PM +0200, Michal Hocko wrote:
> On Wed 03-07-24 14:52:21, Frederic Weisbecker wrote:
> > On Tue, Jun 25, 2024 at 04:20:01PM +0200, Michal Hocko wrote:
> > > On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> > > > LRUs can be drained in several ways. One of them may disturb isolated
> > > > workloads by queueing a work at any time on any target CPU, whether it
> > > > runs in nohz_full mode or not.
> > > > 
> > > > Prevent that on isolated tasks by draining LRUs upon resuming to
> > > > userspace, using the isolated task work framework.
> > > > 
> > > > It's worth noting that this is inherently racy against
> > > > lru_add_drain_all() remotely queueing the per-CPU drain work, and
> > > > therefore it prevents the undesired disturbance only
> > > > *most of the time*.
> > > 
> > > Can we simply not schedule flushing on remote CPUs and leave that to the
> > > "return to the userspace" path?
> > 
> > > Do you mean I should add a call to the return-to-userspace path, or can
> > > I expect it to be drained at some point already?
> 
> I would make the particular per-CPU cache be drained on return to
> userspace.

And then we need the patchset from Valentin that defers work to kernel entry?

> 
> > The other limitation with that task work thing is that if the task
> > queueing the work actually goes to sleep and another task runs on the CPU
> > and does isolated work in userspace, the drain doesn't happen. Now whether
> > that is a real problem or not, I have no idea.
> 
> Theoretically there is a problem because pages sitting on pcp LRU caches
> cannot be migrated and some other operations will fail as well. But
> practically speaking, those pages should mostly be of interest only to
> the process that allocated them. Page sharing between isolated
> workloads sounds like a terrible idea to me. Maybe reality hits us in
> this regard but we can deal with that when we learn about those
> workloads.
> 
> So I wouldn't lose too much sleep over that. We are dealing with those
> isolated workloads being broken by simple things like fork now because
> that apparently adds pages to the pcp LRU cache and draining will happen
> sooner or later (very often when the task is already running in
> userspace).

That sounds good!

Thanks.

> 
> -- 
> Michal Hocko
> SUSE Labs
Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
Posted by Vlastimil Babka 1 year, 5 months ago
On 6/25/24 4:20 PM, Michal Hocko wrote:
> On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
>> LRUs can be drained in several ways. One of them may disturb isolated
>> workloads by queueing a work at any time on any target CPU, whether it
>> runs in nohz_full mode or not.
>> 
>> Prevent that on isolated tasks by draining LRUs upon resuming to
>> userspace, using the isolated task work framework.
>> 
>> It's worth noting that this is inherently racy against
>> lru_add_drain_all() remotely queueing the per-CPU drain work, and
>> therefore it prevents the undesired disturbance only
>> *most of the time*.
> 
> Can we simply not schedule flushing on remote CPUs and leave that to the
> "return to the userspace" path?
> 
> I do not think we rely on LRU cache flushing for correctness purposes anywhere.

I guess the drain via lru_cache_disable() should be honored, but it's also rare.
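
For reference, lru_cache_disable() (condensed from mm/swap.c, comment
paraphrased) forces a drain on every CPU and keeps the pcp caches
disabled afterwards, which is why that path can't simply be skipped:

void lru_cache_disable(void)
{
	atomic_inc(&lru_disable_count);
	/*
	 * Wait until every CPU observes lru_disable_count != 0, so
	 * nothing keeps batching new pages past this point.
	 */
	synchronize_rcu_expedited();
#ifdef CONFIG_SMP
	__lru_add_drain_all(true);	/* force: drain all CPUs */
#else
	lru_add_and_bh_lrus_drain();
#endif
}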

> Please also CC linux MM ML once the core infrastructure is agreed on.
Re: [RFC PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
Posted by Michal Hocko 1 year, 5 months ago
On Wed 26-06-24 15:16:04, Vlastimil Babka wrote:
> On 6/25/24 4:20 PM, Michal Hocko wrote:
> > On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> >> LRUs can be drained in several ways. One of them may disturb isolated
> >> workloads by queueing a work at any time on any target CPU, whether it
> >> runs in nohz_full mode or not.
> >> 
> >> Prevent that on isolated tasks by draining LRUs upon resuming to
> >> userspace, using the isolated task work framework.
> >> 
> >> It's worth noting that this is inherently racy against
> >> lru_add_drain_all() remotely queueing the per-CPU drain work, and
> >> therefore it prevents the undesired disturbance only
> >> *most of the time*.
> > 
> > Can we simply not schedule flushing on remote CPUs and leave that to the
> > "return to the userspace" path?
> > 
> > I do not think we rely on LRU cache flushing for correctness purposes anywhere.
> 
> I guess the drain via lru_cache_disable() should be honored, but it's also rare.

I do not think we can call it rare because it can be triggered from
userspace, by NUMA syscalls for example. I think we should either
make it fail and let the caller decide what to do, or make it best
effort and eventually fail the operation if there is no other way. The
latter has the advantage that the failure is lazy as well. In an ideal
world, memory offlining would be a complete no-no in isolated workloads
and mbind calls would not try to migrate memory that has just been
added to the LRU cache. In any case this would at least require
documenting the limitation.
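
Concretely, the failing variant could look something like this
untested sketch in mm/swap.c (lru_cache_disable_or_fail() is an
invented name, and the check is racy against pages being added right
after it, which is where the lazy failure comes in):

int lru_cache_disable_or_fail(void)
{
	int cpu;

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		/*
		 * Pages are pending on an isolated CPU: don't kick it,
		 * report back and let the caller (e.g. mbind()) decide
		 * whether to fail the operation.
		 */
		if (cpu_is_isolated(cpu) && cpu_needs_drain(cpu)) {
			cpus_read_unlock();
			return -EBUSY;
		}
	}
	cpus_read_unlock();

	lru_cache_disable();
	return 0;
}
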
-- 
Michal Hocko
SUSE Labs