LRUs can be drained in several ways. One of them may disturb isolated
workloads because it queues a work at any time on any target CPU,
whether that CPU is running in nohz_full mode or not.

Avoid that disturbance for isolated tasks by draining the LRUs upon
return to userspace, using the isolated task work framework.

Note that this is inherently racy against lru_add_drain_all() remotely
queueing the per-CPU drain work, so the undesired disturbance is only
avoided *most of the time*.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/swap.h | 1 +
kernel/sched/isolation.c | 1 +
mm/swap.c | 5 ++++-
3 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bd450023b9a4..bd6169c9cc14 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -393,6 +393,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
+extern void lru_add_and_bh_lrus_drain(void);
 void folio_deactivate(struct folio *folio);
 void folio_mark_lazyfree(struct folio *folio);
 extern void swap_setup(void);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 410df1fedc9d..68c70bea99e7 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -257,6 +257,7 @@ __setup("isolcpus=", housekeeping_isolcpus_setup);
 #if defined(CONFIG_NO_HZ_FULL)
 static void isolated_task_work(struct callback_head *head)
 {
+	lru_add_and_bh_lrus_drain();
 }
 
 int __isolated_task_work_queue(void)
diff --git a/mm/swap.c b/mm/swap.c
index 67786cb77130..a4d7e3dc2a66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
 #include <linux/page_idle.h>
 #include <linux/local_lock.h>
 #include <linux/buffer_head.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
@@ -521,6 +522,8 @@ void folio_add_lru(struct folio *folio)
 	fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
 	folio_batch_add_and_move(fbatch, folio, lru_add_fn);
 	local_unlock(&cpu_fbatches.lock);
+
+	isolated_task_work_queue();
 }
 EXPORT_SYMBOL(folio_add_lru);
@@ -765,7 +768,7 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+void lru_add_and_bh_lrus_drain(void)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
--
2.45.2
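For context, the isolated task work framework referenced above is introduced
earlier in this series and is not shown in this patch. The sketch below is an
assumption about how such a queueing helper can be built on top of the generic
task_work machinery, for illustration only: the isolated_work field in
task_struct and the double-queue handling are hypothetical, while
init_task_work(), task_work_add() and TWA_RESUME are the existing kernel APIs
that run a callback on the next return to userspace.

#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/task_work.h>
#include <linux/sched/isolation.h>

/* Hypothetical sketch, not the series implementation. */
static void isolated_task_work(struct callback_head *head)
{
	lru_add_and_bh_lrus_drain();
}

int __isolated_task_work_queue(void)
{
	/* Nothing to defer on housekeeping (non-nohz_full) CPUs */
	if (housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_TICK))
		return 0;

	/*
	 * current->isolated_work is a hypothetical per-task callback_head;
	 * a real implementation must also avoid queueing it twice.
	 */
	init_task_work(&current->isolated_work, isolated_task_work);

	/* TWA_RESUME runs the callback on the next return to userspace */
	return task_work_add(current, &current->isolated_work, TWA_RESUME);
}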
On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> LRUs can be drained in several ways. One of them may disturb isolated
> workloads because it queues a work at any time on any target CPU,
> whether that CPU is running in nohz_full mode or not.
>
> Avoid that disturbance for isolated tasks by draining the LRUs upon
> return to userspace, using the isolated task work framework.
>
> Note that this is inherently racy against lru_add_drain_all() remotely
> queueing the per-CPU drain work, so the undesired disturbance is only
> avoided *most of the time*.

Can we simply not schedule flushing on remote CPUs and leave that to the
"return to the userspace" path?

I do not think we rely on LRU cache flushing for correctness purposes
anywhere.

Please also CC the linux-mm ML once the core infrastructure is agreed on.
--
Michal Hocko
SUSE Labs
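For illustration only, that suggestion would roughly amount to the change
below in the CPU walk of __lru_add_drain_all(). This is a sketch paraphrased
from memory rather than the actual mm/swap.c code, the wrapper name is made
up, and the cpu_is_isolated() gate on !force_all_cpus is the assumed
modification, not anything posted in this series:

/*
 * Hypothetical sketch: skip queueing the drain work on isolated CPUs and
 * let them drain their own caches on return to userspace, unless a full
 * drain is explicitly forced (e.g. by lru_cache_disable()).
 */
static void lru_add_drain_all_sketch(bool force_all_cpus)
{
	static struct cpumask has_work;
	int cpu;

	cpumask_clear(&has_work);

	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		/* Assumed change: leave isolated/nohz_full CPUs alone */
		if (!force_all_cpus && cpu_is_isolated(cpu))
			continue;

		if (cpu_needs_drain(cpu)) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			queue_work_on(cpu, mm_percpu_wq, work);
			__cpumask_set_cpu(cpu, &has_work);
		}
	}

	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));
}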
On Tue, Jun 25, 2024 at 04:20:01PM +0200, Michal Hocko wrote:
> On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> > LRUs can be drained in several ways. One of them may disturb isolated
> > workloads because it queues a work at any time on any target CPU,
> > whether that CPU is running in nohz_full mode or not.
> >
> > Avoid that disturbance for isolated tasks by draining the LRUs upon
> > return to userspace, using the isolated task work framework.
> >
> > Note that this is inherently racy against lru_add_drain_all() remotely
> > queueing the per-CPU drain work, so the undesired disturbance is only
> > avoided *most of the time*.
>
> Can we simply not schedule flushing on remote CPUs and leave that to the
> "return to the userspace" path?

Do you mean I should add a call on the return-to-userspace path, or can I
expect it to be drained at some point already?

The other limitation of the task work approach is that if the task that
queued the work goes to sleep and another task runs on the CPU and does
isolated work in userspace, the drain doesn't happen. Whether that is a
real problem or not, I have no idea.

> I do not think we rely on LRU cache flushing for correctness purposes
> anywhere.
>
> Please also CC the linux-mm ML once the core infrastructure is agreed on.

Ok, thanks.

> --
> Michal Hocko
> SUSE Labs
On Wed 03-07-24 14:52:21, Frederic Weisbecker wrote:
> On Tue, Jun 25, 2024 at 04:20:01PM +0200, Michal Hocko wrote:
> > On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> > > LRUs can be drained in several ways. One of them may disturb isolated
> > > workloads because it queues a work at any time on any target CPU,
> > > whether that CPU is running in nohz_full mode or not.
> > >
> > > Avoid that disturbance for isolated tasks by draining the LRUs upon
> > > return to userspace, using the isolated task work framework.
> > >
> > > Note that this is inherently racy against lru_add_drain_all() remotely
> > > queueing the per-CPU drain work, so the undesired disturbance is only
> > > avoided *most of the time*.
> >
> > Can we simply not schedule flushing on remote CPUs and leave that to the
> > "return to the userspace" path?
>
> Do you mean I should add a call on the return-to-userspace path, or can I
> expect it to be drained at some point already?

I would make the particular per-CPU cache be drained on return to
userspace.

> The other limitation of the task work approach is that if the task that
> queued the work goes to sleep and another task runs on the CPU and does
> isolated work in userspace, the drain doesn't happen. Whether that is a
> real problem or not, I have no idea.

Theoretically there is a problem, because pages sitting on pcp LRU caches
cannot be migrated and some other operations will fail as well. But
practically speaking those pages should mostly be of interest to the
process that allocated them. Page sharing between isolated workloads
sounds like a terrible idea to me. Maybe reality hits us in this regard,
but we can deal with that when we learn about such workloads.

So I wouldn't lose too much sleep over that. We are dealing with isolated
workloads being broken by simple things like fork right now, because that
apparently adds pages to the pcp LRU cache and the draining will happen
sooner or later (very often when the task is already running in
userspace).
--
Michal Hocko
SUSE Labs
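To make the migration point concrete: a folio still parked in a per-CPU
fbatch has not been linked onto an LRU list yet, so the LRU isolation that
migration relies on fails for it until some drain moves it. A minimal
illustration follows; the helper name is made up and the body is loosely
modelled on folio_isolate_lru() in mm/vmscan.c, not copied from it:

/*
 * Illustration only: a folio that is still in a per-CPU fbatch has
 * PG_lru clear, so the test-and-clear below fails and migration has to
 * retry (or give up) until the cache is drained.
 */
static bool folio_can_leave_for_migration(struct folio *folio)
{
	if (!folio_test_clear_lru(folio))
		return false;	/* still batched, not isolatable yet */

	folio_get(folio);	/* pin it, as the real isolation path does */
	return true;
}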
On Thu, Jul 04, 2024 at 03:11:24PM +0200, Michal Hocko wrote:
> On Wed 03-07-24 14:52:21, Frederic Weisbecker wrote:
> > On Tue, Jun 25, 2024 at 04:20:01PM +0200, Michal Hocko wrote:
> > > On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> > > > LRUs can be drained in several ways. One of them may disturb isolated
> > > > workloads because it queues a work at any time on any target CPU,
> > > > whether that CPU is running in nohz_full mode or not.
> > > >
> > > > Avoid that disturbance for isolated tasks by draining the LRUs upon
> > > > return to userspace, using the isolated task work framework.
> > > >
> > > > Note that this is inherently racy against lru_add_drain_all() remotely
> > > > queueing the per-CPU drain work, so the undesired disturbance is only
> > > > avoided *most of the time*.
> > >
> > > Can we simply not schedule flushing on remote CPUs and leave that to the
> > > "return to the userspace" path?
> >
> > Do you mean I should add a call on the return-to-userspace path, or can I
> > expect it to be drained at some point already?
>
> I would make the particular per-CPU cache be drained on return to
> userspace.

And then we need the patchset from Valentin that defers work to kernel
entry?

> > The other limitation of the task work approach is that if the task that
> > queued the work goes to sleep and another task runs on the CPU and does
> > isolated work in userspace, the drain doesn't happen. Whether that is a
> > real problem or not, I have no idea.
>
> Theoretically there is a problem, because pages sitting on pcp LRU caches
> cannot be migrated and some other operations will fail as well. But
> practically speaking those pages should mostly be of interest to the
> process that allocated them. Page sharing between isolated workloads
> sounds like a terrible idea to me. Maybe reality hits us in this regard,
> but we can deal with that when we learn about such workloads.
>
> So I wouldn't lose too much sleep over that. We are dealing with isolated
> workloads being broken by simple things like fork right now, because that
> apparently adds pages to the pcp LRU cache and the draining will happen
> sooner or later (very often when the task is already running in
> userspace).

That sounds good! Thanks.

> --
> Michal Hocko
> SUSE Labs
On 6/25/24 4:20 PM, Michal Hocko wrote:
> On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
>> LRUs can be drained in several ways. One of them may disturb isolated
>> workloads because it queues a work at any time on any target CPU,
>> whether that CPU is running in nohz_full mode or not.
>>
>> Avoid that disturbance for isolated tasks by draining the LRUs upon
>> return to userspace, using the isolated task work framework.
>>
>> Note that this is inherently racy against lru_add_drain_all() remotely
>> queueing the per-CPU drain work, so the undesired disturbance is only
>> avoided *most of the time*.
>
> Can we simply not schedule flushing on remote CPUs and leave that to the
> "return to the userspace" path?
>
> I do not think we rely on LRU cache flushing for correctness purposes
> anywhere.

I guess a drain via lru_cache_disable() should be honored, but that is
also rare.

> Please also CC the linux-mm ML once the core infrastructure is agreed on.
On Wed 26-06-24 15:16:04, Vlastimil Babka wrote:
> On 6/25/24 4:20 PM, Michal Hocko wrote:
> > On Tue 25-06-24 15:52:44, Frederic Weisbecker wrote:
> >> LRUs can be drained in several ways. One of them may disturb isolated
> >> workloads because it queues a work at any time on any target CPU,
> >> whether that CPU is running in nohz_full mode or not.
> >>
> >> Avoid that disturbance for isolated tasks by draining the LRUs upon
> >> return to userspace, using the isolated task work framework.
> >>
> >> Note that this is inherently racy against lru_add_drain_all() remotely
> >> queueing the per-CPU drain work, so the undesired disturbance is only
> >> avoided *most of the time*.
> >
> > Can we simply not schedule flushing on remote CPUs and leave that to the
> > "return to the userspace" path?
> >
> > I do not think we rely on LRU cache flushing for correctness purposes
> > anywhere.
>
> I guess a drain via lru_cache_disable() should be honored, but that is
> also rare.

I do not think we can call it rare, because it can be triggered from
userspace, by NUMA syscalls for example. I think we should either make it
fail and let the caller decide what to do, or make it best effort and
eventually fail the operation if there is no other way. The latter has
the advantage that the failure is lazy as well.

In an ideal world, memory offlining would be a complete no-no for isolated
workloads and mbind calls would not try to migrate memory that has just
been added to the LRU cache. In any case, this would at least require
documenting the limitation.
--
Michal Hocko
SUSE Labs
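For reference, lru_cache_disable() is roughly the forced-drain path being
discussed here. The snippet below is reproduced from memory and simplified,
not verbatim mm/swap.c; it only shows why any skip-isolated-CPUs heuristic
would still have to honor this path (or report failure), since it
deliberately targets every CPU:

/*
 * Roughly what lru_cache_disable() does today (simplified, from memory):
 * once lru_disable_count is elevated, new folios bypass the per-CPU
 * batches, and the forced drain below still reaches every online CPU,
 * isolated or not.
 */
void lru_cache_disable(void)
{
	atomic_inc(&lru_disable_count);

	/* make sure all CPUs observe the elevated counter before draining */
	synchronize_rcu_expedited();

	__lru_add_drain_all(true);	/* force_all_cpus */
}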