 arch/arm64/include/asm/percpu.h     |  16 ++
 arch/loongarch/include/asm/percpu.h |  23 +++-
 arch/s390/include/asm/percpu.h      |   5
 arch/x86/include/asm/percpu.h       |  39 +++---
 include/asm-generic/percpu.h        |  17 ++
 include/linux/mmzone.h              |   8 -
 include/linux/percpu-defs.h         |   2
 include/linux/vmstat.h              |   2
 kernel/fork.c                       |   2
 kernel/scs.c                        |   2
 mm/page_alloc.c                     |   5
 mm/vmscan.c                         |   2
 mm/vmstat.c                         | 440 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------
 13 files changed, 361 insertions(+), 202 deletions(-)
This patch series addresses the following two problems:

1. A customer provided evidence indicating that a process
   was stalled in direct reclaim:

 - The process was trapped in throttle_direct_reclaim().
   The function wait_event_killable() was called to wait for the
   condition allow_direct_reclaim(pgdat) to become true for the current
   node. allow_direct_reclaim(pgdat) examined the number of free pages
   on the node via zone_page_state(), which simply returns the value in
   zone->vm_stat[NR_FREE_PAGES].

 - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
   However, the freelist on this node was not empty.

 - This inconsistency in the vmstat value was caused by the per-CPU
   vmstat counters on nohz_full cpus. Every increment/decrement of a
   vmstat counter is performed on the per-CPU vmstat counter first, and
   the pooled diffs are then accumulated into the zone's vmstat counter
   in a timely manner. However, on nohz_full cpus (in this customer's
   system, 48 of 52 cpus) those pooled diffs were never accumulated once
   a cpu had no further events and went idle indefinitely. I checked the
   per-CPU vmstat counters and found a total of 69 counts not yet
   accumulated into the zone's vmstat counter.

 - In this situation, kswapd did not help the trapped process.
   In pgdat_balanced(), zone_watermark_ok_safe() examined the number of
   free pages on the node via zone_page_state_snapshot(), which includes
   the pending counts on the per-CPU vmstat counters. kswapd therefore
   correctly saw the 69 free pages. Since zone->_watermark = {8, 20, 32},
   kswapd did not run because 69 was greater than the high watermark
   of 32.

2. With a task that busy loops on a given CPU,
   the kworker interruption to execute vmstat_update
   is undesired and may exceed latency thresholds
   for certain applications.

Both problems are fixed by having vmstat_shepherd flush the per-CPU
counters to the global counters from remote CPUs. This is done using
cmpxchg to manipulate the counters, both CPU-locally (via the account
functions) and remotely (via cpu_vm_stats_fold).
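To make the failure mode in problem 1 concrete, here is a minimal userspace model of the per-CPU diff scheme (illustrative only, not kernel code; all names, the 4-CPU layout, and the threshold of 32 are made up for the sketch): deltas land in a per-CPU diff first, only diffs over the threshold spill into the global counter, a fast read misses the pending diffs, and a snapshot read walks all CPUs.

```c
#include <assert.h>
#include <stdlib.h>

#define NCPUS		4
#define THRESHOLD	32

static long global_counter;	/* zone->vm_stat[NR_FREE_PAGES] analogue */
static long pcp_diff[NCPUS];	/* per-CPU vm_stat_diff analogue */

/* Account path: deltas land in the per-CPU diff first; only a diff that
 * exceeds the threshold is folded into the global counter. */
static void mod_state(int cpu, long delta)
{
	pcp_diff[cpu] += delta;
	if (labs(pcp_diff[cpu]) > THRESHOLD) {
		global_counter += pcp_diff[cpu];
		pcp_diff[cpu] = 0;
	}
}

/* zone_page_state analogue: fast, but blind to pending per-CPU diffs. */
static long page_state(void)
{
	return global_counter;
}

/* zone_page_state_snapshot analogue: walks all CPUs to add pending diffs. */
static long page_state_snapshot(void)
{
	long v = global_counter;

	for (int cpu = 0; cpu < NCPUS; cpu++)
		v += pcp_diff[cpu];
	return v;
}

/* cpu_vm_stats_fold analogue: the remote flush vmstat_shepherd performs. */
static void fold_all(void)
{
	for (int cpu = 0; cpu < NCPUS; cpu++) {
		global_counter += pcp_diff[cpu];
		pcp_diff[cpu] = 0;
	}
}
```

In this model, freeing 23 pages on one cpu and 46 on another leaves page_state() reporting 46 while page_state_snapshot() sees all 69, mirroring what allow_direct_reclaim() versus zone_watermark_ok_safe() observed in the bug report; after fold_all() the fast read is accurate again.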
Thanks to Aaron Tomlin for diagnosing issue 1 and writing the initial
patch series.

Performance details for the kworker interruption:

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the
oslat -> kworker -> oslat switches. In the case of a virtualized CPU,
with the vmstat_update interruption occurring in the host (on a qemu-kvm
vcpu), the latency penalty observed in the guest is higher than 50us,
violating the acceptable latency threshold for certain applications.

What follows is a summary of the arguments against this patchset and the
responses to them. Objections are numbered "O-x)", where x counts from 1,
and a line with "*** Response ***" precedes the response to each
numbered objection.

O-1) 'echo 1 > /proc/sys/vm/stat_refresh' achieves essentially the same
     without any kernel changes.

*** Response ***

The stat_refresh interface is not reliable, as it triggers
queue_work_on for all CPUs which have dirty per-CPU mm counters. So if
you have two threads that desire not to be interrupted, starting one of
the threads can interrupt an already executing and isolated thread.

O-2) Why not always loop through all CPUs when reading the counters?
     (That is, replace zone_page_state with zone_page_state_snapshot.)

*** Response ***

Consider zone_watermark_fast, called from get_page_from_freelist.
From https://lwn.net/Articles/684616/:

  On x86 systems, DMA is not usually the problem; instead, memory
  allocation is. He is working with a target of 14.8 million packets
  (full wire speed on a 10Gb/s link) per second; on a 3GHz system, that
  gives the kernel about 200 cycles in which to process each packet.
  Allocating a single page, though, takes 277 cycles on that system,
  making the 14.8Mpps rate unattainable.
  He pointed out Mel Gorman's recent work to reduce memory-allocator
  overhead; that work reduced the cost to 230 cycles, a significant
  improvement, but nowhere near enough.

O-3) Also vmstat already has a concept of silencing - i.e. quiet_vmstat.
     IIRC this is used by NOHZ.

*** Response ***

The quiet_vmstat mechanism is not reliable: it is used by the NOHZ code
to synchronize the per-CPU mm stats to the global counters (when
entering NOHZ mode). Any subsequent use of the per-CPU mm counters will
allow the vmstat shepherd to queue work (thereby waking a kworker
thread) on a CPU.

/*
 * Switch off vmstat processing and then fold all the remaining differentials
 * until the diffs stay at zero. The function is used by NOHZ and can only be
 * invoked when tick processing is not active.
 */
void quiet_vmstat(void)

O-4) The only applications of interest are those that do not enter the
     kernel.

> > > 2. With a task that busy loops on a given CPU,
> > > the kworker interruption to execute vmstat_update
> > > is undesired and may exceed latency thresholds
> > > for certain applications.
> >
> > Yes it can but why does that matter?
>
> It matters for the application that is executing and expects
> not to be interrupted.

Those workloads shouldn't enter the kernel in the first place, no?
Otherwise the in-kernel execution, with all the direct or indirect
dependencies (e.g. via locks), can throw any latency expectations out
the window.

*** Response ***

A common counterexample is latency-sensitive applications calling
sys_nanosleep (for example, cyclictest or PLC programs do that).

O-5) Why not implement a syscall to flush all per-CPU caches?

> The practical problem we have been seeing is -RT app initialization.
> For example:
>
> 1) mlock();
> 2) enter loop without system calls

OK, that is what I have kinda expected. Would have been better to
mention it explicitly.
I expect this to be a very common pattern and vmstat might not be the
only subsystem that could interfere later on. Would it make more sense
to address this by a more generic solution? E.g. a syscall to flush all
per-cpu caches so they won't interfere later unless userspace hits the
kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask,
int flags))? The above pattern could then be implemented as

	do_initial_setup()
	sched_setaffinity(getpid(), cpumask);
	flush_cpu_caches(cpumask, 0);
	do_userspace_loop()

*** Response ***

A special mode where flushing of caches is attempted up front has been
tried before: https://lwn.net/Articles/883940/

However, it has a number of drawbacks:

1) If the application is in kernel mode, the interruption will not be
   avoided (this patchset avoids the interruption even in kernel space).

2) It requires modification of applications.

3) Applications which attempt to use this mode in combination with
   periods of system calls will suffer performance slowdowns. For
   example:
   https://lore.kernel.org/linux-mm/87im9d4ezq.fsf@nanos.tec.linutronix.de/

   "In a real-world usecase we had the situation of compute bursts and
   an unfortunate hw enforced requirement to go into the kernel between
   them for synchronization between the compute threads and hardware
   (A quick hardware assisted save/load).

   Unmodified NOHZ full accumulated to more than 6% loss compared to a
   fully undisturbed run. Most of it was caused by cache effects and
   not by the actually used CPU time.

   A full enforced quiescing upfront gained ~2-3%, but a lazy approach
   of accepting that some stuff might happen once and does not happen
   again gained almost 5%. In that particular scenario 5% _is_ a huge
   win."

O-7) There is a lack of examples where this change is required.

*** Response ***

Example-1: MAC scheduler processing must occur every 1ms, and a certain
amount of computation takes place (and must finish before the next 1ms
timeframe).
A > 50us latency spike as observed by cyclictest is considered a
"failure" (see the kworker-interruption tracing data above).

Example-2: Based on the tracing data above, but a slightly different
use case: nearly every telco we work with for 5G RAN is demanding
< 20us CPU latency as measured by cyclictest and oslat. We cannot
achieve under 20us with the vmstat interruption.

Example-3: The 7us above was measured on recent Intel Xeon processors.
There are use cases on less powerful processors, such as embedded ARM
boards, where switching to a kworker and back is much more expensive
(causing problems for a larger range of applications). For example, 3D
printing:
https://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=2077&context=etd_projects

O-8) This is a general problem, therefore it requires a general
     solution.

> But let me repeat, this is not just about vmstats. Just have a look at
> other queue_work_on users. You do not want to handy pick each and every
> one and do so in the future as well.

*** Response ***

The problematic ones have been getting fixed for some time now. For
example:

commit 2de79ee27fdb52626ac4ac48ec6d8d52ba6f9047
Author: Paolo Abeni <pabeni@redhat.com>

    net: try to avoid unneeded backlog flush

    flush_all_backlogs() may cause deadlock on systems
    running processes with FIFO scheduling policy.
    The above is critical in -RT scenarios, where user-space
    specifically ensures that no network activity is scheduled on the
    CPU running the mentioned FIFO process, but still gets stuck.

    This commit tries to address the problem by checking the backlog
    status on the remote CPUs before scheduling the flush operation. If
    the backlog is empty, we can skip it.

    v1 -> v2:
     - explicitly clear flushed cpu mask - Eric

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

And this has been the normal process so far. I think what needs to be
done is to avoid new queue_work_on() users from being introduced in the
tree (the number of existing ones is finite and can therefore be
fixed).

I agree with the criticism here; however, I can't see options other
than the following:

1) Given an activity which contains a sequence of instructions to
   execute on a CPU, change the algorithm to execute that code remotely
   (thereby avoiding interrupting the CPU), or avoid the interruption
   somehow (which must be dealt with on a case-by-case basis).

2) Block that activity from happening in the first place, for the sites
   where it can be blocked (those that can return errors to userspace,
   for example).

3) Completely isolate the CPU from the kernel (off-line it).

Working on patches to improve #2.
v8:
- Add summary of discussion on -v7 to cover letter
- Rebase

v7:
- Fix allow_direct_reclaim issue by using zone_page_state_snapshot
  (Michal Hocko)

v6:
- Add more information on the throttle_direct_reclaim problem to commit
  logs (Michal Hocko)

v5:
- Drop "mm/vmstat: remove remote node draining" (Vlastimil Babka)
- Implement remote node draining for cpu_vm_stats_fold (Vlastimil Babka)

v4:
- Switch per-CPU vmstat counters to s32, as required by the RISC-V and
  ARC architectures

v3:
- Remove unused drain_zone_pages and changes variable (David Hildenbrand)
- Use xchg instead of cmpxchg in refresh_cpu_vm_stats (Peter Xu)
- Add drain_all_pages to vmstat_refresh to make stats more accurate
  (Peter Xu)
- Improve changelog of "mm/vmstat: switch counter modification to
  cmpxchg" (Peter Xu / David)
- Improve changelog of "mm/vmstat: remove remote node draining"
  (David Hildenbrand)

v2:
- Actually use LOCK CMPXCHG on counter mod/inc/dec functions
  (Christoph Lameter)
- Use try_cmpxchg for cmpxchg loops (Uros Bizjak / Matthew Wilcox)
[Sorry for a late response but I was conferencing last two weeks and now
catching up]

On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
[...]
> v8
> - Add summary of discussion on -v7 to cover letter

Thanks, this is very useful! This helps to frame the further
discussion.

I believe the most important question to answer is in fact this:

> I think what needs to be done is to avoid new queue_work_on()
> users from being introduced in the tree (the number of
> existing ones is finite and can therefore be fixed).
>
> I agree with the criticism here; however, I can't see options other
> than the following:
>
> 1) Given an activity which contains a sequence of instructions to
> execute on a CPU, change the algorithm to execute that code remotely
> (thereby avoiding interrupting the CPU), or avoid the interruption
> somehow (which must be dealt with on a case-by-case basis).
>
> 2) Block that activity from happening in the first place, for the
> sites where it can be blocked (those that can return errors to
> userspace, for example).
>
> 3) Completely isolate the CPU from the kernel (off-line it).

I agree that a reliable cpu isolation implementation needs to address
the queue_work_on problem. And it has to do that _reliably_. This
cannot be achieved by an endless whack-a-mole, chasing each new
instance. There must be a more systematic approach. One way would be to
change the semantics of schedule_work_on and fail the call for an
isolated CPU. The caller would have a way to fall back and handle the
operation by other means. E.g. vmstat could simply ignore folding pcp
data because an imprecision shouldn't really matter. Other callers
might choose to do the operation remotely. This is a lot of work, no
doubt about that, but it is a long-term maintainable solution that
doesn't give you new surprises with any new released kernel. There are
likely other remote interfaces that would need to follow that scheme.
If the cpu isolation is not planned to be worth that time investment,
then I do not think it is worth degrading the highly optimized vmstat
code either. These stats are invoked from many hot paths and the
per-cpu implementation has been optimized for that case. If your
workload would like to avoid the disturbance, then you already have the
quiet_vmstat precedent, so find a way to use it for your workload
instead.
--
Michal Hocko
SUSE Labs
On Wed, May 24, 2023 at 02:51:55PM +0200, Michal Hocko wrote:
> [Sorry for a late response but I was conferencing last two weeks and
> now catching up]
>
> On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
> [...]
> > v8
> > - Add summary of discussion on -v7 to cover letter
>
> Thanks, this is very useful! This helps to frame the further
> discussion.
>
> I believe the most important question to answer is in fact this:
>
> > I think what needs to be done is to avoid new queue_work_on()
> > users from being introduced in the tree (the number of
> > existing ones is finite and can therefore be fixed).
> >
> > I agree with the criticism here; however, I can't see options other
> > than the following:
> >
> > 1) Given an activity which contains a sequence of instructions to
> > execute on a CPU, change the algorithm to execute that code remotely
> > (thereby avoiding interrupting the CPU), or avoid the interruption
> > somehow (which must be dealt with on a case-by-case basis).
> >
> > 2) Block that activity from happening in the first place, for the
> > sites where it can be blocked (those that can return errors to
> > userspace, for example).
> >
> > 3) Completely isolate the CPU from the kernel (off-line it).
>
> I agree that a reliable cpu isolation implementation needs to address
> the queue_work_on problem. And it has to do that _reliably_. This
> cannot be achieved by an endless whack-a-mole, chasing each new
> instance. There must be a more systematic approach. One way would be
> to change the semantics of schedule_work_on and fail the call for an
> isolated CPU. The caller would have a way to fall back and handle the
> operation by other means. E.g. vmstat could simply ignore folding pcp
> data because an imprecision shouldn't really matter. Other callers
> might choose to do the operation remotely. This is a lot of work, no
> doubt about that, but it is a long-term maintainable solution that
> doesn't give you new surprises with any new released kernel.
> There are likely other remote interfaces
> that would need to follow that scheme.
>
> If the cpu isolation is not planned to be worth that time investment,
> then I do not think it is worth degrading the highly optimized vmstat
> code either. These stats are invoked from many hot paths and the
> per-cpu implementation has been optimized for that case.

It is exactly the same code, but now with a "LOCK" prefix for the
CMPXCHG instruction. That should not cost much due to cache locking
(these are per-CPU variables anyway).

> If your workload would like to avoid the disturbance, then you already
> have the quiet_vmstat precedent, so find a way to use it for your
> workload instead.
>
> --
> Michal Hocko
> SUSE Labs

OK, so an alternative solution is to completely disable vmstat updates
for isolated CPUs. Are you OK with that?
On Wed 24-05-23 10:53:23, Marcelo Tosatti wrote:
> On Wed, May 24, 2023 at 02:51:55PM +0200, Michal Hocko wrote:
> > [Sorry for a late response but I was conferencing last two weeks and
> > now catching up]
> >
> > On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
> > [...]
> > > v8
> > > - Add summary of discussion on -v7 to cover letter
> >
> > Thanks, this is very useful! This helps to frame the further
> > discussion.
> >
> > I believe the most important question to answer is in fact this:
> >
> > > I think what needs to be done is to avoid new queue_work_on()
> > > users from being introduced in the tree (the number of
> > > existing ones is finite and can therefore be fixed).
> > >
> > > I agree with the criticism here; however, I can't see options
> > > other than the following:
> > >
> > > 1) Given an activity which contains a sequence of instructions to
> > > execute on a CPU, change the algorithm to execute that code
> > > remotely (thereby avoiding interrupting the CPU), or avoid the
> > > interruption somehow (which must be dealt with on a case-by-case
> > > basis).
> > >
> > > 2) Block that activity from happening in the first place, for the
> > > sites where it can be blocked (those that can return errors to
> > > userspace, for example).
> > >
> > > 3) Completely isolate the CPU from the kernel (off-line it).
> >
> > I agree that a reliable cpu isolation implementation needs to
> > address the queue_work_on problem. And it has to do that _reliably_.
> > This cannot be achieved by an endless whack-a-mole, chasing each new
> > instance. There must be a more systematic approach. One way would be
> > to change the semantics of schedule_work_on and fail the call for an
> > isolated CPU. The caller would have a way to fall back and handle
> > the operation by other means. E.g. vmstat could simply ignore
> > folding pcp data because an imprecision shouldn't really matter.
> > Other callers might choose to do the operation remotely.
> > This is a lot of work, no doubt about that, but it is a long-term
> > maintainable solution that doesn't give you new surprises with any
> > new released kernel. There are likely other remote interfaces that
> > would need to follow that scheme.
> >
> > If the cpu isolation is not planned to be worth that time
> > investment, then I do not think it is worth degrading the highly
> > optimized vmstat code either. These stats are invoked from many hot
> > paths and the per-cpu implementation has been optimized for that
> > case.
>
> It is exactly the same code, but now with a "LOCK" prefix for the
> CMPXCHG instruction. That should not cost much due to cache locking
> (these are per-CPU variables anyway).

Sorry, but "just a LOCK prefix" on a hot path is not a serious
argument.

> > If your workload would like to avoid the disturbance, then you
> > already have the quiet_vmstat precedent, so find a way to use it for
> > your workload instead.
>
> OK, so an alternative solution is to completely disable vmstat updates
> for isolated CPUs. Are you OK with that?

Yes, the number of events should be reasonably small, and those places
in the kernel which really need a precise value need to do a per-cpu
walk anyway. IIRC /proc/vmstat et al. also accumulate pcp state.

But let me reiterate: even with vmstat updates out of the game, you
have so many other sources of disruption that your isolated workload
will be fragile until you actually try to deal with the problem on a
more fundamental level.
--
Michal Hocko
SUSE Labs
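[Editor's note] The fallback semantic proposed above can be sketched with a small userspace model (hypothetical throughout: the kernel's schedule_work_on() does not refuse isolated CPUs today, and the names and return convention below are made up for illustration). The point is only the shape of the contract: the queueing primitive fails for an isolated CPU, and the caller picks a fallback such as tolerating stale pcp data.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4

/* Hypothetical model, not kernel code. */
static bool cpu_isolated[NCPUS];
static int work_queued[NCPUS];	/* would wake a kworker on that CPU */
static int skipped_folds;	/* vmstat's fallback: accept imprecision */

/* Proposed semantic: refuse to disturb an isolated CPU; the caller must
 * handle the failure by other means. */
static int model_schedule_work_on(int cpu)
{
	if (cpu_isolated[cpu])
		return -1;
	work_queued[cpu]++;
	return 0;
}

/* vmstat_shepherd analogue: fold via a work item where allowed, and simply
 * skip isolated CPUs, leaving their per-CPU diffs pending. */
static void shepherd_tick(void)
{
	for (int cpu = 0; cpu < NCPUS; cpu++)
		if (model_schedule_work_on(cpu) != 0)
			skipped_folds++;
}
```

Other callers might instead take the failure as a cue to perform the operation remotely rather than skip it; the contract only guarantees the isolated CPU is never interrupted.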
The patchset still modifies the semantics of the this_cpu operations,
replacing the lockless RMW operations with locked ones. One of the
rationales for the use of this_cpu operations is their efficiency,
since locked RMW atomics are avoided. This patchset destroys that
functionality.

If you want locked RMW semantics then use them through cmpxchg() and
friends. Do not modify the this_cpu operations by changing the
implementation in the arch code.
Hi Christoph,

On Tue, May 16, 2023 at 10:09:02AM +0200, Christoph Lameter wrote:
> The patchset still modifies the semantics of the this_cpu operations,
> replacing the lockless RMW operations with locked ones.

It does that to follow the pre-existing kernel convention:

	function-name		LOCK prefix
	cmpxchg			YES
	cmpxchg_local		NO

So the patchset introduces:

	function-name			LOCK prefix
	this_cpu_cmpxchg		YES
	this_cpu_cmpxchg_local		NO

> One of the rationales for the use of this_cpu operations is their
> efficiency, since locked RMW atomics are avoided.

And there is the freedom to choose between this_cpu_cmpxchg and
this_cpu_cmpxchg_local (depending on the intended usage).

> This patchset destroys that functionality.

Patch 6 is:

  [PATCH v8 06/13] add this_cpu_cmpxchg_local and asm-generic definitions

which adds this_cpu_cmpxchg_local.

Patch 7 converts all other this_cpu_cmpxchg users (except the vmstat
ones):

  [PATCH v8 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local

So the non-LOCK'ed behaviour is maintained for existing users.

> If you want locked RMW semantics then use them through cmpxchg() and
> friends. Do not modify the this_cpu operations by changing the
> implementation in the arch code.

But then it would be necessary to disable preemption here:

static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
				  long delta, int overstep_mode)
{
	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
	s32 __percpu *p = pcp->vm_stat_diff + item;
	long o, n, t, z;

	do {
		z = 0;	/* overflow to zone counters */

		/*
		 * The fetching of the stat_threshold is racy. We may apply
		 * a counter threshold to the wrong cpu if we get
		 * rescheduled while executing here. However, the next
		 * counter update will apply the threshold again and
		 * therefore bring the counter under the threshold again.
		 *
		 * Most of the time the thresholds are the same anyways
		 * for all cpus in a zone.
		 */
		t = this_cpu_read(pcp->stat_threshold);
		o = this_cpu_read(*p);
		n = delta + o;

		if (abs(n) > t) {
			int os = overstep_mode * (t >> 1);

			/* Overflow must be added to zone counters */
			z = n + os;
			n = -os;
		}
	} while (this_cpu_cmpxchg(*p, o, n) != o);
		 ^^^^^^^^^^^^^^^^^^^^^^^^

	if (z)
		zone_page_state_add(z, zone, item);
}

Earlier you objected to disabling preemption on this codepath (which is
what led to this patchset in the first place):

"Using preemption is a way to make this work correctly. However, doing
so would sacrifice the performance, low impact and the scalability of
the vm counters."

So it seems a locked this_cpu function which does a locked cmpxchg is
desired. Perhaps you disagree with the
this_cpu_cmpxchg_local/this_cpu_cmpxchg naming?
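[Editor's note] The naming distinction under discussion can be modeled in userspace C11 (illustrative only; the kernel's this_cpu_cmpxchg operates on per-CPU variables, where the `_local` variant is safe against local preemption/interrupts but not against remote writers, a property plain C cannot fully capture). Both return the value actually found, like cmpxchg():

```c
#include <assert.h>
#include <stdatomic.h>

/* "LOCK" variant: safe against concurrent writers on other CPUs, which is
 * what allows vmstat_shepherd to fold a counter remotely. */
static long model_cmpxchg(_Atomic long *p, long old, long new)
{
	/* On failure, C11 writes the observed value back into 'old'. */
	atomic_compare_exchange_strong(p, &old, new);
	return old;
}

/* "_local" variant: a plain read-modify-write; only correct when no other
 * CPU can touch *p concurrently (the historic this_cpu_cmpxchg behaviour
 * that patches 6 and 7 preserve under the _local name). */
static long model_cmpxchg_local(long *p, long old, long new)
{
	long cur = *p;

	if (cur == old)
		*p = new;
	return cur;
}
```

The series' claim, in these terms: existing users keep the cheap model_cmpxchg_local semantics, while the vmstat counters move to model_cmpxchg so that a remote fold and a local update can race safely.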