Most memcg flushing contexts use "unified" flushing, where only one
flusher is allowed at a time (others skip), and all flushers need to
flush the entire tree. This works well with high concurrency, which
mostly comes from in-kernel flushers (e.g. reclaim, refault, ...). For
userspace reads, unified flushing leads to non-deterministic stats
staleness and reading cost.

This series clarifies and documents the differences between unified
and non-unified flushing (patches 1 & 2), then opts userspace reads
out of unified flushing (patch 3).

This patch series is a follow up on the discussion in [1]. That was a
patch proposing that userspace reads wait for ongoing unified flushers
to complete before returning. There were concerns about the latency
that this introduces to userspace reads, especially with ongoing
reports of expensive stat reads even with unified flushing [2]. Hence,
this series follows a different approach, by opting userspace reads
out of unified flushing completely.

The cost of userspace reads is now deterministic, and depends on the
size of the subtree being read. This should fix both the *sometimes*
expensive reads (due to flushing the entire tree) and the occasional
staleness (due to skipping flushing).

I attempted to remove unified flushing completely, but noticed
regressions with in-kernel flushers under high concurrency (e.g.
hundreds of concurrent reclaimers). This sort of concurrency is not
expected from userspace reads. More details about testing and some
numbers are in the last patch's changelog.

[1] https://lore.kernel.org/lkml/20230809045810.1659356-1-yosryahmed@google.com/
[2] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

Yosry Ahmed (3):
  mm: memcg: properly name and document unified stats flushing
  mm: memcg: add a helper for non-unified stats flushing
  mm: memcg: use non-unified stats flushing for userspace reads

 include/linux/memcontrol.h |  8 ++---
 mm/memcontrol.c            | 74 +++++++++++++++++++++++++++-----------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  4 +--
 4 files changed, 60 insertions(+), 28 deletions(-)

--
2.42.0.rc1.204.g551eb34607-goog
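As a rough illustration of the two flushing modes described in the
cover letter, here is a minimal standalone userspace C model. The
names (unified_flush, flush_subtree) and the atomic gate are
illustrative sketches of the described behavior, not the actual
mm/memcontrol.c symbols or logic:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative model only -- not the kernel implementation. */
static atomic_bool flush_ongoing;

/* Non-unified: aggregate stats for the requested subtree only.
 * Cost is proportional to the subtree size -- deterministic. */
static void flush_subtree(const char *cgrp)
{
	printf("flushing subtree of %s\n", cgrp);
}

/* Unified: only one flusher runs at a time; concurrent callers skip
 * and may return stats that are somewhat stale. */
static void unified_flush(void)
{
	if (atomic_exchange(&flush_ongoing, true))
		return;			/* someone else is flushing; skip */

	flush_subtree("/");		/* unified flushers flush the whole tree */

	atomic_store(&flush_ongoing, false);
}

int main(void)
{
	unified_flush();		/* in-kernel paths: reclaim, refault, ... */
	flush_subtree("/workload");	/* a userspace read after this series */
	return 0;
}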
Hello.
On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> For userspace reads, unified flushing leads to non-deterministic stats
> staleness and reading cost.
I only skimmed the previous threads, but I don't remember whether it
was resolved:
a) periodic flushing was too sparse for userspace readers (i.e. a 2s
delay is too much [1]), or
b) periodic flushing didn't keep up (i.e. a full tree flush can
occasionally take more than 2s), leading to extra staleness?
[1] Assuming that the nr_cpus*MEMCG_CHARGE_BATCH error bound is also
too much for userspace readers, correct?
> The cost of userspace reads is now deterministic, and depends on the
> size of the subtree being read. This should fix both the *sometimes*
> expensive reads (due to flushing the entire tree) and the occasional
> staleness (due to skipping flushing).
This is nice, thanks to the removal of atomic flushing in commit
0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"). I think
flushing in smaller chunks with yielding could be universally better
(last words :-).
Thanks,
Michal
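A minimal sketch of the "smaller chunks with yielding" idea suggested
above (an illustrative userspace model: sched_yield() stands in for
the kernel's cond_resched(), and the child list is hypothetical):

#include <sched.h>
#include <stdio.h>

static const char *children[] = { "/a", "/b", "/c" };

static void flush_subtree(const char *cgrp)
{
	printf("flushing subtree of %s\n", cgrp);
}

/* Instead of one long whole-tree flush, flush one child subtree at a
 * time and yield between chunks so other work can run in between. */
static void flush_tree_in_chunks(void)
{
	for (unsigned int i = 0; i < sizeof(children) / sizeof(children[0]); i++) {
		flush_subtree(children[i]);
		sched_yield();
	}
}

int main(void)
{
	flush_tree_in_chunks();
	return 0;
}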
On Tue, Aug 22, 2023 at 6:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > For userspace reads, unified flushing leads to non-deterministic stats
> > staleness and reading cost.
>
> I only skimmed the previous threads, but I don't remember whether it
> was resolved:
> a) periodic flushing was too sparse for userspace readers (i.e. a 2s
> delay is too much [1]), or
> b) periodic flushing didn't keep up (i.e. a full tree flush can
> occasionally take more than 2s), leading to extra staleness?
The idea is that stats that are anywhere between 0 and 2 seconds
stale, especially when compared to other values that are read fresh
(such as memory.usage_in_bytes), can give an inconsistent and stale
view of the system. For some readers, 2s is too long (we have readers
that read every second). A time-only bound is also scary because on
large systems a lot can happen in a second. There will always be some
delay, but ideally we want to minimize it.

I also think 2s is not a strict bound (e.g. a flush may be taking a
long time, or a flush started but the cgroup you care about hasn't
been flushed yet, etc.).
There is also the problem of the variable cost of reading.
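To make the staleness window concrete, here is a minimal standalone C
model (hypothetical names, not kernel code) of a periodic flusher that
publishes a snapshot every 2 seconds while updates keep flowing; a
reader can observe anything from fresh data to data a full period old:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define FLUSH_PERIOD 2			/* seconds, like the 2s flush above */

static atomic_long live_counter;	/* updated continuously */
static atomic_long flushed_snapshot;	/* what readers actually see */

static void *flusher(void *arg)
{
	(void)arg;
	for (int i = 0; i < 3; i++) {
		sleep(FLUSH_PERIOD);
		/* "flush": publish the current value */
		atomic_store(&flushed_snapshot, atomic_load(&live_counter));
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, flusher, NULL);
	for (int i = 0; i < 6; i++) {
		atomic_fetch_add(&live_counter, 1000);	/* ongoing updates */
		printf("live=%ld seen=%ld (staleness up to %ds + flush time)\n",
		       atomic_load(&live_counter),
		       atomic_load(&flushed_snapshot), FLUSH_PERIOD);
		sleep(1);
	}
	pthread_join(t, NULL);
	return 0;
}

(Build with -pthread.)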
>
> [1] Assuming that the nr_cpus*MEMCG_CHARGE_BATCH error bound is also
> too much for userspace readers, correct?
I can't tell for sure, to be honest, but given the continuously
increasing number of CPUs on modern systems, and the fact that the
nr_cpus*MEMCG_CHARGE_BATCH error can be localized in one cgroup or
spread across the hierarchy, I think it's better to drop it for
userspace reads if possible.
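For a sense of scale, assuming MEMCG_CHARGE_BATCH is 64 (its value in
the kernels discussed here) and, purely for illustration, that each
pending event accounts a 4 KiB page:

#include <stdio.h>

#define MEMCG_CHARGE_BATCH 64	/* per-CPU batch before stats propagate */

int main(void)
{
	/* Worst case: every CPU holds a full batch of pending updates,
	 * and nothing prevents all of them landing in a single cgroup. */
	for (int nr_cpus = 64; nr_cpus <= 512; nr_cpus *= 2)
		printf("%3d CPUs -> up to %6d pending events (~%ld MiB at 4 KiB/page)\n",
		       nr_cpus, nr_cpus * MEMCG_CHARGE_BATCH,
		       (long)nr_cpus * MEMCG_CHARGE_BATCH * 4096 / (1024 * 1024));
	return 0;
}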
>
> > The cost of userspace reads is now deterministic, and depends on the
> > size of the subtree being read. This should fix both the *sometimes*
> > expensive reads (due to flushing the entire tree) and the occasional
> > staleness (due to skipping flushing).
>
> This is nice, thanks to the removal of atomic flushing in commit
> 0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"). I think
> flushing in smaller chunks with yielding could be universally better
> (last words :-).
I was hoping we could remove unified flushing completely, but stress
testing with hundreds of concurrent reclaimers shows significant
regressions. Maybe it doesn't matter in practice, but I don't want to
pull that trigger :)
FWIW, with unified flushing we are getting great concurrency for
in-kernel flushers like reclaim or refault, but at the expense of
stats staleness. I really wonder what hidden cost we are paying due to
the stale stats. I would imagine any problems that manifest from this
would be difficult to tie back to the stale stats.
>
> Thanks,
> Michal
>