Most memcg flushing contexts use "unified" flushing, where only one
flusher is allowed at a time (others skip), and all flushers need to
flush the entire tree. This works well with high concurrency, which
mostly comes from in-kernel flushers (e.g. reclaim, refault, ...). For
userspace reads, unified flushing leads to non-deterministic stats
staleness and reading cost.

This series clarifies and documents the differences between unified
and non-unified flushing (patches 1 & 2), then opts userspace reads
out of unified flushing (patch 3).

This patch series is a follow up on the discussion in [1]. That was a
patch proposing that userspace reads wait for ongoing unified flushers
to complete before returning. There were concerns about the latency
that this introduces to userspace reads, especially with ongoing
reports of expensive stat reads even with unified flushing [2]. Hence,
this series follows a different approach, by opting userspace reads
out of unified flushing completely.

The cost of userspace reads is now deterministic, and depends on the
size of the subtree being read. This should fix both the *sometimes*
expensive reads (due to flushing the entire tree) and the occasional
staleness (due to skipping flushing).

I attempted to remove unified flushing completely, but noticed
regressions with in-kernel flushers under high concurrency (e.g.
hundreds of concurrent reclaimers). This sort of concurrency is not
expected from userspace reads. More details about testing and some
numbers are in the last patch's changelog.

[1] https://lore.kernel.org/lkml/20230809045810.1659356-1-yosryahmed@google.com/
[2] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

Yosry Ahmed (3):
  mm: memcg: properly name and document unified stats flushing
  mm: memcg: add a helper for non-unified stats flushing
  mm: memcg: use non-unified stats flushing for userspace reads

 include/linux/memcontrol.h |  8 ++---
 mm/memcontrol.c            | 74 +++++++++++++++++++++++++++-----------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  4 +--
 4 files changed, 60 insertions(+), 28 deletions(-)

--
2.42.0.rc1.204.g551eb34607-goog
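As a rough illustration of the two flushing modes described in the
cover letter, here is a minimal standalone userspace C model. The
names (unified_flush, flush_subtree) and the atomic gate are
illustrative sketches of the described behavior, not the actual
mm/memcontrol.c symbols or logic:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative model only -- not the kernel implementation. */
static atomic_bool flush_ongoing;

/* Non-unified: aggregate stats for the requested subtree only.
 * Cost is proportional to the subtree size -- deterministic. */
static void flush_subtree(const char *cgrp)
{
	printf("flushing subtree of %s\n", cgrp);
}

/* Unified: only one flusher runs at a time; concurrent callers skip
 * and may return stats that are somewhat stale. */
static void unified_flush(void)
{
	if (atomic_exchange(&flush_ongoing, true))
		return;			/* someone else is flushing; skip */

	flush_subtree("/");		/* unified flushers flush the whole tree */

	atomic_store(&flush_ongoing, false);
}

int main(void)
{
	unified_flush();		/* in-kernel paths: reclaim, refault, ... */
	flush_subtree("/workload");	/* a userspace read after this series */
	return 0;
}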
Hello.
On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> For userspace reads, unified flushing leads to non-deterministic stats
> staleness and reading cost.
I only skimmed the previous threads, but I don't remember whether it
was resolved:
a) periodic flushing was too sparse for userspace readers (i.e. a 2s
delay is too much [1]), or
b) periodic flushing didn't keep up (i.e. a full tree flush can
occasionally take more than 2s), leading to extra staleness?
[1] Assuming that the nr_cpus*MEMCG_CHARGE_BATCH error bound is also
too much for userspace readers, correct?
> The cost of userspace reads is now deterministic, and depends on the
> size of the subtree being read. This should fix both the *sometimes*
> expensive reads (due to flushing the entire tree) and the occasional
> staleness (due to skipping flushing).
This is nice, thanks to the removal of atomic flushing in commit
0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"). I think
flushing in smaller chunks with yielding could be universally better
(last words :-).
Thanks,
Michal
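A minimal sketch of the "smaller chunks with yielding" idea suggested
above (an illustrative userspace model: sched_yield() stands in for
the kernel's cond_resched(), and the child list is hypothetical):

#include <sched.h>
#include <stdio.h>

static const char *children[] = { "/a", "/b", "/c" };

static void flush_subtree(const char *cgrp)
{
	printf("flushing subtree of %s\n", cgrp);
}

/* Instead of one long whole-tree flush, flush one child subtree at a
 * time and yield between chunks so other work can run in between. */
static void flush_tree_in_chunks(void)
{
	for (unsigned int i = 0; i < sizeof(children) / sizeof(children[0]); i++) {
		flush_subtree(children[i]);
		sched_yield();
	}
}

int main(void)
{
	flush_tree_in_chunks();
	return 0;
}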
On Tue, Aug 22, 2023 at 6:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > For userspace reads, unified flushing leads to non-deterministic stats
> > staleness and reading cost.
>
> I only skimmed the previous threads, but I don't remember whether it
> was resolved:
> a) periodic flushing was too sparse for userspace readers (i.e. a 2s
> delay is too much [1]), or
> b) periodic flushing didn't keep up (i.e. a full tree flush can
> occasionally take more than 2s), leading to extra staleness?
The idea is that stats that are anywhere between 0 and 2 seconds
stale, especially when compared to other values that are read fresh
(such as memory.usage_in_bytes), can give an inconsistent and stale
view of the system. For some readers, 2s is too long (we have readers
that read every second). A time-only bound is also scary because on
large systems a lot can happen in a second. There will always be some
delay, but ideally we want to minimize it.

I also think 2s is not a strict bound (e.g. a flush may be taking a
long time, or a flush started but the cgroup you care about hasn't
been flushed yet, etc.).
There is also the problem of the variable cost of reading.
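To make the staleness window concrete, here is a minimal standalone C
model (hypothetical names, not kernel code) of a periodic flusher that
publishes a snapshot every 2 seconds while updates keep flowing; a
reader can observe anything from fresh data to data a full period old:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define FLUSH_PERIOD 2			/* seconds, like the 2s flush above */

static atomic_long live_counter;	/* updated continuously */
static atomic_long flushed_snapshot;	/* what readers actually see */

static void *flusher(void *arg)
{
	(void)arg;
	for (int i = 0; i < 3; i++) {
		sleep(FLUSH_PERIOD);
		/* "flush": publish the current value */
		atomic_store(&flushed_snapshot, atomic_load(&live_counter));
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, flusher, NULL);
	for (int i = 0; i < 6; i++) {
		atomic_fetch_add(&live_counter, 1000);	/* ongoing updates */
		printf("live=%ld seen=%ld (staleness up to %ds + flush time)\n",
		       atomic_load(&live_counter),
		       atomic_load(&flushed_snapshot), FLUSH_PERIOD);
		sleep(1);
	}
	pthread_join(t, NULL);
	return 0;
}

(Build with -pthread.)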
>
> [1] Assuming that the nr_cpus*MEMCG_CHARGE_BATCH error bound is also
> too much for userspace readers, correct?
I can't tell for sure, to be honest, but given the continuously
increasing number of CPUs on modern systems, and the fact that the
nr_cpus*MEMCG_CHARGE_BATCH error can be localized in one cgroup or
spread across the hierarchy, I think it's better to drop it for
userspace reads if possible.
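For a sense of scale, assuming MEMCG_CHARGE_BATCH is 64 (its value in
the kernels discussed here) and, purely for illustration, that each
pending event accounts a 4 KiB page:

#include <stdio.h>

#define MEMCG_CHARGE_BATCH 64	/* per-CPU batch before stats propagate */

int main(void)
{
	/* Worst case: every CPU holds a full batch of pending updates,
	 * and nothing prevents all of them landing in a single cgroup. */
	for (int nr_cpus = 64; nr_cpus <= 512; nr_cpus *= 2)
		printf("%3d CPUs -> up to %6d pending events (~%ld MiB at 4 KiB/page)\n",
		       nr_cpus, nr_cpus * MEMCG_CHARGE_BATCH,
		       (long)nr_cpus * MEMCG_CHARGE_BATCH * 4096 / (1024 * 1024));
	return 0;
}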
>
> > The cost of userspace reads is now deterministic, and depends on the
> > size of the subtree being read. This should fix both the *sometimes*
> > expensive reads (due to flushing the entire tree) and the occasional
> > staleness (due to skipping flushing).
>
> This is nice, thanks to the removal of atomic flushing in commit
> 0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"). I think
> flushing in smaller chunks with yielding could be universally better
> (last words :-).
I was hoping we could remove unified flushing completely, but stress
testing with hundreds of concurrent reclaimers shows significant
regressions. Maybe it doesn't matter in practice, but I don't want to
pull that trigger :)
FWIW, with unified flushing we are getting great concurrency for
in-kernel flushers like reclaim or refault, but at the expense of
stats staleness. I really wonder what hidden cost we are paying due to
the stale stats. I would imagine any problems that manifest from this
would be difficult to tie back to the stale stats.
>
> Thanks,
> Michal
>