From: Kairui Song <kasong@tencent.com>
commit ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size")
introduced a sanity check to catch memcg counter underflow, which is
more a workaround for another bug: lru_zone_size is unsigned, so an
underflow wraps it around to an enormously large number, and the memcg
shrinker then loops almost forever because the calculated number of
folios to shrink is huge. That commit also checks whether a zero value
matches an empty LRU list, so we have to hold the LRU lock and do the
counter addition differently depending on whether nr_pages is negative.
But a later commit b4536f0c829c ("mm, memcg: fix the active list aging
for lowmem requests when memcg is enabled") already removed the LRU
emptiness check, so doing the addition differently is meaningless now.
And if we just turn the counter into an atomic long, underflow isn't a
big issue either and can be checked on the reader side. The reader side
is called much less frequently than the updater.
So let's turn the counter into an atomic long and check on the reader
side instead, which has a smaller overhead. Use an atomic to avoid
potential locking issues. The underflow correction is removed, which
should be fine: if the LRU size counter is leaking massively, something
else has likely gone very wrong too, and one should fix that leaking
site instead.
For now, still keep the LRU lock context. In theory it could be removed
too since the update is atomic, if we can tolerate a temporarily
inaccurate reading, but currently there is no benefit in doing so.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/memcontrol.h | 9 +++++++--
mm/memcontrol.c | 18 +-----------------
2 files changed, 8 insertions(+), 19 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0651865a4564..197f48faa8ba 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -112,7 +112,7 @@ struct mem_cgroup_per_node {
/* Fields which get updated often at the end. */
struct lruvec lruvec;
CACHELINE_PADDING(_pad2_);
- unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
+ atomic_long_t lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
struct mem_cgroup_reclaim_iter iter;
#ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
@@ -903,10 +903,15 @@ static inline
unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
enum lru_list lru, int zone_idx)
{
+ long val;
struct mem_cgroup_per_node *mz;
mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
+ val = atomic_long_read(&mz->lru_zone_size[zone_idx][lru]);
+ if (WARN_ON_ONCE(val < 0))
+ return 0;
+
+ return val;
}
void __mem_cgroup_handle_over_high(gfp_t gfp_mask);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9b07db2cb232..d5da09fbe43e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1273,28 +1273,12 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
int zid, int nr_pages)
{
struct mem_cgroup_per_node *mz;
- unsigned long *lru_size;
- long size;
if (mem_cgroup_disabled())
return;
mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- lru_size = &mz->lru_zone_size[zid][lru];
-
- if (nr_pages < 0)
- *lru_size += nr_pages;
-
- size = *lru_size;
- if (WARN_ONCE(size < 0,
- "%s(%p, %d, %d): lru_size %ld\n",
- __func__, lruvec, lru, nr_pages, size)) {
- VM_BUG_ON(1);
- *lru_size = 0;
- }
-
- if (nr_pages > 0)
- *lru_size += nr_pages;
+ atomic_long_add(nr_pages, &mz->lru_zone_size[zid][lru]);
}
/**
---
base-commit: 1ef4e3be45a85a103a667cc39fd68c3826e6acb9
change-id: 20251211-mz-lru-size-cleanup-c81deccfd5d7
Best regards,
--
Kairui Song <kasong@tencent.com>
On 2025/12/15 14:45, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> commit ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size")
> introduced a sanity check to catch memcg counter underflow, which is
> more like a workaround for another bug: lru_zone_size is unsigned, so
> underflow will wrap it around and return an enormously large number,
> then the memcg shrinker will loop almost forever as the calculated
> number of folios to shrink is huge. That commit also checks if a zero
> value matches the empty LRU list, so we have to hold the LRU lock, and
> do the counter adding differently depending on whether the nr_pages is
> negative.
>
> But later commit b4536f0c829c ("mm, memcg: fix the active list aging for
> lowmem requests when memcg is enabled") already removed the LRU
> emptiness check, doing the adding differently is meaningless now. And if
Agree.
I did submit a patch to address that, but it was not accepted.
For reference, here is the link to the discussion:
https://lore.kernel.org/lkml/CAOUHufbCCkOBGcSPZqNY+FXcrH8+U7_nRvftzOzKUBS4hn+kuQ@mail.gmail.com/
> we just turn it into an atomic long, underflow isn't a big issue either,
> and can be checked at the reader side. The reader size is much less
> frequently called than the updater.
>
> So let's turn the counter into an atomic long and check at the
> reader side instead, which has a smaller overhead. Use atomic to avoid
> potential locking issue. The underflow correction is removed, which
> should be fine as if there is a mass leaking of the LRU size counter,
> something else may also have gone very wrong, and one should fix that
> leaking site instead.
>
> For now still keep the LRU lock context, in thoery that can be removed
> too since the update is atomic, if we can tolerate a temporary
> inaccurate reading, but currently there is no benefit doing so yet.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> include/linux/memcontrol.h | 9 +++++++--
> mm/memcontrol.c | 18 +-----------------
> 2 files changed, 8 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0651865a4564..197f48faa8ba 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -112,7 +112,7 @@ struct mem_cgroup_per_node {
> /* Fields which get updated often at the end. */
> struct lruvec lruvec;
> CACHELINE_PADDING(_pad2_);
> - unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
> + atomic_long_t lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
> struct mem_cgroup_reclaim_iter iter;
>
> #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
> @@ -903,10 +903,15 @@ static inline
> unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
> enum lru_list lru, int zone_idx)
> {
> + long val;
> struct mem_cgroup_per_node *mz;
>
> mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
> + val = atomic_long_read(&mz->lru_zone_size[zone_idx][lru]);
> + if (WARN_ON_ONCE(val < 0))
> + return 0;
> +
> + return val;
> }
>
> void __mem_cgroup_handle_over_high(gfp_t gfp_mask);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9b07db2cb232..d5da09fbe43e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1273,28 +1273,12 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> int zid, int nr_pages)
> {
> struct mem_cgroup_per_node *mz;
> - unsigned long *lru_size;
> - long size;
>
> if (mem_cgroup_disabled())
> return;
>
> mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - lru_size = &mz->lru_zone_size[zid][lru];
> -
> - if (nr_pages < 0)
> - *lru_size += nr_pages;
> -
> - size = *lru_size;
> - if (WARN_ONCE(size < 0,
> - "%s(%p, %d, %d): lru_size %ld\n",
> - __func__, lruvec, lru, nr_pages, size)) {
> - VM_BUG_ON(1);
> - *lru_size = 0;
> - }
> -
> - if (nr_pages > 0)
> - *lru_size += nr_pages;
> + atomic_long_add(nr_pages, &mz->lru_zone_size[zid][lru]);
> }
>
> /**
>
> ---
> base-commit: 1ef4e3be45a85a103a667cc39fd68c3826e6acb9
> change-id: 20251211-mz-lru-size-cleanup-c81deccfd5d7
>
> Best regards,
--
Best regards,
Ridong
On Mon, Dec 15, 2025 at 3:38 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
> On 2025/12/15 14:45, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > commit ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size")
> > introduced a sanity check to catch memcg counter underflow, which is
> > more like a workaround for another bug: lru_zone_size is unsigned, so
> > underflow will wrap it around and return an enormously large number,
> > then the memcg shrinker will loop almost forever as the calculated
> > number of folios to shrink is huge. That commit also checks if a zero
> > value matches the empty LRU list, so we have to hold the LRU lock, and
> > do the counter adding differently depending on whether the nr_pages is
> > negative.
> >
> > But later commit b4536f0c829c ("mm, memcg: fix the active list aging for
> > lowmem requests when memcg is enabled") already removed the LRU
> > emptiness check, doing the adding differently is meaningless now. And if
>
> Agree.
>
> I did submit a patch to address that, but it was not accepted.
>
> For reference, here is the link to the discussion:
> https://lore.kernel.org/lkml/CAOUHufbCCkOBGcSPZqNY+FXcrH8+U7_nRvftzOzKUBS4hn+kuQ@mail.gmail.com/
>
Thanks for letting me know, I wasn't aware that you noticed this too.
From my side, what I'm noticing is that lru_zone_size has only one
reader: shrink_lruvec -> get_scan_count -> lruvec_lru_size, while the
counter is updated for every folio that goes on or off the LRU.
The oldest commit that introduced this was trying to catch an
underflow, with an extra sanity check to catch an LRU emptiness
mismatch. A later commit removed the LRU emptiness check, and the only
thing left here is the underflow check.
An LRU counter leak (if there is one) may happen at any time for
different reasons, and I think the moment an updater sees an
underflowed value is likely not the moment the actual leak happened
(e.g. a folio was removed without updating the counter minutes ago
while other folios were still on the LRU, so an updater only triggers
the WARN much later). So the major issue here is the underflow
mitigation.
Turning it into an atomic long should help mitigate both underflow and
races (e.g. forgetting to put the counter update inside the LRU lock).
Overflow is really unlikely to happen IMO, and if it comes from
corruption, corruption could happen anywhere.
The reason I sent this patch is that I'm trying to move
mem_cgroup_update_lru_size out of the LRU lock scope in another series
for some other feature, and I wanted to gather comments on this
particular sanity check first. It also seems like a valid cleanup and
micro-optimization on its own.
On 2025/12/15 16:29, Kairui Song wrote:
> On Mon, Dec 15, 2025 at 3:38 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>>
>> On 2025/12/15 14:45, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> commit ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size")
>>> introduced a sanity check to catch memcg counter underflow, which is
>>> more like a workaround for another bug: lru_zone_size is unsigned, so
>>> underflow will wrap it around and return an enormously large number,
>>> then the memcg shrinker will loop almost forever as the calculated
>>> number of folios to shrink is huge. That commit also checks if a zero
>>> value matches the empty LRU list, so we have to hold the LRU lock, and
>>> do the counter adding differently depending on whether the nr_pages is
>>> negative.
>>>
>>> But later commit b4536f0c829c ("mm, memcg: fix the active list aging for
>>> lowmem requests when memcg is enabled") already removed the LRU
>>> emptiness check, doing the adding differently is meaningless now. And if
>>
>> Agree.
>>
>> I did submit a patch to address that, but it was not accepted.
>>
>> For reference, here is the link to the discussion:
>> https://lore.kernel.org/lkml/CAOUHufbCCkOBGcSPZqNY+FXcrH8+U7_nRvftzOzKUBS4hn+kuQ@mail.gmail.com/
>>
>
> Thanks for letting me know, I wasn't aware that you noticed this too.
>
> From my side I'm noticing that, lru_zone_size has only one reader:
> shrink_lruvec -> get_scan_count -> lruvec_lru_size, while the updater
> is every folio on LRU.
>
> The oldest commit introduced this was trying to catch an underflow,
> with extra sanity check to catch LRU emptiness mis-match. A later
> commit removed the LRU emptiness mis-match, and the only thing left
> here is the underflow check.
>
> LRU counter leak (if there are) may happen anytime due to different
> reasons, and I think the time an updater sees an underflowed value is
> not unlikely to be the time the actual leak happens. (e.g. A folio was
This is exactly our concern. If we don’t read the LRU size, we won’t get a warning even if it has
underflowed. I’d like to hear the experts’ opinions on this.
> removed without updating the counter minutes ago while there are other
> folios still on the LRU, then an updater will trigger the WARN much
> later). So the major issue here is the underflow mitigation.
>
> Turning it into an atomic long should help mitigate both underflow,
> and race (e.g. forgot to put the counter update inside LRU lock).
> Overflow is really unlikely to happen IMO, and if that's corruption,
> corruption could happen anywhere.
>
> The reason I sent this patch is trying to move
> mem_cgroup_update_lru_size out of lru lock scope in another series for
> some other feature, just to gather some comments for this particular
> sanity check, it seems a valid clean up and micro optimization on its
> own.
I agree — it is confusing. If several of us have been confused by this, maybe it’s time we consider
making some changes.
--
Best regards,
Ridong
On Mon, Dec 15, 2025 at 4:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Dec 15, 2025 at 3:38 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> >
> > On 2025/12/15 14:45, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > commit ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size")
> > > introduced a sanity check to catch memcg counter underflow, which is
> > > more like a workaround for another bug: lru_zone_size is unsigned, so
> > > underflow will wrap it around and return an enormously large number,
> > > then the memcg shrinker will loop almost forever as the calculated
> > > number of folios to shrink is huge. That commit also checks if a zero
> > > value matches the empty LRU list, so we have to hold the LRU lock, and
> > > do the counter adding differently depending on whether the nr_pages is
> > > negative.
> > >
> > > But later commit b4536f0c829c ("mm, memcg: fix the active list aging for
> > > lowmem requests when memcg is enabled") already removed the LRU
> > > emptiness check, doing the adding differently is meaningless now. And if
> >
> > Agree.
> >
> > I did submit a patch to address that, but it was not accepted.
> >
> > For reference, here is the link to the discussion:
> > https://lore.kernel.org/lkml/CAOUHufbCCkOBGcSPZqNY+FXcrH8+U7_nRvftzOzKUBS4hn+kuQ@mail.gmail.com/
> >
>
> Thanks for letting me know, I wasn't aware that you noticed this too.
>
> From my side I'm noticing that, lru_zone_size has only one reader:
> shrink_lruvec -> get_scan_count -> lruvec_lru_size, while the updater
> is every folio on LRU.
>
> The oldest commit introduced this was trying to catch an underflow,
> with extra sanity check to catch LRU emptiness mis-match. A later
> commit removed the LRU emptiness mis-match, and the only thing left
> here is the underflow check.
>
> LRU counter leak (if there are) may happen anytime due to different
> reasons, and I think the time an updater sees an underflowed value is
> not unlikely to be the time the actual leak happens. (e.g. A folio was
> removed without updating the counter minutes ago while there are other
> folios still on the LRU, then an updater will trigger the WARN much
> later). So the major issue here is the underflow mitigation.
>
> Turning it into an atomic long should help mitigate both underflow,
> and race (e.g. forgot to put the counter update inside LRU lock).
> Overflow is really unlikely to happen IMO, and if that's corruption,
> corruption could happen anywhere.
BTW, I think maybe we can add a few more WARN_ON checks at LRU destruction time?
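Something like the below is what I have in mind, just a rough sketch;
the helper name is made up and the real call site would be wherever the
per-node structure gets torn down:

	static void lru_zone_size_sanity_check(struct mem_cgroup_per_node *mz)
	{
		enum lru_list lru;
		int zid;

		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
			for_each_lru(lru) {
				/*
				 * A non-zero count on a dying memcg means some
				 * update site leaked or double-counted folios.
				 */
				WARN_ON_ONCE(atomic_long_read(&mz->lru_zone_size[zid][lru]));
			}
		}
	}

That way a big leak would still get reported eventually even if nobody
happens to read the counter, which might also help with the concern
above about only warning on the read side.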