In memory_max_write() we first set memcg->memory.max and only then
try to enforce it in a loop. What if, while we are in that loop, someone
else has changed memcg->memory.max but we are still trying to enforce
the old value? I believe this can lead to nasty consequences like getting
an OOM on a perfectly fine cgroup within its limits, or excess reclaim.

We also have exactly the same thing in memory_high_write().

So let's stop enforcing the old limit if we already have a new one.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
mm/memcontrol.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 61932c9215e7..81b303728491 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6769,6 +6769,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
 		unsigned long reclaimed;
 
+		if (memcg->memory.high != high)
+			break;
+
 		if (nr_pages <= high)
 			break;
 
@@ -6817,6 +6820,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
 
+		if (memcg->memory.max != max)
+			break;
+
 		if (nr_pages <= max)
 			break;
 
--
2.43.0
On 3/20/24 06:03, Pavel Tikhomirov wrote:
> In memory_max_write() we first set memcg->memory.max and only then
> try to enforce it in a loop. What if, while we are in that loop, someone
> else has changed memcg->memory.max but we are still trying to enforce
> the old value? I believe this can lead to nasty consequences like getting
> an OOM on a perfectly fine cgroup within its limits, or excess reclaim.
Concurrent writes to the same cgroup control file are not possible as the
underlying kernfs_open_file structure has a mutex that serializes access
to the file. Concurrent writes to different cgroup control files are
possible, though.
Cheers,
Longman
>
> We also have exactly the same thing in memory_high_write().
>
> So let's stop enforcing the old limit if we already have a new one.
>
> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> ---
> mm/memcontrol.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 61932c9215e7..81b303728491 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6769,6 +6769,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> unsigned long nr_pages = page_counter_read(&memcg->memory);
> unsigned long reclaimed;
>
> + if (memcg->memory.high != high)
> + break;
> +
> if (nr_pages <= high)
> break;
>
> @@ -6817,6 +6820,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> for (;;) {
> unsigned long nr_pages = page_counter_read(&memcg->memory);
>
> + if (memcg->memory.max != max)
> + break;
> +
> if (nr_pages <= max)
> break;
>
On 21/03/2024 01:09, Waiman Long wrote:
>
> On 3/20/24 06:03, Pavel Tikhomirov wrote:
>> In memory_max_write() we first set memcg->memory.max and only then
>> try to enforce it in a loop. What if, while we are in that loop, someone
>> else has changed memcg->memory.max but we are still trying to enforce
>> the old value? I believe this can lead to nasty consequences like getting
>> an OOM on a perfectly fine cgroup within its limits, or excess reclaim.
>
> Concurrent writes to the same cgroup control file are not possible as the
> underlying kernfs_open_file structure has a mutex that serializes access
> to the file. Concurrent writes to different cgroup control files are
> possible, though.
Thanks for pointing this out, now I see it: in kernfs_fop_write_iter()
we take of->mutex before ops->write() -> cgroup_file_write(). That means
the patch is not needed.
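
Roughly, the serialization looks like this (heavily simplified from
fs/kernfs/file.c, not the literal code):

/* Heavily simplified sketch of kernfs_fop_write_iter(); not verbatim. */
static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
{
	struct kernfs_open_file *of = kernfs_of(iocb->ki_filp);
	size_t len = iov_iter_count(iter);
	char *buf;
	ssize_t ret;

	buf = kmalloc(len + 1, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	if (copy_from_iter(buf, len, iter) != len) {
		kfree(buf);
		return -EFAULT;
	}
	buf[len] = '\0';

	/* of->mutex serializes all writers on this open file ... */
	mutex_lock(&of->mutex);
	/*
	 * ... so cgroup_file_write() -> memory_max_write() never runs
	 * concurrently for the same open file.
	 */
	ret = kernfs_ops(of->kn)->write(of, buf, len, iocb->ki_pos);
	mutex_unlock(&of->mutex);

	kfree(buf);
	return ret;
}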
>
> Cheers,
> Longman
>
>>
>> We also have exactly the same thing in memory_high_write().
>>
>> So let's stop enforcing the old limit if we already have a new one.
>>
>> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>> ---
>> mm/memcontrol.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 61932c9215e7..81b303728491 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -6769,6 +6769,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>> unsigned long nr_pages = page_counter_read(&memcg->memory);
>> unsigned long reclaimed;
>> + if (memcg->memory.high != high)
>> + break;
>> +
>> if (nr_pages <= high)
>> break;
>> @@ -6817,6 +6820,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>> for (;;) {
>> unsigned long nr_pages = page_counter_read(&memcg->memory);
>> + if (memcg->memory.max != max)
>> + break;
>> +
>> if (nr_pages <= max)
>> break;
>
--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
On Wed 20-03-24 13:09:07, Waiman Long wrote:
> 
> On 3/20/24 06:03, Pavel Tikhomirov wrote:
> > In memory_max_write() we first set memcg->memory.max and only then
> > try to enforce it in a loop. What if, while we are in that loop, someone
> > else has changed memcg->memory.max but we are still trying to enforce
> > the old value? I believe this can lead to nasty consequences like getting
> > an OOM on a perfectly fine cgroup within its limits, or excess reclaim.
> 
> Concurrent writes to the same cgroup control file are not possible as the
> underlying kernfs_open_file structure has a mutex that serializes access
> to the file.

This is good to know and I was not aware of that. Thanks!
-- 
Michal Hocko
SUSE Labs
On Wed 20-03-24 18:12:50, Michal Hocko wrote:
> On Wed 20-03-24 13:09:07, Waiman Long wrote:
> > 
> > On 3/20/24 06:03, Pavel Tikhomirov wrote:
> > > In memory_max_write() we first set memcg->memory.max and only then
> > > try to enforce it in a loop. What if, while we are in that loop, someone
> > > else has changed memcg->memory.max but we are still trying to enforce
> > > the old value? I believe this can lead to nasty consequences like getting
> > > an OOM on a perfectly fine cgroup within its limits, or excess reclaim.
> > 
> > Concurrent writes to the same cgroup control file are not possible as the
> > underlying kernfs_open_file structure has a mutex that serializes access
> > to the file.
> 
> This is good to know and I was not aware of that. Thanks!

Btw. even if the interface itself is serialized, uncoordinated userspace is
still timing dependent and therefore fundamentally racy.
-- 
Michal Hocko
SUSE Labs
On Wed 20-03-24 18:03:30, Pavel Tikhomirov wrote:
> In memory_max_write() we first set memcg->memory.max and only then
> try to enforce it in a loop. What if, while we are in that loop, someone
> else has changed memcg->memory.max but we are still trying to enforce
> the old value? I believe this can lead to nasty consequences like getting
> an OOM on a perfectly fine cgroup within its limits, or excess reclaim.
I would argue that uncoordinated hard limit configuration can cause
problems of its own. Besides, how is this any different from changing
the high limit while we are inside the reclaim loop?
> We also have exactly the same thing in memory_high_write().
>
> So let's stop enforcing the old limit if we already have a new one.
I do not see any reason why this would be harmful, I just do not see why
this is a real thing or why the new behavior is any better for racing
updaters, as those are not deterministic anyway. If you have an actual
use case then more details would really help to justify this change.

The existing behavior makes some sense as it enforces the given limit
deterministically.
> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> ---
> mm/memcontrol.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 61932c9215e7..81b303728491 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6769,6 +6769,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> unsigned long nr_pages = page_counter_read(&memcg->memory);
> unsigned long reclaimed;
>
> + if (memcg->memory.high != high)
> + break;
> +
> if (nr_pages <= high)
> break;
>
> @@ -6817,6 +6820,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> for (;;) {
> unsigned long nr_pages = page_counter_read(&memcg->memory);
>
> + if (memcg->memory.max != max)
> + break;
> +
> if (nr_pages <= max)
> break;
>
> --
> 2.43.0
--
Michal Hocko
SUSE Labs
On 20/03/2024 18:28, Michal Hocko wrote:
> On Wed 20-03-24 18:03:30, Pavel Tikhomirov wrote:
>> In memory_max_write() we first set memcg->memory.max and only then
>> try to enforce it in a loop. What if, while we are in that loop, someone
>> else has changed memcg->memory.max but we are still trying to enforce
>> the old value? I believe this can lead to nasty consequences like getting
>> an OOM on a perfectly fine cgroup within its limits, or excess reclaim.
>
> I would argue that uncoordinated hard limit configuration can cause
> problems of its own.
Sorry, didn't know that.
> Besides, how is this any different from changing
> the high limit while we are inside the reclaim loop?
I believe the reclaim loop rereads the limit on each iteration, e.g. in
reclaim_high(), so it should always be enforcing the right limit.
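
For reference, a condensed sketch of reclaim_high() (simplified, details
vary by kernel version):

/* Condensed sketch of mm/memcontrol.c:reclaim_high(); not verbatim. */
static void reclaim_high(struct mem_cgroup *memcg, unsigned int nr_pages,
			 gfp_t gfp_mask)
{
	do {
		/* the high limit is reread from the memcg on every pass ... */
		if (page_counter_read(&memcg->memory) <=
		    READ_ONCE(memcg->memory.high))
			continue;

		memcg_memory_event(memcg, MEMCG_HIGH);

		/* ... so reclaim always targets the currently configured value */
		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
					     MEMCG_RECLAIM_MAY_SWAP);
	} while ((memcg = parent_mem_cgroup(memcg)) &&
		 !mem_cgroup_is_root(memcg));
}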
>
>> We also have exactly the same thing in memory_high_write().
>>
>> So let's stop enforcing the old limit if we already have a new one.
>
> I do not see any reason why this would be harmful, I just do not see why
> this is a real thing or why the new behavior is any better for racing
> updaters, as those are not deterministic anyway. If you have an actual
> use case then more details would really help to justify this change.
>
> The existing behavior makes some sense as it enforces the given limit
> deterministically.
I don't have any actual problem, use case, or reproducer at hand, I only
see a potential problem:

Let's imagine that:

a) We set the cgroup max limit to some small value, memory_max_write()
updates memcg->memory.max and starts spinning in the loop as it wants to
reclaim some memory which does not fit in the new limit.

b) We don't need the small limit anymore and we raise it to a big value,
but memory_max_write() from (a) is still spinning. If we are unlucky
enough and the processes of the cgroup keep consuming memory fast enough
to compensate for the reclaim done by memory_max_write() from (a), it
will continue spinning there forever.

Yes, it is not that bad, because memory_max/high_write() also constantly
checks for pending signals in the loop, so they won't actually get
irreversibly stuck. But I just thought it was worth fixing.
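
To make the scenario concrete, a condensed sketch of the memory_max_write()
enforcement loop with the proposed check, with the drain/retry/OOM handling
collapsed into comments (not the literal code):

/* Condensed sketch of memory_max_write()'s loop; not the literal code. */
xchg(&memcg->memory.max, max);		/* (a) publishes the new limit first */

for (;;) {
	unsigned long nr_pages = page_counter_read(&memcg->memory);

	/* proposed check: give up if another writer installed a newer limit */
	if (memcg->memory.max != max)
		break;

	if (nr_pages <= max)
		break;

	/* the loop stays killable, which bounds the worst case today */
	if (signal_pending(current))
		break;

	/*
	 * Real code: drain per-cpu charge caches once, retry direct reclaim
	 * a few times, and finally invoke the memcg OOM killer. If the
	 * workload keeps charging as fast as we reclaim against a stale
	 * 'max', this loop keeps going (scenario (b) above).
	 */
	if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, GFP_KERNEL,
					  MEMCG_RECLAIM_MAY_SWAP))
		break;
}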
>
>> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>> ---
>> mm/memcontrol.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 61932c9215e7..81b303728491 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -6769,6 +6769,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>> unsigned long nr_pages = page_counter_read(&memcg->memory);
>> unsigned long reclaimed;
>>
>> + if (memcg->memory.high != high)
>> + break;
>> +
>> if (nr_pages <= high)
>> break;
>>
>> @@ -6817,6 +6820,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>> for (;;) {
>> unsigned long nr_pages = page_counter_read(&memcg->memory);
>>
>> + if (memcg->memory.max != max)
>> + break;
>> +
>> if (nr_pages <= max)
>> break;
>>
>> --
>> 2.43.0
>
--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
On Wed, Mar 20, 2024 at 06:55:05PM +0800, Pavel Tikhomirov wrote:
> 
> 
> On 20/03/2024 18:28, Michal Hocko wrote:
> > On Wed 20-03-24 18:03:30, Pavel Tikhomirov wrote:
> > > In memory_max_write() we first set memcg->memory.max and only then
> > > try to enforce it in a loop. What if, while we are in that loop, someone
> > > else has changed memcg->memory.max but we are still trying to enforce
> > > the old value? I believe this can lead to nasty consequences like getting
> > > an OOM on a perfectly fine cgroup within its limits, or excess reclaim.
> > 
> > I would argue that uncoordinated hard limit configuration can cause
> > problems of its own.
> 
> Sorry, didn't know that.
> 
> > Besides, how is this any different from changing
> > the high limit while we are inside the reclaim loop?
> 
> I believe the reclaim loop rereads the limit on each iteration, e.g. in
> reclaim_high(), so it should always be enforcing the right limit.
> 
> > 
> > > We also have exactly the same thing in memory_high_write().
> > > 
> > > So let's stop enforcing the old limit if we already have a new one.
> > 
> > I do not see any reason why this would be harmful, I just do not see why
> > this is a real thing or why the new behavior is any better for racing
> > updaters, as those are not deterministic anyway. If you have an actual
> > use case then more details would really help to justify this change.
> > 
> > The existing behavior makes some sense as it enforces the given limit
> > deterministically.
> 
> I don't have any actual problem, use case, or reproducer at hand, I only
> see a potential problem:

If the problem is only potential and also not very severe (it's not a crash
or memory corruption or something like this), I'd say let's keep things as
they are.

Thanks!
On Wed 20-03-24 18:55:05, Pavel Tikhomirov wrote:
> 
> 
> On 20/03/2024 18:28, Michal Hocko wrote:
> > On Wed 20-03-24 18:03:30, Pavel Tikhomirov wrote:
> > > In memory_max_write() we first set memcg->memory.max and only then
> > > try to enforce it in a loop. What if, while we are in that loop, someone
> > > else has changed memcg->memory.max but we are still trying to enforce
> > > the old value? I believe this can lead to nasty consequences like getting
> > > an OOM on a perfectly fine cgroup within its limits, or excess reclaim.
> > 
> > I would argue that uncoordinated hard limit configuration can cause
> > problems of its own.
> 
> Sorry, didn't know that.

Well, just consider potential over-reclaim as a result of several competing
actors setting the same limit. Or a completely indeterministic final value
of the limit depending on timing. This simply cannot work reliably.

> > Besides, how is this any different from changing
> > the high limit while we are inside the reclaim loop?
> 
> I believe the reclaim loop rereads the limit on each iteration, e.g. in
> reclaim_high(), so it should always be enforcing the right limit.

The reclaim loop might happen to take quite some time...

> > > We also have exactly the same thing in memory_high_write().
> > > 
> > > So let's stop enforcing the old limit if we already have a new one.
> > 
> > I do not see any reason why this would be harmful, I just do not see why
> > this is a real thing or why the new behavior is any better for racing
> > updaters, as those are not deterministic anyway. If you have an actual
> > use case then more details would really help to justify this change.
> > 
> > The existing behavior makes some sense as it enforces the given limit
> > deterministically.
> 
> I don't have any actual problem, use case, or reproducer at hand, I only
> see a potential problem:
> 
> Let's imagine that:
> 
> a) We set the cgroup max limit to some small value, memory_max_write()
> updates memcg->memory.max and starts spinning in the loop as it wants to
> reclaim some memory which does not fit in the new limit.
> 
> b) We don't need the small limit anymore and we raise it to a big value,
> but memory_max_write() from (a) is still spinning. If we are unlucky
> enough and the processes of the cgroup keep consuming memory fast enough
> to compensate for the reclaim done by memory_max_write() from (a), it
> will continue spinning there forever.

This is a killable operation, so if you change your mind about the limit
while the current update is still in progress, just terminate it rather
than overriding it from a different process.

> Yes, it is not that bad, because memory_max/high_write() also constantly
> checks for pending signals in the loop, so they won't actually get
> irreversibly stuck. But I just thought it was worth fixing.

If we want to fix this parallel limit setting then we should also think
about reasonable and predictable behavior, and that would likely require
some sort of locking IMO.
-- 
Michal Hocko
SUSE Labs
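
For illustration only, serializing competing limit writers could look
roughly like this; none of these names exist in the tree, the mutex and
function names are made up for the sketch:

/* Hypothetical sketch only -- not existing kernel code. */
static DEFINE_MUTEX(memcg_max_mutex);

static ssize_t memory_max_write_serialized(struct kernfs_open_file *of,
					   char *buf, size_t nbytes, loff_t off)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
	unsigned long max;
	int err;

	buf = strstrip(buf);
	err = page_counter_memparse(buf, "max", &max);
	if (err)
		return err;

	/*
	 * Hold the mutex across both publishing the new limit and the
	 * enforcement loop, so two writers cannot interleave and leave the
	 * final limit, or the amount of reclaim performed, timing dependent.
	 */
	mutex_lock(&memcg_max_mutex);
	xchg(&memcg->memory.max, max);
	/* ... enforcement loop as in memory_max_write() ... */
	mutex_unlock(&memcg_max_mutex);

	return nbytes;
}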