I encountered a scenario where a machine with 1.5TB of memory, while
testing the Spark TPCDS 3TB dataset, experienced a significant
concentration of page cache usage on one of the NUMA nodes. I discovered
that the DataNode process had requested a large amount of page cache,
and most of it was concentrated on a single NUMA node, ultimately
exhausting the memory of that node. At that point, all other processes
on that NUMA node had to allocate memory across NUMA nodes, or even
across sockets. This eventually degraded the end-to-end performance of
the Spark test.

I do not want to restart the Spark DataNode service during business
operations. This issue can be resolved by migrating the DataNode into a
cpuset, dropping the cache, and setting cpuset.memory_spread_page so
that it requests memory evenly across nodes; the core business threads
can still allocate local NUMA memory. After using
cpuset.memory_spread_page, the performance of the tpcds-99 test improved
by 2%.

The key point is that spreading the DataNode's page cache evenly
(rather than keeping the current NUMA distribution) does not
significantly affect end-to-end performance, whereas keeping core
business processes, such as Executors, on their local NUMA node does
have a noticeable impact on end-to-end performance.

However, I found that cgroup v2 does not provide this interface. I
believe this interface still holds value in addressing issues caused by
uneven distribution of page cache allocation among process groups.

Thus I add cpuset.mems.spread_page to the cpuset v2 interface.

Cai Xinchen (2):
  cpuset: Move cpuset1_update_spread_flag to cpuset
  cpuset: Add spread_page interface to cpuset v2

 kernel/cgroup/cpuset-internal.h |  6 ++--
 kernel/cgroup/cpuset-v1.c       | 25 +----------------
 kernel/cgroup/cpuset.c          | 49 ++++++++++++++++++++++++++++++++-
 3 files changed, 51 insertions(+), 29 deletions(-)

--
2.34.1
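For illustration, the v1 workaround described above boils down to three
writes, as in the minimal C sketch below. The cgroup path and PID are
hypothetical; it assumes a mounted cpuset v1 hierarchy whose cpuset.cpus
and cpuset.mems are already configured, and root privileges.

/*
 * Minimal sketch of the v1 workaround (hypothetical cgroup path and
 * PID; assumes a cpuset v1 hierarchy with cpuset.cpus and cpuset.mems
 * already configured, run as root).
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s", val);
        return fclose(f);
}

int main(void)
{
        const char *cg = "/sys/fs/cgroup/cpuset/datanode"; /* hypothetical */
        char path[256];

        /* Move the DataNode process (example PID) into the cpuset. */
        snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
        write_str(path, "12345");

        /* Spread new page cache allocations evenly over cpuset.mems. */
        snprintf(path, sizeof(path), "%s/cpuset.memory_spread_page", cg);
        write_str(path, "1");

        /* Drop clean page cache so it is re-faulted with spreading applied. */
        write_str("/proc/sys/vm/drop_caches", "1");
        return 0;
}

Dropping the cache is what forces the existing, node-concentrated page
cache to be re-faulted under the new spreading behaviour.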
On 9/30/25 5:35 AM, Cai Xinchen wrote:
> I encountered a scenario where a machine with 1.5TB of memory,
> while testing the Spark TPCDS 3TB dataset, experienced a significant
> concentration of page cache usage on one of the NUMA nodes.
> I discovered that the DataNode process had requested a large amount
> of page cache. most of the page cache was concentrated in one NUMA node,
> ultimately leading to the exhaustion of memory in that NUMA node.
> At this point, all other processes in that NUMA node have to alloc
> memory across NUMA nodes, or even across sockets. This eventually
> caused a degradation in the end-to-end performance of the Spark test.
>
> I do not want to restart the Spark DataNode service during business
> operations. This issue can be resolved by migrating the DataNode into
> a cpuset, dropping the cache, and setting cpuset.memory_spread_page to
> allow it to evenly request memory. The core business threads could still
> allocate local numa memory. After using cpuset.memory_spread_page, the
> performance in the tpcds-99 test is improved by 2%.
>
> The key point is that the even distribution of page cache within the
> DataNode process (rather than the current NUMA distribution) does not
> significantly affect end-to-end performance. However, the allocation
> of core business processes, such as Executors, to the same NUMA node
> does have a noticeable impact on end-to-end performance.
>
> However, I found that cgroup v2 does not provide this interface. I
> believe this interface still holds value in addressing issues caused
> by uneven distribution of page cache allocation among process groups.
>
> Thus I add cpuset.mems.spread_page to cpuset v2 interface.
>
> Cai Xinchen (2):
>   cpuset: Move cpuset1_update_spread_flag to cpuset
>   cpuset: Add spread_page interface to cpuset v2
>
>  kernel/cgroup/cpuset-internal.h |  6 ++--
>  kernel/cgroup/cpuset-v1.c       | 25 +----------------
>  kernel/cgroup/cpuset.c          | 49 ++++++++++++++++++++++++++++++++-
>  3 files changed, 51 insertions(+), 29 deletions(-)
>

The spread_page flag is only used in filemap_alloc_folio_noprof() of
mm/filemap.c. When it is set, the code attempts to spread the folio/page
allocations across different nodes. As noted by Michal, it is more or
less equivalent to setting an MPOL_INTERLEAVE memory policy with
set_mempolicy(2) using the node mask of cpuset.mems. set_mempolicy(2)
works at a finer per-task granularity rather than covering all the tasks
within a cpuset; of course, it requires making changes to the
application instead of writing to an external cgroup control file.

cpuset.memory_spread_page is a legacy interface; we need good
justification if we want to enable it in cgroup v2.

Cheers,
Longman
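As a point of comparison for the per-task alternative Longman mentions,
a minimal sketch of setting MPOL_INTERLEAVE from inside the application
might look as follows. It assumes the libnuma wrapper for
set_mempolicy(2) from <numaif.h> and an example node mask covering
nodes 0 and 1 (values not taken from this thread).

/*
 * Sketch: interleave this task's future allocations (including its
 * page cache fills) across nodes 0 and 1 via set_mempolicy(2).
 * Uses the libnuma wrapper from <numaif.h>; link with -lnuma.
 * Only the calling task is affected, and only pages allocated after
 * this call.
 */
#include <numaif.h>
#include <stdio.h>

int main(void)
{
        /* Example node mask: bits 0 and 1 -> nodes 0 and 1. */
        unsigned long nodemask = (1UL << 0) | (1UL << 1);

        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                          8 * sizeof(nodemask)) != 0) {
                perror("set_mempolicy");
                return 1;
        }
        return 0;
}

This is the finer-grained, per-task route: it requires a change in the
application (or a wrapper that sets the policy before exec), which is
exactly the trade-off against an external cgroup control file.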
Hello Xinchen.
On Tue, Sep 30, 2025 at 09:35:50AM +0000, Cai Xinchen <caixinchen1@huawei.com> wrote:
> I discovered that the DataNode process had requested a large amount
> of page cache. most of the page cache was concentrated in one NUMA node,
> ultimately leading to the exhaustion of memory in that NUMA node.
[...]
> This issue can be resolved by migrating the DataNode into
> a cpuset, dropping the cache, and setting cpuset.memory_spread_page to
> allow it to evenly request memory.
Would it work in your case instead to apply memory.max or
MPOL_INTERLEAVE to the DataNode process?
In any case, please see commit 012c419f8d248 ("cgroup/cpuset-v1: Add
deprecation messages to memory_spread_page and memory_spread_slab"),
since your patchset would need to touch those places too.
Thanks,
Michal
I tried using memory.max, but clearing the page cache that way caused
performance degradation. The MPOL_INTERLEAVE setting requires restarting
the DataNode service. In this scenario the issue could be resolved by
restarting the service, but I would prefer not to restart the DataNode
if possible, as that would cause a period of service interruption.
On 9/30/2025 8:05 PM, Michal Koutný wrote:
> Hello Xinchen.
>
> On Tue, Sep 30, 2025 at 09:35:50AM +0000, Cai Xinchen <caixinchen1@huawei.com> wrote:
>> I discovered that the DataNode process had requested a large amount
>> of page cache. most of the page cache was concentrated in one NUMA node,
>> ultimately leading to the exhaustion of memory in that NUMA node.
> [...]
>> This issue can be resolved by migrating the DataNode into
>> a cpuset, dropping the cache, and setting cpuset.memory_spread_page to
>> allow it to evenly request memory.
> Would it work in your case instead to apply memory.max or apply
> MPOL_INTERLEAVE to DataNode process?
>
> In anyway, please see commit 012c419f8d248 ("cgroup/cpuset-v1: Add
> deprecation messages to memory_spread_page and memory_spread_slab")
> since your patchset would need to touch that place(s) too.
>
> Thanks,
> Michal
On Mon, Oct 20, 2025 at 02:20:30PM +0800, Cai Xinchen <caixinchen1@huawei.com> wrote:
> The MPOL_INTERLEAVE setting requires restarting the DataNode service.
> In this scenario, the issue can be resolved by restarting the service,

\o/

> but I would prefer not to restart the DataNode service if possible,
> as it would cause a period of service interruption.

AFAICS, the implementation of cpuset's page spreading only works for new
allocations (filemap_alloc_folio_noprof()/cpuset_do_page_mem_spread())
and there is no migration in cpuset1_update_task_spread_flags(). So this
simple v1-like spreading would still require a restart of the service.

(This challenge of dynamism is also a reason why it is not ready for v2,
IMO.)

HTH,
Michal
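To make that concrete, the allocation hook being discussed behaves
roughly like the simplified paraphrase below (not the verbatim kernel
source; helper names can vary between kernel versions): the spread flag
only chooses the node for newly allocated page cache folios, and
nothing in this path migrates folios that already exist.

/*
 * Simplified paraphrase of the page cache allocation path in
 * mm/filemap.c (not verbatim kernel source). With memory_spread_page
 * set on the task's cpuset, each new folio is placed on a node
 * rotated over cpuset.mems; otherwise the task's normal policy is
 * used. Existing folios are never moved here.
 */
struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
        if (cpuset_do_page_mem_spread()) {
                int n = cpuset_mem_spread_node(); /* round-robin node */

                return __folio_alloc_node_noprof(gfp, order, n);
        }
        return folio_alloc_noprof(gfp, order);
}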
Therefore, when setting memory_spread_page, I also need to perform a
drop-cache operation; this way, there is no need to restart the service.
Some previous ideas, such as using procfs or set_mempolicy to change the
mempolicy of non-current processes, could also achieve the goal after
dropping the cache. [1] Of course, changing the current memory
distribution does not necessarily require dropping the cache, but it
will inevitably lead to a series of memory migrations.

[1] https://lore.kernel.org/all/20231122211200.31620-1-gregory.price@memverge.com/
    https://lore.kernel.org/all/ZWS19JFHm_LFSsFd@tiehlicka/

On 11/3/2025 9:39 PM, Michal Koutný wrote:
> On Mon, Oct 20, 2025 at 02:20:30PM +0800, Cai Xinchen <caixinchen1@huawei.com> wrote:
>> The MPOL_INTERLEAVE setting requires restarting the DataNode service.
>> In this scenario, the issue can be resolved by restarting the service,
> \o/
>
>> but I would prefer not to restart the DataNode service if possible,
>> as it would cause a period of service interruption.
> AFAICS, the implementation of cpuset's page spreading only works for new
> allocations (filemap_alloc_folio_noprof()/cpuset_do_page_mem_spread())
> and there is no migration in cpuset1_update_task_spread_flags(). So this
> simple v1-like spreading would still require a restart of the service.
>
> (This challenge of dynamism is also a reason why it is not ready for v2,
> IMO.)
>
> HTH,
> Michal
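On the "series of memory migrations" point: moving a running process's
already-allocated pages between nodes without restarting it is what
migrate_pages(2) provides. Below is a minimal sketch, assuming the
libnuma wrapper from <numaif.h>, an example PID of 12345, a move from
node 0 to node 1 (all hypothetical values), and sufficient privileges
(e.g. CAP_SYS_NICE).

/*
 * Sketch: migrate the existing pages of another process (example PID)
 * from node 0 to node 1 via migrate_pages(2). Link with -lnuma.
 * This moves already-allocated memory; it does not change the policy
 * applied to the process's future allocations.
 */
#include <numaif.h>
#include <stdio.h>

int main(void)
{
        unsigned long old_nodes = 1UL << 0;     /* source: node 0 */
        unsigned long new_nodes = 1UL << 1;     /* destination: node 1 */
        long left;

        left = migrate_pages(12345, 8 * sizeof(unsigned long),
                             &old_nodes, &new_nodes);
        if (left < 0) {
                perror("migrate_pages");
                return 1;
        }
        /* Return value is the number of pages that could not be moved. */
        printf("pages not moved: %ld\n", left);
        return 0;
}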