I encountered a scenario where a machine with 1.5TB of memory, while
testing the Spark TPCDS 3TB dataset, experienced a significant
concentration of page cache usage on one of the NUMA nodes. I discovered
that the DataNode process had requested a large amount of page cache,
and most of it was concentrated on a single NUMA node, ultimately
exhausting the memory of that node. At that point, all other processes
on that NUMA node had to allocate memory across NUMA nodes, or even
across sockets. This eventually degraded the end-to-end performance of
the Spark test.

I do not want to restart the Spark DataNode service during business
operations. This issue can be resolved by migrating the DataNode into a
cpuset, dropping the cache, and setting cpuset.memory_spread_page so
that it requests memory evenly across nodes; the core business threads
can still allocate local NUMA memory. After using
cpuset.memory_spread_page, the performance of the tpcds-99 test improved
by 2%.

The key point is that spreading the DataNode's page cache evenly
(rather than keeping the current NUMA distribution) does not
significantly affect end-to-end performance, whereas keeping core
business processes, such as Executors, on their local NUMA node does
have a noticeable impact on end-to-end performance.

However, I found that cgroup v2 does not provide this interface. I
believe this interface still holds value in addressing issues caused by
uneven distribution of page cache allocation among process groups.

Thus I add cpuset.mems.spread_page to the cpuset v2 interface.

Cai Xinchen (2):
  cpuset: Move cpuset1_update_spread_flag to cpuset
  cpuset: Add spread_page interface to cpuset v2

 kernel/cgroup/cpuset-internal.h |  6 ++--
 kernel/cgroup/cpuset-v1.c       | 25 +----------------
 kernel/cgroup/cpuset.c          | 49 ++++++++++++++++++++++++++++++++-
 3 files changed, 51 insertions(+), 29 deletions(-)

--
2.34.1
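For illustration, the v1 workaround described above boils down to three
writes, as in the minimal C sketch below. The cgroup path and PID are
hypothetical; it assumes a mounted cpuset v1 hierarchy whose cpuset.cpus
and cpuset.mems are already configured, and root privileges.

/*
 * Minimal sketch of the v1 workaround (hypothetical cgroup path and
 * PID; assumes a cpuset v1 hierarchy with cpuset.cpus and cpuset.mems
 * already configured, run as root).
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s", val);
        return fclose(f);
}

int main(void)
{
        const char *cg = "/sys/fs/cgroup/cpuset/datanode"; /* hypothetical */
        char path[256];

        /* Move the DataNode process (example PID) into the cpuset. */
        snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
        write_str(path, "12345");

        /* Spread new page cache allocations evenly over cpuset.mems. */
        snprintf(path, sizeof(path), "%s/cpuset.memory_spread_page", cg);
        write_str(path, "1");

        /* Drop clean page cache so it is re-faulted with spreading applied. */
        write_str("/proc/sys/vm/drop_caches", "1");
        return 0;
}

Dropping the cache is what forces the existing, node-concentrated page
cache to be re-faulted under the new spreading behaviour.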
On 9/30/25 5:35 AM, Cai Xinchen wrote:
> I encountered a scenario where a machine with 1.5TB of memory,
> while testing the Spark TPCDS 3TB dataset, experienced a significant
> concentration of page cache usage on one of the NUMA nodes.
> I discovered that the DataNode process had requested a large amount
> of page cache. most of the page cache was concentrated in one NUMA node,
> ultimately leading to the exhaustion of memory in that NUMA node.
> At this point, all other processes in that NUMA node have to alloc
> memory across NUMA nodes, or even across sockets. This eventually
> caused a degradation in the end-to-end performance of the Spark test.
>
> I do not want to restart the Spark DataNode service during business
> operations. This issue can be resolved by migrating the DataNode into
> a cpuset, dropping the cache, and setting cpuset.memory_spread_page to
> allow it to evenly request memory. The core business threads could still
> allocate local numa memory. After using cpuset.memory_spread_page, the
> performance in the tpcds-99 test is improved by 2%.
>
> The key point is that the even distribution of page cache within the
> DataNode process (rather than the current NUMA distribution) does not
> significantly affect end-to-end performance. However, the allocation
> of core business processes, such as Executors, to the same NUMA node
> does have a noticeable impact on end-to-end performance.
>
> However, I found that cgroup v2 does not provide this interface. I
> believe this interface still holds value in addressing issues caused
> by uneven distribution of page cache allocation among process groups.
>
> Thus I add cpuset.mems.spread_page to cpuset v2 interface.
>
> Cai Xinchen (2):
>   cpuset: Move cpuset1_update_spread_flag to cpuset
>   cpuset: Add spread_page interface to cpuset v2
>
>  kernel/cgroup/cpuset-internal.h |  6 ++--
>  kernel/cgroup/cpuset-v1.c       | 25 +----------------
>  kernel/cgroup/cpuset.c          | 49 ++++++++++++++++++++++++++++++++-
>  3 files changed, 51 insertions(+), 29 deletions(-)
>

The spread_page flag is only used in filemap_alloc_folio_noprof() of
mm/filemap.c. When it is set, the code attempts to spread the folio/page
allocations across different nodes. As noted by Michal, it is more or
less equivalent to setting an MPOL_INTERLEAVE memory policy with
set_mempolicy(2) using the node mask of cpuset.mems. set_mempolicy(2)
works at a finer per-task granularity rather than covering all the tasks
within a cpuset; of course, it requires making changes to the
application instead of writing to an external cgroup control file.

cpuset.memory_spread_page is a legacy interface; we need good
justification if we want to enable it in cgroup v2.

Cheers,
Longman
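As a point of comparison for the per-task alternative Longman mentions,
a minimal sketch of setting MPOL_INTERLEAVE from inside the application
might look as follows. It assumes the libnuma wrapper for
set_mempolicy(2) from <numaif.h> and an example node mask covering
nodes 0 and 1 (values not taken from this thread).

/*
 * Sketch: interleave this task's future allocations (including its
 * page cache fills) across nodes 0 and 1 via set_mempolicy(2).
 * Uses the libnuma wrapper from <numaif.h>; link with -lnuma.
 * Only the calling task is affected, and only pages allocated after
 * this call.
 */
#include <numaif.h>
#include <stdio.h>

int main(void)
{
        /* Example node mask: bits 0 and 1 -> nodes 0 and 1. */
        unsigned long nodemask = (1UL << 0) | (1UL << 1);

        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                          8 * sizeof(nodemask)) != 0) {
                perror("set_mempolicy");
                return 1;
        }
        return 0;
}

This is the finer-grained, per-task route: it requires a change in the
application (or a wrapper that sets the policy before exec), which is
exactly the trade-off against an external cgroup control file.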
Hello Xinchen.
On Tue, Sep 30, 2025 at 09:35:50AM +0000, Cai Xinchen <caixinchen1@huawei.com> wrote:
> I discovered that the DataNode process had requested a large amount
> of page cache. most of the page cache was concentrated in one NUMA node,
> ultimately leading to the exhaustion of memory in that NUMA node.
[...]
> This issue can be resolved by migrating the DataNode into
> a cpuset, dropping the cache, and setting cpuset.memory_spread_page to
> allow it to evenly request memory.
Would it work in your case instead to apply memory.max or
MPOL_INTERLEAVE to the DataNode process?
In any case, please see commit 012c419f8d248 ("cgroup/cpuset-v1: Add
deprecation messages to memory_spread_page and memory_spread_slab"),
since your patchset would need to touch those places too.
Thanks,
Michal
I tried using memory.max, but clearing the page cache that way caused
performance degradation. The MPOL_INTERLEAVE setting requires restarting
the DataNode service. In this scenario the issue could be resolved by
restarting the service, but I would prefer not to restart the DataNode
if possible, as that would cause a period of service interruption.
On 9/30/2025 8:05 PM, Michal Koutný wrote:
> Hello Xinchen.
>
> On Tue, Sep 30, 2025 at 09:35:50AM +0000, Cai Xinchen <caixinchen1@huawei.com> wrote:
>> I discovered that the DataNode process had requested a large amount
>> of page cache. most of the page cache was concentrated in one NUMA node,
>> ultimately leading to the exhaustion of memory in that NUMA node.
> [...]
>> This issue can be resolved by migrating the DataNode into
>> a cpuset, dropping the cache, and setting cpuset.memory_spread_page to
>> allow it to evenly request memory.
> Would it work in your case instead to apply memory.max or apply
> MPOL_INTERLEAVE to DataNode process?
>
> In anyway, please see commit 012c419f8d248 ("cgroup/cpuset-v1: Add
> deprecation messages to memory_spread_page and memory_spread_slab")
> since your patchset would need to touch that place(s) too.
>
> Thanks,
> Michal
On Mon, Oct 20, 2025 at 02:20:30PM +0800, Cai Xinchen <caixinchen1@huawei.com> wrote:
> The MPOL_INTERLEAVE setting requires restarting the DataNode service.
> In this scenario, the issue can be resolved by restarting the service,

\o/

> but I would prefer not to restart the DataNode service if possible,
> as it would cause a period of service interruption.

AFAICS, the implementation of cpuset's page spreading only works for new
allocations (filemap_alloc_folio_noprof()/cpuset_do_page_mem_spread())
and there is no migration in cpuset1_update_task_spread_flags(). So this
simple v1-like spreading would still require a restart of the service.

(This challenge of dynamism is also a reason why it is not ready for v2,
IMO.)

HTH,
Michal
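To make that concrete, the allocation hook being discussed behaves
roughly like the simplified paraphrase below (not the verbatim kernel
source; helper names can vary between kernel versions): the spread flag
only chooses the node for newly allocated page cache folios, and
nothing in this path migrates folios that already exist.

/*
 * Simplified paraphrase of the page cache allocation path in
 * mm/filemap.c (not verbatim kernel source). With memory_spread_page
 * set on the task's cpuset, each new folio is placed on a node
 * rotated over cpuset.mems; otherwise the task's normal policy is
 * used. Existing folios are never moved here.
 */
struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
        if (cpuset_do_page_mem_spread()) {
                int n = cpuset_mem_spread_node(); /* round-robin node */

                return __folio_alloc_node_noprof(gfp, order, n);
        }
        return folio_alloc_noprof(gfp, order);
}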
Therefore, when setting memory_spread_page, I also need to perform a
drop-cache operation; this way, there is no need to restart the service.
Some previous ideas, such as using procfs or set_mempolicy to change the
mempolicy of non-current processes, could also achieve the goal after
dropping the cache. [1] Of course, changing the current memory
distribution does not necessarily require dropping the cache, but it
will inevitably lead to a series of memory migrations.

[1] https://lore.kernel.org/all/20231122211200.31620-1-gregory.price@memverge.com/
    https://lore.kernel.org/all/ZWS19JFHm_LFSsFd@tiehlicka/

On 11/3/2025 9:39 PM, Michal Koutný wrote:
> On Mon, Oct 20, 2025 at 02:20:30PM +0800, Cai Xinchen <caixinchen1@huawei.com> wrote:
>> The MPOL_INTERLEAVE setting requires restarting the DataNode service.
>> In this scenario, the issue can be resolved by restarting the service,
> \o/
>
>> but I would prefer not to restart the DataNode service if possible,
>> as it would cause a period of service interruption.
> AFAICS, the implementation of cpuset's page spreading only works for new
> allocations (filemap_alloc_folio_noprof()/cpuset_do_page_mem_spread())
> and there is no migration in cpuset1_update_task_spread_flags(). So this
> simple v1-like spreading would still require a restart of the service.
>
> (This challenge of dynamism is also a reason why it is not ready for v2,
> IMO.)
>
> HTH,
> Michal
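On the "series of memory migrations" point: moving a running process's
already-allocated pages between nodes without restarting it is what
migrate_pages(2) provides. Below is a minimal sketch, assuming the
libnuma wrapper from <numaif.h>, an example PID of 12345, a move from
node 0 to node 1 (all hypothetical values), and sufficient privileges
(e.g. CAP_SYS_NICE).

/*
 * Sketch: migrate the existing pages of another process (example PID)
 * from node 0 to node 1 via migrate_pages(2). Link with -lnuma.
 * This moves already-allocated memory; it does not change the policy
 * applied to the process's future allocations.
 */
#include <numaif.h>
#include <stdio.h>

int main(void)
{
        unsigned long old_nodes = 1UL << 0;     /* source: node 0 */
        unsigned long new_nodes = 1UL << 1;     /* destination: node 1 */
        long left;

        left = migrate_pages(12345, 8 * sizeof(unsigned long),
                             &old_nodes, &new_nodes);
        if (left < 0) {
                perror("migrate_pages");
                return 1;
        }
        /* Return value is the number of pages that could not be moved. */
        printf("pages not moved: %ld\n", left);
        return 0;
}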