From nobody Wed Dec 17 14:31:57 2025 Received: from canpmsgout03.his.huawei.com (canpmsgout03.his.huawei.com [113.46.200.218]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D02FC2F6164; Tue, 30 Sep 2025 09:51:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=113.46.200.218 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1759225913; cv=none; b=VlA+fQ3ckDUYnf/WAH8JP9+kDDwfIR7WCMawBqT4+ZWcLUJvZTogMM4WXN1uipVqfNENFcAmOdDkMHA7LfLUX7Gwi5LTgTRfqy/l42UKHo6BXOqWTXyYyRM6la/JDv0o65/Cpn8e9KrmJG9EK4KK9XVDWt8kZHbQFIlqanJ528k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1759225913; c=relaxed/simple; bh=7ZLaEJ0ebPuWyyKFGFYC43JjOCxEZxcSF09jQrdtGIA=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=tNBLwP28Al5Hj/nAzmhO4E1s7hlhTo4ViMczFCz4zotiRCluVJEtdCoKSZRw70eT6dmTwiml9zoQ3a5Al+wCcBYNuh+a46bld/oG6f13lblvEIa+bEwmb0P2Xunygk5sux4+WHkxmLvP6P5y01gePyOTwSCJFisy4FvKfpwcvkU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; dkim=pass (1024-bit key) header.d=huawei.com header.i=@huawei.com header.b=Z6m78rVT; arc=none smtp.client-ip=113.46.200.218 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=huawei.com header.i=@huawei.com header.b="Z6m78rVT" dkim-signature: v=1; a=rsa-sha256; d=huawei.com; s=dkim; c=relaxed/relaxed; q=dns/txt; h=From; bh=nADysAodi9/Cp0iwK3kIOTrsOn+wVW7FsZKmEos2uoM=; b=Z6m78rVTntq34mnVpYg0Z2ehH44nYuHrL81WqnQUmHlvp9AwsSkzpfKVO5ULkVxJdSnS/Dpt/ Mnzp8t4jtbktrIH96TlsF8kGK/BbiU21E/yf1WgpGbcs9HD+SGJekofLAg2b1DYPMQRfSTfO4CG PENyVHTzmBOBoczI+Y2RtCA= Received: from mail.maildlp.com (unknown [172.19.162.254]) by canpmsgout03.his.huawei.com (SkyGuard) with ESMTPS id 4cbYGt48lczpStY; Tue, 30 Sep 2025 17:51:18 +0800 (CST) Received: from dggemv706-chm.china.huawei.com (unknown [10.3.19.33]) by mail.maildlp.com (Postfix) with ESMTPS id 79BEA180464; Tue, 30 Sep 2025 17:51:47 +0800 (CST) Received: from kwepemq200017.china.huawei.com (7.202.195.228) by dggemv706-chm.china.huawei.com (10.3.19.33) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Tue, 30 Sep 2025 17:51:47 +0800 Received: from hulk-cxc.huawei.com (10.67.174.72) by kwepemq200017.china.huawei.com (7.202.195.228) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Tue, 30 Sep 2025 17:51:46 +0800 From: Cai Xinchen To: , , , CC: , , , Subject: [PATCH -next RFC 2/2] cpuset: Add spread_page interface to cpuset v2 Date: Tue, 30 Sep 2025 09:35:52 +0000 Message-ID: <20250930093552.2842885-3-caixinchen1@huawei.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20250930093552.2842885-1-caixinchen1@huawei.com> References: <20250930093552.2842885-1-caixinchen1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems100001.china.huawei.com (7.221.188.238) To kwepemq200017.china.huawei.com (7.202.195.228) Content-Type: text/plain; charset="utf-8" I encountered a scenario where a machine with 1.5TB of memory, while testing the Spark TPCDS 3TB dataset, experienced a significant concentration of page cache usage on one of the NUMA nodes. I discovered that the DataNode process had requested a large amount of page cache. most of the page cache was concentrated in one NUMA node, ultimately leading to the exhaustion of memory in that NUMA node. At this point, all other processes in that NUMA node have to alloc memory across NUMA nodes, or even across sockets. This eventually caused a degradation in the end-to-end performance of the Spark test. I do not want to restart the Spark DataNode service during business operations. This issue can be resolved by migrating the DataNode into a cpuset, dropping the cache, and setting cpuset.memory_spread_page to allow it to evenly request memory. The core business threads could still allocate local numa memory. After using cpuset.memory_spread_page, the performance in the tpcds-99 test is improved by 2%. The key point is that the even distribution of page cache within the DataNode process (rather than the current NUMA distribution) does not significantly affect end-to-end performance. However, the allocation of core business processes, such as Executors, to the same NUMA node does have a noticeable impact on end-to-end performance. However, I found that cgroup v2 does not provide this interface. I believe this interface still holds value in addressing issues caused by uneven distribution of page cache allocation among process groups. Thus I add cpuset.mems.spread_page to cpuset v2 interface. Signed-off-by: Cai Xinchen --- kernel/cgroup/cpuset.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 3a0d443180c6..04a24642f490 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -3461,6 +3461,32 @@ void cpuset_update_task_spread_flags(struct cpuset *= cs, task_clear_spread_page(tsk); } =20 +static int cpuset_update_spread_page(struct cgroup_subsys_state *css, stru= ct cftype *cft, + u64 val) +{ + struct cpuset *cs =3D css_cs(css); + int retval =3D 0; + + cpuset_full_lock(); + if (!is_cpuset_online(cs)) { + retval =3D -ENODEV; + goto out_unlock; + } + + retval =3D cpuset_update_flag(CS_SPREAD_PAGE, cs, val); + +out_unlock: + cpuset_full_unlock(); + return retval; +} + +static u64 cpuset_read_spread_page(struct cgroup_subsys_state *css, struct= cftype *cft) +{ + struct cpuset *cs =3D css_cs(css); + + return is_spread_page(cs); +} + /* * This is currently a minimal set for the default hierarchy. It can be * expanded later on by migrating more features and control files from v1. @@ -3535,6 +3561,13 @@ static struct cftype dfl_files[] =3D { .flags =3D CFTYPE_ONLY_ON_ROOT, }, =20 + { + .name =3D "mems.spread_page", + .read_u64 =3D cpuset_read_spread_page, + .write_u64 =3D cpuset_update_spread_page, + .private =3D FILE_SPREAD_PAGE, + }, + { } /* terminate */ }; =20 --=20 2.34.1