From nobody Sat Dec 27 03:14:25 2025
From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de,
	tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org,
	tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com,
	corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com,
	honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org,
	jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com,
	emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com
Subject: [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
Date: Sat, 23 Dec 2023 13:10:51 -0500
Message-Id: <20231223181101.1954-2-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com>
References: <20231223181101.1954-1-gregory.price@memverge.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit

From: Rakie Kim

This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/ [1]
  └── weighted_interleave [2]
      ├── node0 [3]
      └── node1

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for mempolicy subsystem

[2] weighted_interleave/: config interface for weighted interleave policy

[3] weighted_interleave/nodeN: weight for nodeN

If sysfs is disabled in the config, the global interleave weights
will default to "1" for all nodes.

Signed-off-by: Rakie Kim
Signed-off-by: Honggyu Kim
Co-developed-by: Gregory Price
Signed-off-by: Gregory Price
Co-developed-by: Hyeongtak Ji
Signed-off-by: Hyeongtak Ji
Suggested-by: Dan Williams
Suggested-by: Frank van der Linden
Suggested-by: Gregory Price
Suggested-by: Hao Wang
Suggested-by: Hasan Al Maruf
Suggested-by: Hyeongtak Ji
Suggested-by: Johannes Weiner
Suggested-by: John Groves
Suggested-by: Jonathan Cameron
Suggested-by: Michal Hocko
Suggested-by: Ravi Jonnalagadda
Suggested-by: Srinivasulu Thanneeru
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Ying Huang
Suggested-by: Zhongkun He
Suggested-by: tj
---
 .../ABI/testing/sysfs-kernel-mm-mempolicy     |   4 +
 ...fs-kernel-mm-mempolicy-weighted-interleave |  22 +++
 mm/mempolicy.c                                | 156 ++++++++++++++++++
 3 files changed, 182 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..2dcf24f4384a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,4 @@
+What:		/sys/kernel/mm/mempolicy/
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Interface for Mempolicy
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..aa27fdf08c19
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,22 @@
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Configuration Interface for the Weighted Interleave policy
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Weight configuration interface for nodeN
+
+		The interleave weight for a memory node (N). These weights are
+		utilized by processes which have set their mempolicy to
+		MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
+		omitting a task-local weight array.
+
+		These weights only affect new allocations, and changes at runtime
+		will not cause migrations on already allocated pages.
+
+		Writing an empty string resets the weight value to 1.
+
+		Minimum weight: 1
+		Maximum weight: 255
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..0e77633b07a5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,8 @@ static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+static char iw_table[MAX_NUMNODES];
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -3067,3 +3069,157 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+#ifdef CONFIG_SYSFS
+struct iw_node_attr {
+	struct kobj_attribute kobj_attr;
+	int nid;
+};
+
+static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
+			 char *buf)
+{
+	struct iw_node_attr *node_attr;
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+	return sysfs_emit(buf, "%d\n", iw_table[node_attr->nid]);
+}
+
+static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
+			  const char *buf, size_t count)
+{
+	struct iw_node_attr *node_attr;
+	unsigned char weight = 0;
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+	/* If no input, set default weight to 1 */
+	if (count == 0 || sysfs_streq(buf, ""))
+		weight = 1;
+	else if (kstrtou8(buf, 0, &weight) || !weight)
+		return -EINVAL;
+
+	iw_table[node_attr->nid] = weight;
+	return count;
+}
+
+static struct iw_node_attr *node_attrs[MAX_NUMNODES];
+
+static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
+				  struct kobject *parent)
+{
+	if (!node_attr)
+		return;
+	sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
+	kfree(node_attr->kobj_attr.attr.name);
+	kfree(node_attr);
+}
+
+static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
+{
+	int i;
+
+	for (i = 0; i < MAX_NUMNODES; i++)
+		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);
+	kobject_put(mempolicy_kobj);
+}
+
+static const struct kobj_type mempolicy_ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+	.release = sysfs_mempolicy_release,
+};
+
+static int add_weight_node(int nid, struct kobject *wi_kobj)
+{
+	struct iw_node_attr *node_attr;
+	char *name;
+
+	node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
+	if (!node_attr)
+		return -ENOMEM;
+
+	name = kasprintf(GFP_KERNEL, "node%d", nid);
+	if (!name) {
+		kfree(node_attr);
+		return -ENOMEM;
+	}
+
+	sysfs_attr_init(&node_attr->kobj_attr.attr);
+	node_attr->kobj_attr.attr.name = name;
+	node_attr->kobj_attr.attr.mode = 0644;
+	node_attr->kobj_attr.show = node_show;
+	node_attr->kobj_attr.store = node_store;
+	node_attr->nid = nid;
+
+	if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
+		kfree(node_attr->kobj_attr.attr.name);
+		kfree(node_attr);
+		pr_err("failed to add attribute to weighted_interleave\n");
+		return -ENOMEM;
+	}
+
+	node_attrs[nid] = node_attr;
+	return 0;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+	struct kobject *wi_kobj;
+	int nid, err;
+
+	wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+	if (!wi_kobj)
+		return -ENOMEM;
+
+	err = kobject_init_and_add(wi_kobj, &mempolicy_ktype, root_kobj,
+				   "weighted_interleave");
+	if (err) {
+		kfree(wi_kobj);
+		return err;
+	}
+
+	memset(node_attrs, 0, sizeof(node_attrs));
+	for_each_node_state(nid, N_POSSIBLE) {
+		err = add_weight_node(nid, wi_kobj);
+		if (err) {
+			pr_err("failed to add sysfs [node%d]\n", nid);
+			break;
+		}
+	}
+	if (err)
+		kobject_put(wi_kobj);
+	return 0;
+}
+
+static int __init mempolicy_sysfs_init(void)
+{
+	int err;
+	struct kobject *root_kobj;
+
+	memset(&iw_table, 1, sizeof(iw_table));
+
+	root_kobj = kobject_create_and_add("mempolicy", mm_kobj);
+	if (!root_kobj) {
+		pr_err("failed to add mempolicy kobject to the system\n");
+		return -ENOMEM;
+	}
+
+	err = add_weighted_interleave_group(root_kobj);
+
+	if (err)
+		kobject_put(root_kobj);
+	return err;
+
+}
+#else
+static int __init mempolicy_sysfs_init(void)
+{
+	/*
+	 * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
+	 * MPOL_INTERLEAVE behavior, but is still defined separately to
+	 * allow task-local weighted interleave to operate as intended.
+	 */
+	memset(&iw_table, 1, sizeof(iw_table));
+	return 0;
+}
+#endif /* CONFIG_SYSFS */
+late_initcall(mempolicy_sysfs_init);
-- 
2.39.1

From nobody Sat Dec 27 03:14:25 2025
From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de,
	tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org,
	tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com,
	corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com,
	honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org,
	jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com,
	emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com,
	Srinivasulu Thanneeru
Subject: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
Date: Sat, 23 Dec 2023 13:10:52 -0500
Message-Id: <20231223181101.1954-3-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com>
References: <20231223181101.1954-1-gregory.price@memverge.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="utf-8"

When a system has multiple NUMA nodes and it becomes bandwidth hungry,
the current MPOL_INTERLEAVE could be a wise option.

However, if those NUMA nodes consist of different types of memory such
as having local DRAM and CXL memory together, the current round-robin
based interleaving policy doesn't maximize the overall bandwidth because
of their different bandwidth characteristics.
Instead, the interleaving can be more efficient when the allocation
policy follows each NUMA node's bandwidth weight rather than having 1:1
round-robin allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
which enables weighted interleaving between NUMA nodes. Weighted
interleave allows for a proportional distribution of memory across
multiple NUMA nodes, preferably apportioned to match the bandwidth
capacity of each node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight matrix exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the
set weights. For example, if the weights are (2,1), then 2 pages
will be allocated on node0 for every 1 page allocated on node1.

The new mode MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

There are 3 integration points:

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node. When the weight reaches 0, switch
    to the next node. Applied by `mempolicy_slab_node()` and
    `policy_nodemask()`

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Applied by `policy_nodemask()` and `mpol_misplaced()`

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round"). Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight (pol->wil.cur_weight) will be allocated first, before
    the remaining bulk calculation is done.
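
As a rough illustration of the bulk calculation described above, here is a
userspace sketch (not the kernel code). The hypothetical helper bulk_split()
models how a request is divided into full interleave rounds plus a partial
round, after first draining any weight carried over on the current node. The
function name, its parameters, and the plain arrays standing in for the
nodemask and iw_table are all assumptions made for this example.

```c
#include <stddef.h>

/*
 * Hypothetical userspace model of the bulk split; not the kernel code.
 *
 * weights[i] is the weight of node i (0 <= i < nnodes).
 * start is the node the next page would be allocated from.
 * carry is the part of start's weight still owed from an earlier
 * allocation (0 means start begins with its full weight).
 * out[i] receives the number of pages assigned to node i.
 */
static void bulk_split(const unsigned char *weights, size_t nnodes,
		       size_t start, unsigned int carry,
		       unsigned long nr_pages, unsigned long *out)
{
	unsigned long total = 0, rounds, delta, n;
	size_t i, node = start;

	for (i = 0; i < nnodes; i++) {
		out[i] = 0;
		total += weights[i];
	}
	if (!total)
		return;

	/* Drain the carried-over weight on the current node first. */
	if (carry) {
		n = carry < nr_pages ? carry : nr_pages;
		out[node] += n;
		nr_pages -= n;
		node = (node + 1) % nnodes;
	}

	/* Whole rounds give every node its full weight. */
	rounds = nr_pages / total;
	/* The partial round hands out what is left, in node order. */
	delta = nr_pages % total;
	for (i = 0; i < nnodes; i++) {
		n = delta < weights[node] ? delta : weights[node];
		out[node] += weights[node] * rounds + n;
		delta -= n;
		node = (node + 1) % nnodes;
	}
}
```

With weights (2,1) and a 7-page request starting at node0, this model assigns
5 pages to node0 and 2 pages to node1, both when node0 starts with its full
weight and when only 1 page of its weight is carried over.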
    This simplifies the calculation at the cost of an additional
    allocation call.

One piece of complexity is the interaction between a recent refactor
which split the logic to acquire the "ilx" (interleave index) of an
allocation and the actual application of the interleave. The
calculation of the `interleave index` is done by `get_vma_policy()`,
while the actual selection of the node will be later applied by the
relevant weighted_interleave function.

If CONFIG_SYSFS is disabled, the weight table will be initialized to
set all nodes to weight 1, but the weighting code is still called.
This is so that task-local weights (future patch) can still be engaged
cleanly without ifdef spaghetti.

Suggested-by: Hasan Al Maruf
Signed-off-by: Gregory Price
Co-developed-by: Rakie Kim
Signed-off-by: Rakie Kim
Co-developed-by: Honggyu Kim
Signed-off-by: Honggyu Kim
Co-developed-by: Hyeongtak Ji
Signed-off-by: Hyeongtak Ji
Co-developed-by: Srinivasulu Thanneeru
Signed-off-by: Srinivasulu Thanneeru
Co-developed-by: Ravi Jonnalagadda
Signed-off-by: Ravi Jonnalagadda
Suggested-by: Dan Williams
Suggested-by: Frank van der Linden
Suggested-by: Gregory Price
Suggested-by: Hao Wang
Suggested-by: Hasan Al Maruf
Suggested-by: Hyeongtak Ji
Suggested-by: Johannes Weiner
Suggested-by: John Groves
Suggested-by: Jonathan Cameron
Suggested-by: Michal Hocko
Suggested-by: Ravi Jonnalagadda
Suggested-by: Srinivasulu Thanneeru
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Ying Huang
Suggested-by: Zhongkun He
Suggested-by: tj
---
 .../admin-guide/mm/numa_memory_policy.rst     |  11 +
 include/linux/mempolicy.h                     |   5 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 197 +++++++++++++++++-
 4 files changed, 211 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..d2c8e712785b 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,17 @@ MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes. This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
 
+MPOL_WEIGHTED_INTERLEAVE
+	This mode operates the same as MPOL_INTERLEAVE, except that
+	interleaving behavior is executed based on weights set in
+	/sys/kernel/mm/mempolicy/weighted_interleave/
+
+	Weighted interleave allocates pages on nodes according to
+	their weight. For example if nodes [0,1] are weighted [5,2]
+	respectively, 5 pages will be allocated on node0 for every
+	2 pages allocated on node1. This can better distribute data
+	according to bandwidth on heterogeneous memory systems.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..ba09167e80f7 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -54,6 +54,11 @@ struct mempolicy {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
 	} w;
+
+	/* Weighted interleave settings */
+	struct {
+		unsigned char cur_weight;
+	} wil;
 };
 
 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@ enum {
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
+	MPOL_WEIGHTED_INTERLEAVE,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e77633b07a5..0a180c670f0c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -305,6 +305,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->wil.cur_weight = 0;
 
 	return policy;
 }
@@ -417,6 +418,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
+	[MPOL_WEIGHTED_INTERLEAVE] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_nodemask,
+	},
 };
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -838,7 +843,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
 		current->il_prev = MAX_NUMNODES-1;
 	task_unlock(current);
 	mpol_put(old);
@@ -864,6 +870,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*nodes = pol->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -948,6 +955,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
 			*policy = next_node_in(current->il_prev, pol->nodes);
+		} else if (pol == current->mempolicy &&
+				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
+			if (pol->wil.cur_weight)
+				*policy = current->il_prev;
+			else
+				*policy = next_node_in(current->il_prev,
+						       pol->nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1777,7 +1791,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 	pol = __get_vma_policy(vma, addr, ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-	if (pol->mode == MPOL_INTERLEAVE) {
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
 		*ilx += vma->vm_pgoff >> order;
 		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
 	}
@@ -1827,6 +1842,24 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+	unsigned int next;
+	struct task_struct *me = current;
+
+	next = next_node_in(me->il_prev, policy->nodes);
+	if (next == MAX_NUMNODES)
+		return next;
+
+	if (!policy->wil.cur_weight)
+		policy->wil.cur_weight = iw_table[next];
+
+	policy->wil.cur_weight--;
+	if (!policy->wil.cur_weight)
+		me->il_prev = next;
+	return next;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
@@ -1861,6 +1894,9 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		return weighted_interleave_nodes(policy);
+
 	case MPOL_BIND:
 	case MPOL_PREFERRED_MANY:
 	{
@@ -1885,6 +1921,41 @@ unsigned int mempolicy_slab_node(void)
 	}
 }
 
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+	nodemask_t nodemask = pol->nodes;
+	unsigned int target, weight_total = 0;
+	int nid;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char weight;
+
+	barrier();
+
+	/* first ensure we have a valid nodemask */
+	nid = first_node(nodemask);
+	if (nid == MAX_NUMNODES)
+		return nid;
+
+	/* Then collect weights on stack and calculate totals */
+	for_each_node_mask(nid, nodemask) {
+		weight = iw_table[nid];
+		weight_total += weight;
+		weights[nid] = weight;
+	}
+
+	/* Finally, calculate the node offset based on totals */
+	target = (unsigned int)ilx % weight_total;
+	nid = first_node(nodemask);
+	while (target) {
+		weight = weights[nid];
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+	return nid;
+}
+
 /*
  * Do static interleaving for interleave index @ilx. Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1953,6 +2024,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
interleave_nodes(pol) : interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + *nid =3D (ilx =3D=3D NO_INTERLEAVE_INDEX) ? + weighted_interleave_nodes(pol) : + weighted_interleave_nid(pol, ilx); + break; } =20 return nodemask; @@ -2014,6 +2090,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: *mask =3D mempolicy->nodes; break; =20 @@ -2113,7 +2190,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int= order, * If the policy is interleave or does not allow the current * node in its nodemask, we allocate the standard way. */ - if (pol->mode !=3D MPOL_INTERLEAVE && + if ((pol->mode !=3D MPOL_INTERLEAVE && + pol->mode !=3D MPOL_WEIGHTED_INTERLEAVE) && (!nodemask || node_isset(nid, *nodemask))) { /* * First, try to allocate THP only on local node, but @@ -2249,6 +2327,106 @@ static unsigned long alloc_pages_bulk_array_interle= ave(gfp_t gfp, return total_allocated; } =20 +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, + struct mempolicy *pol, unsigned long nr_pages, + struct page **page_array) +{ + struct task_struct *me =3D current; + unsigned long total_allocated =3D 0; + unsigned long nr_allocated; + unsigned long rounds; + unsigned long node_pages, delta; + unsigned char weight; + unsigned char weights[MAX_NUMNODES]; + unsigned int weight_total =3D 0; + unsigned long rem_pages =3D nr_pages; + nodemask_t nodes =3D pol->nodes; + int nnodes, node, prev_node; + int i; + + /* Stabilize the nodemask on the stack */ + barrier(); + + nnodes =3D nodes_weight(nodes); + + /* Collect weights and save them on stack so they don't change */ + for_each_node_mask(node, nodes) { + weight =3D iw_table[node]; + weight_total +=3D weight; + weights[node] =3D weight; + } + + /* Continue allocating from most recent node and adjust the nr_pages */ + if (pol->wil.cur_weight) { + node =3D next_node_in(me->il_prev, nodes); + node_pages =3D 
pol->wil.cur_weight; + if (node_pages > rem_pages) + node_pages =3D rem_pages; + nr_allocated =3D __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array +=3D nr_allocated; + total_allocated +=3D nr_allocated; + /* if that's all the pages, no need to interleave */ + if (rem_pages <=3D pol->wil.cur_weight) { + pol->wil.cur_weight -=3D rem_pages; + return total_allocated; + } + /* Otherwise we adjust nr_pages down, and continue from there */ + rem_pages -=3D pol->wil.cur_weight; + pol->wil.cur_weight =3D 0; + prev_node =3D node; + } + + /* Now we can continue allocating as if from 0 instead of an offset */ + rounds =3D rem_pages / weight_total; + delta =3D rem_pages % weight_total; + for (i =3D 0; i < nnodes; i++) { + node =3D next_node_in(prev_node, nodes); + weight =3D weights[node]; + node_pages =3D weight * rounds; + if (delta) { + if (delta > weight) { + node_pages +=3D weight; + delta -=3D weight; + } else { + node_pages +=3D delta; + delta =3D 0; + } + } + /* We may not make it all the way around */ + if (!node_pages) + break; + /* If an over-allocation would occur, floor it */ + if (node_pages + total_allocated > nr_pages) { + node_pages =3D nr_pages - total_allocated; + delta =3D 0; + } + nr_allocated =3D __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array +=3D nr_allocated; + total_allocated +=3D nr_allocated; + prev_node =3D node; + } + + /* + * Finally, we need to update me->il_prev and pol->wil.cur_weight + * if there were overflow pages, but not equivalent to the node + * weight, set the cur_weight to node_weight - delta and the + * me->il_prev to the previous node. 
Otherwise if it was perfect + * we can simply set il_prev to node and cur_weight to 0 + */ + if (node_pages) { + me->il_prev =3D prev_node; + node_pages %=3D weight; + pol->wil.cur_weight =3D weight - node_pages; + } else { + me->il_prev =3D node; + pol->wil.cur_weight =3D 0; + } + + return total_allocated; +} + static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int = nid, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) @@ -2289,6 +2467,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t= gfp, return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages, page_array); =20 + if (pol->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE) + return alloc_pages_bulk_array_weighted_interleave(gfp, pol, + nr_pages, + page_array); + if (pol->mode =3D=3D MPOL_PREFERRED_MANY) return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array); @@ -2364,6 +2547,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempoli= cy *b) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: return !!nodes_equal(a->nodes, b->nodes); case MPOL_LOCAL: return true; @@ -2500,6 +2684,10 @@ int mpol_misplaced(struct folio *folio, struct vm_ar= ea_struct *vma, polnid =3D interleave_nid(pol, ilx); break; =20 + case MPOL_WEIGHTED_INTERLEAVE: + polnid =3D weighted_interleave_nid(pol, ilx); + break; + case MPOL_PREFERRED: if (node_isset(curnid, pol->nodes)) goto out; @@ -2874,6 +3062,7 @@ static const char * const policy_modes[] =3D [MPOL_PREFERRED] =3D "prefer", [MPOL_BIND] =3D "bind", [MPOL_INTERLEAVE] =3D "interleave", + [MPOL_WEIGHTED_INTERLEAVE] =3D "weighted interleave", [MPOL_LOCAL] =3D "local", [MPOL_PREFERRED_MANY] =3D "prefer (many)", }; @@ -2933,6 +3122,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) } break; case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: /* * Default to online nodes with memory if no nodelist */ @@ -3043,6 +3233,7 @@ void mpol_to_str(char 
*buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		nodes = pol->nodes;
 		break;
 	default:
-- 
2.39.1

From nobody Sat Dec 27 03:14:25 2025
From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de,
	tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org,
	tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com,
	corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com,
	honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org,
	jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com,
	emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com
Subject: [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
Date: Sat, 23 Dec 2023 13:10:53 -0500
Message-Id: <20231223181101.1954-4-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com>
References: <20231223181101.1954-1-gregory.price@memverge.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Split sanitize_mpol_flags() into a sanitize step and a validate step.

sanitize_mpol_flags() is used by mbind and set_mempolicy to split the
combined (int mode) argument into mode and mode_flags, and then
validates them by calling validate_mpol_flags().

validate_mpol_flags() checks a mode and mode_flags that have already
been split. It will be reused by new syscalls that accept mode and
mode_flags as separate arguments.

Signed-off-by: Gregory Price
Suggested-by: Dan Williams
Suggested-by: Frank van der Linden
Suggested-by: Gregory Price
Suggested-by: Hao Wang
Suggested-by: Hasan Al Maruf
Suggested-by: Hyeongtak Ji
Suggested-by: Johannes Weiner
Suggested-by: John Groves
Suggested-by: Jonathan Cameron
Suggested-by: Michal Hocko
Suggested-by: Ravi Jonnalagadda
Suggested-by: Srinivasulu Thanneeru
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Ying Huang
Suggested-by: Zhongkun He
Suggested-by: tj
---
 mm/mempolicy.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0a180c670f0c..59ac0da24f56 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1463,24 +1463,39 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
 	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
 }
 
-/* Basic parameter sanity check used by both mbind() and set_mempolicy() */
-static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
+/*
+ * Basic parameter sanity check used by mbind/set_mempolicy
+ * May modify flags to include internal flags (e.g. MPOL_F_MOF/F_MORON)
+ */
+static inline int validate_mpol_flags(unsigned short mode, unsigned short *flags)
 {
-	*flags = *mode & MPOL_MODE_FLAGS;
-	*mode &= ~MPOL_MODE_FLAGS;
-
-	if ((unsigned int)(*mode) >= MPOL_MAX)
+	if ((unsigned int)(mode) >= MPOL_MAX)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
 	if (*flags & MPOL_F_NUMA_BALANCING) {
-		if (*mode != MPOL_BIND)
+		if (mode != MPOL_BIND)
 			return -EINVAL;
 		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
 	}
 	return 0;
 }
 
+/*
+ * Used by mbind/set_mempolicy to split and validate mode/flags
+ * set_mempolicy combines (mode | flags), split them out into separate
+ * fields and return just the mode in mode_arg and flags in flags.
+ */
+static inline int sanitize_mpol_flags(int *mode_arg, unsigned short *flags)
+{
+	unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS);
+
+	*flags = *mode_arg & MPOL_MODE_FLAGS;
+	*mode_arg = mode;
+
+	return validate_mpol_flags(mode, flags);
+}
+
 static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
-- 
2.39.1

From nobody Sat Dec 27 03:14:25 2025
From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de,
	tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org,
	tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com,
	corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com,
	honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org,
	jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com,
	emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com
Subject: [PATCH v5 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies
Date: Sat, 23 Dec 2023 13:10:54 -0500
Message-Id: <20231223181101.1954-5-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com>
References: <20231223181101.1954-1-gregory.price@memverge.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

This patch adds a new kernel structure `struct mempolicy_args`,
intended to be used for an extensible get/set_mempolicy interface.

This implements the fields required to support the existing syscall
interfaces, but does not expose any user-facing arg structure.

mpol_new is refactored to take the argument structure so that future
mempolicy extensions can all be managed in the mempolicy constructor.
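
As a rough illustration of the constructor pattern described above, the
following userspace sketch mimics a policy constructor that takes a single
argument structure. It is not kernel code: `mock_nodemask_t` is a
hypothetical 64-bit stand-in for nodemask_t, `mock_mpol_new()`,
`make_interleave()` and `struct mock_mempolicy` are invented for this
example, and the mode numbers are local to the sketch. Only the four field
names follow the structure this patch introduces.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's nodemask_t (one bit per node). */
typedef uint64_t mock_nodemask_t;

/* Mode numbers local to this sketch, not the kernel's UAPI values. */
enum { MOCK_MPOL_DEFAULT, MOCK_MPOL_INTERLEAVE, MOCK_MPOL_MAX };

/* Field names mirror struct mempolicy_args from this patch. */
struct mempolicy_args {
	unsigned short mode;
	unsigned short mode_flags;
	int home_node;
	mock_nodemask_t *policy_nodes;
};

/* Hypothetical policy object produced by the constructor. */
struct mock_mempolicy {
	unsigned short mode;
	unsigned short flags;
	mock_nodemask_t nodes;
};

/*
 * Constructor taking one argument structure. Returns NULL on invalid
 * input or allocation failure. A new policy attribute would become a
 * new field in the args struct, leaving this signature unchanged.
 */
static struct mock_mempolicy *mock_mpol_new(const struct mempolicy_args *args)
{
	struct mock_mempolicy *pol;

	if (args->mode >= MOCK_MPOL_MAX)
		return NULL;
	/* Interleave requires a non-empty nodemask in this sketch. */
	if (args->mode == MOCK_MPOL_INTERLEAVE &&
	    (!args->policy_nodes || *args->policy_nodes == 0))
		return NULL;

	pol = calloc(1, sizeof(*pol));
	if (!pol)
		return NULL;

	pol->mode = args->mode;
	pol->flags = args->mode_flags;
	if (args->policy_nodes)
		pol->nodes = *args->policy_nodes;
	return pol;
}

/* Caller pattern: zero the structure, then set only the fields in use. */
static struct mock_mempolicy *make_interleave(mock_nodemask_t *nodes)
{
	struct mempolicy_args args;

	memset(&args, 0, sizeof(args));
	args.mode = MOCK_MPOL_INTERLEAVE;
	args.policy_nodes = nodes;
	return mock_mpol_new(&args);
}
```

Zeroing the structure before filling it means a field added later starts
from a known value in every existing caller, which is presumably why the
callers converted by this patch follow the same memset-then-assign shape.
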
The set_mempolicy and mbind syscalls are refactored to utilize the new
argument structure, as are all the callers of mpol_new() and
do_set_mempolicy.

Signed-off-by: Gregory Price
Suggested-by: Dan Williams
Suggested-by: Frank van der Linden
Suggested-by: Gregory Price
Suggested-by: Hao Wang
Suggested-by: Hasan Al Maruf
Suggested-by: Hyeongtak Ji
Suggested-by: Johannes Weiner
Suggested-by: John Groves
Suggested-by: Jonathan Cameron
Suggested-by: Michal Hocko
Suggested-by: Ravi Jonnalagadda
Suggested-by: Srinivasulu Thanneeru
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Ying Huang
Suggested-by: Zhongkun He
Suggested-by: tj
---
 include/linux/mempolicy.h | 11 +++++++
 mm/mempolicy.c            | 69 +++++++++++++++++++++++++++++----------
 2 files changed, 62 insertions(+), 18 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ba09167e80f7..0f1c85527626 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -61,6 +61,17 @@ struct mempolicy {
 	} wil;
 };
 
+/*
+ * Describes settings of a mempolicy during set/get syscalls and
+ * kernel internal calls to do_set_mempolicy()
+ */
+struct mempolicy_args {
+	unsigned short mode;		/* policy mode */
+	unsigned short mode_flags;	/* policy mode flags */
+	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
+	nodemask_t *policy_nodes;	/* get/set/mbind */
+};
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 59ac0da24f56..42037b7ff6d6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -265,10 +265,12 @@ static int mpol_set_nodemask(struct mempolicy *pol,
  * This function just creates a new policy, does some check and simple
  * initialization. You must invoke mpol_set_nodemask() to set nodes.
  */
-static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
-				  nodemask_t *nodes)
+static struct mempolicy *mpol_new(struct mempolicy_args *args)
 {
 	struct mempolicy *policy;
+	unsigned short mode = args->mode;
+	unsigned short flags = args->mode_flags;
+	nodemask_t *nodes = args->policy_nodes;
 
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
@@ -817,8 +819,7 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma,
 }
 
 /* Set the process memory policy */
-static long do_set_mempolicy(unsigned short mode, unsigned short flags,
-			     nodemask_t *nodes)
+static long do_set_mempolicy(struct mempolicy_args *args)
 {
 	struct mempolicy *new, *old;
 	NODEMASK_SCRATCH(scratch);
@@ -827,14 +828,14 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	if (!scratch)
 		return -ENOMEM;
 
-	new = mpol_new(mode, flags, nodes);
+	new = mpol_new(args);
 	if (IS_ERR(new)) {
 		ret = PTR_ERR(new);
 		goto out;
 	}
 
 	task_lock(current);
-	ret = mpol_set_nodemask(new, nodes, scratch);
+	ret = mpol_set_nodemask(new, args->policy_nodes, scratch);
 	if (ret) {
 		task_unlock(current);
 		mpol_put(new);
@@ -1232,8 +1233,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 #endif
 
 static long do_mbind(unsigned long start, unsigned long len,
-		     unsigned short mode, unsigned short mode_flags,
-		     nodemask_t *nmask, unsigned long flags)
+		     struct mempolicy_args *margs, unsigned long flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1253,7 +1253,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (margs->mode == MPOL_DEFAULT)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = PAGE_ALIGN(len);
@@ -1264,7 +1264,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (end == start)
 		return 0;
 
-	new = mpol_new(mode, mode_flags, nmask);
+	new = mpol_new(margs);
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
@@ -1281,7 +1281,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		NODEMASK_SCRATCH(scratch);
 		if (scratch) {
 			mmap_write_lock(mm);
-			err = mpol_set_nodemask(new, nmask, scratch);
+			err = mpol_set_nodemask(new, margs->policy_nodes,
+						scratch);
 			if (err)
 				mmap_write_unlock(mm);
 		} else
@@ -1295,7 +1296,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	 * Lock the VMAs before scanning for pages to migrate,
 	 * to ensure we don't miss a concurrently inserted page.
 	 */
-	nr_failed = queue_pages_range(mm, start, end, nmask,
+	nr_failed = queue_pages_range(mm, start, end, margs->policy_nodes,
 			flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
 
 	if (nr_failed < 0) {
@@ -1500,6 +1501,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
 {
+	struct mempolicy_args margs;
 	unsigned short mode_flags;
 	nodemask_t nodes;
 	int lmode = mode;
@@ -1514,7 +1516,12 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	if (err)
 		return err;
 
-	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
+	memset(&margs, 0, sizeof(margs));
+	margs.mode = lmode;
+	margs.mode_flags = mode_flags;
+	margs.policy_nodes = &nodes;
+
+	return do_mbind(start, len, &margs, flags);
 }
 
 SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
@@ -1595,6 +1602,7 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 				 unsigned long maxnode)
 {
+	struct mempolicy_args args;
 	unsigned short mode_flags;
 	nodemask_t nodes;
 	int lmode = mode;
@@ -1608,7 +1616,12 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	if (err)
 		return err;
 
-	return do_set_mempolicy(lmode, mode_flags, &nodes);
+	memset(&args, 0, sizeof(args));
+	args.mode = lmode;
+	args.mode_flags = mode_flags;
+	args.policy_nodes = &nodes;
+
+	return do_set_mempolicy(&args);
 }
 
 SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
@@ -2890,6 +2903,7 @@ static int shared_policy_replace(struct shared_policy *sp, pgoff_t start,
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 {
 	int ret;
+	struct mempolicy_args margs;
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
 	rwlock_init(&sp->lock);
@@ -2902,8 +2916,12 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 		if (!scratch)
 			goto put_mpol;
 
+		memset(&margs, 0, sizeof(margs));
+		margs.mode = mpol->mode;
+		margs.mode_flags = mpol->flags;
+		margs.policy_nodes = &mpol->w.user_nodemask;
 		/* contextualize the tmpfs mount point mempolicy to this file */
-		npol = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
+		npol = mpol_new(&margs);
 		if (IS_ERR(npol))
 			goto free_scratch;	/* no valid nodemask intersection */
 
@@ -3011,6 +3029,7 @@ static inline void __init check_numabalancing_enable(void)
 
 void __init numa_policy_init(void)
 {
+	struct mempolicy_args args;
 	nodemask_t interleave_nodes;
 	unsigned long largest = 0;
 	int nid, prefer = 0;
@@ -3056,7 +3075,11 @@ void __init numa_policy_init(void)
 	if (unlikely(nodes_empty(interleave_nodes)))
 		node_set(prefer, interleave_nodes);
 
-	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
+	memset(&args, 0, sizeof(args));
+	args.mode = MPOL_INTERLEAVE;
+	args.policy_nodes = &interleave_nodes;
+
+	if (do_set_mempolicy(&args))
 		pr_err("%s: interleaving failed\n", __func__);
 
 	check_numabalancing_enable();
@@ -3065,7 +3088,12 @@ void __init numa_policy_init(void)
 /* Reset policy of current process to default */
 void numa_default_policy(void)
 {
-	do_set_mempolicy(MPOL_DEFAULT, 0, NULL);
+	struct mempolicy_args args;
+
+	memset(&args, 0, sizeof(args));
+	args.mode = MPOL_DEFAULT;
+
+	do_set_mempolicy(&args);
 }
 
 /*
@@ -3095,6 +3123,7 @@ static const char * const policy_modes[] =
  */
 int mpol_parse_str(char *str, struct mempolicy **mpol)
 {
+	struct mempolicy_args margs;
 	struct mempolicy *new = NULL;
 	unsigned short mode_flags;
 	nodemask_t nodes;
@@ -3181,7 +3210,11 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 			goto out;
 	}
 
-	new = mpol_new(mode, mode_flags, &nodes);
+	memset(&margs, 0, sizeof(margs));
+	margs.mode = mode;
+	margs.mode_flags = mode_flags;
+	margs.policy_nodes = &nodes;
+	new = mpol_new(&margs);
 	if (IS_ERR(new))
 		goto out;
 
-- 
2.39.1

From nobody Sat Dec 27 03:14:25 2025
From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de,
	tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org,
	tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com,
	corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com,
	honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org,
	jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com,
	emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com
Subject: [PATCH v5 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use
Date: Sat, 23 Dec 2023 13:10:55 -0500
Message-Id: <20231223181101.1954-6-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com>
References: <20231223181101.1954-1-gregory.price@memverge.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Pull the operation flag checking out of do_get_mempolicy and into
kernel_get_mempolicy. This allows us to flatten the internal code and
break it into separate functions that future syscalls (get_mempolicy2,
process_get_mempolicy) can re-use, even after additional extensions
are made.

The primary change is that the flag is treated as the multiplexer that
it actually is.
For get_mempolicy, the flags represents 3 different primary operations: if (flags & MPOL_F_MEMS_ALLOWED) return task->mems_allowed else if (flags & MPOL_F_ADDR) return vma mempolicy information else return task mempolicy information Plus the behavior modifying flag: if (flags & MPOL_F_NODE) change the return value of (int __user *policy) based on whether MPOL_F_ADDR was set. The original behavior of get_mempolicy is retained, but we utilize the new mempolicy_args structure to pass the operations down the stack. This will allow us to extend the internal functions without affecting the legacy behavior of get_mempolicy. Signed-off-by: Gregory Price Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Hyeongtak Ji Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Michal Hocko Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Vinicius Tavares Petrucci Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- mm/mempolicy.c | 244 +++++++++++++++++++++++++++++++------------------ 1 file changed, 154 insertions(+), 90 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 42037b7ff6d6..da84dc33a645 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -895,106 +895,109 @@ static int lookup_node(struct mm_struct *mm, unsign= ed long addr) return ret; } =20 -/* Retrieve NUMA policy */ -static long do_get_mempolicy(int *policy, nodemask_t *nmask, - unsigned long addr, unsigned long flags) +/* Retrieve the mems_allowed for current task */ +static inline long do_get_mems_allowed(nodemask_t *nmask) { - int err; - struct mm_struct *mm =3D current->mm; - struct vm_area_struct *vma =3D NULL; - struct mempolicy *pol =3D current->mempolicy, *pol_refcount =3D NULL; + task_lock(current); + *nmask =3D cpuset_current_mems_allowed; + task_unlock(current); + return 0; +} =20 - if (flags 
& - ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED)) - return -EINVAL; +/* If the policy has additional node information to retrieve, return it */ +static long do_get_policy_node(struct mempolicy *pol) +{ + /* + * For MPOL_INTERLEAVE, the extended node information is the next + * node that will be selected for interleave. For weighted interleave + * we return the next node based on the current weight. + */ + if (pol =3D=3D current->mempolicy && pol->mode =3D=3D MPOL_INTERLEAVE) + return next_node_in(current->il_prev, pol->nodes); =20 - if (flags & MPOL_F_MEMS_ALLOWED) { - if (flags & (MPOL_F_NODE|MPOL_F_ADDR)) - return -EINVAL; - *policy =3D 0; /* just so it's initialized */ + if (pol =3D=3D current->mempolicy && + pol->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE) { + if (pol->wil.cur_weight) + return current->il_prev; + else + return next_node_in(current->il_prev, pol->nodes); + } + return -EINVAL; +} + +/* Handle user_nodemask condition when fetching nodemask for userspace */ +static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *n= mask) +{ + if (mpol_store_user_nodemask(pol)) { + *nmask =3D pol->w.user_nodemask; + } else { task_lock(current); - *nmask =3D cpuset_current_mems_allowed; + get_policy_nodemask(pol, nmask); task_unlock(current); - return 0; } +} =20 - if (flags & MPOL_F_ADDR) { - pgoff_t ilx; /* ignored here */ - /* - * Do NOT fall back to task policy if the - * vma/shared policy at addr is NULL. We - * want to return MPOL_DEFAULT in this case. 
- */ - mmap_read_lock(mm); - vma =3D vma_lookup(mm, addr); - if (!vma) { - mmap_read_unlock(mm); - return -EFAULT; - } - pol =3D __get_vma_policy(vma, addr, &ilx); - } else if (addr) - return -EINVAL; +/* Retrieve NUMA policy for a VMA assocated with a given address */ +static long do_get_vma_mempolicy(unsigned long addr, int *addr_node, + struct mempolicy_args *args) +{ + pgoff_t ilx; + struct mm_struct *mm =3D current->mm; + struct vm_area_struct *vma =3D NULL; + struct mempolicy *pol =3D NULL; =20 + mmap_read_lock(mm); + vma =3D vma_lookup(mm, addr); + if (!vma) { + mmap_read_unlock(mm); + return -EFAULT; + } + pol =3D __get_vma_policy(vma, addr, &ilx); if (!pol) - pol =3D &default_policy; /* indicates default behavior */ + pol =3D &default_policy; + else + mpol_get(pol); + mmap_read_unlock(mm); =20 - if (flags & MPOL_F_NODE) { - if (flags & MPOL_F_ADDR) { - /* - * Take a refcount on the mpol, because we are about to - * drop the mmap_lock, after which only "pol" remains - * valid, "vma" is stale. - */ - pol_refcount =3D pol; - vma =3D NULL; - mpol_get(pol); - mmap_read_unlock(mm); - err =3D lookup_node(mm, addr); - if (err < 0) - goto out; - *policy =3D err; - } else if (pol =3D=3D current->mempolicy && - pol->mode =3D=3D MPOL_INTERLEAVE) { - *policy =3D next_node_in(current->il_prev, pol->nodes); - } else if (pol =3D=3D current->mempolicy && - (pol->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE)) { - if (pol->wil.cur_weight) - *policy =3D current->il_prev; - else - *policy =3D next_node_in(current->il_prev, - pol->nodes); - } else { - err =3D -EINVAL; - goto out; - } - } else { - *policy =3D pol =3D=3D &default_policy ? MPOL_DEFAULT : - pol->mode; - /* - * Internal mempolicy flags must be masked off before exposing - * the policy to userspace. 
- */ - *policy |=3D (pol->flags & MPOL_MODE_FLAGS); - } + /* Fetch the node for the given address */ + if (addr_node) + *addr_node =3D lookup_node(mm, addr); =20 - err =3D 0; - if (nmask) { - if (mpol_store_user_nodemask(pol)) { - *nmask =3D pol->w.user_nodemask; - } else { - task_lock(current); - get_policy_nodemask(pol, nmask); - task_unlock(current); - } + args->mode =3D pol =3D=3D &default_policy ? MPOL_DEFAULT : pol->mode; + args->mode_flags =3D (pol->flags & MPOL_MODE_FLAGS); + args->home_node =3D pol->home_node; + + if (args->policy_nodes) + do_get_mempolicy_nodemask(pol, args->policy_nodes); + + if (pol !=3D &default_policy) { + mpol_put(pol); + mpol_cond_put(pol); } =20 - out: - mpol_cond_put(pol); - if (vma) - mmap_read_unlock(mm); - if (pol_refcount) - mpol_put(pol_refcount); - return err; + return 0; +} + +/* Retrieve NUMA policy for the current task */ +static long do_get_task_mempolicy(struct mempolicy_args *args, int *pol_no= de) +{ + struct mempolicy *pol =3D current->mempolicy; + + if (!pol) + pol =3D &default_policy; /* indicates default behavior */ + + args->mode =3D pol =3D=3D &default_policy ? 
MPOL_DEFAULT : pol->mode; + /* Internal flags must be masked off before exposing to userspace */ + args->mode_flags =3D (pol->flags & MPOL_MODE_FLAGS); + args->home_node =3D NUMA_NO_NODE; + + if (pol_node) + *pol_node =3D do_get_policy_node(pol); + + if (args->policy_nodes) + do_get_mempolicy_nodemask(pol, args->policy_nodes); + + return 0; } =20 #ifdef CONFIG_MIGRATION @@ -1731,16 +1734,77 @@ static int kernel_get_mempolicy(int __user *policy, unsigned long addr, unsigned long flags) { + struct mempolicy_args args; int err; - int pval; + int address_node =3D NUMA_NO_NODE; + int pval =3D 0; + int pol_node =3D 0; nodemask_t nodes; =20 if (nmask !=3D NULL && maxnode < nr_node_ids) return -EINVAL; =20 - addr =3D untagged_addr(addr); + if (flags & + ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED)) + return -EINVAL; =20 - err =3D do_get_mempolicy(&pval, &nodes, addr, flags); + /* Ensure any data that may be copied to userland is initialized */ + memset(&args, 0, sizeof(args)); + args.policy_nodes =3D &nodes; + + /* + * set_mempolicy was originally multiplexed based on 3 flags: + * MPOL_F_MEMS_ALLOWED: fetch task->mems_allowed + * MPOL_F_ADDR : operate on vma->mempolicy + * MPOL_F_NODE : change return value of *policy + * + * Split this behavior out here, rather than internal functions, + * so that the internal functions can be re-used by future + * get_mempolicy2 interfaces and the arg structure made extensible + */ + if (flags & MPOL_F_MEMS_ALLOWED) { + if (flags & (MPOL_F_NODE|MPOL_F_ADDR)) + return -EINVAL; + pval =3D 0; /* just so it's initialized */ + err =3D do_get_mems_allowed(&nodes); + } else if (flags & MPOL_F_ADDR) { + /* If F_ADDR, we operation on a vma policy (or default) */ + err =3D do_get_vma_mempolicy(untagged_addr(addr), + &address_node, &args); + if (err) + return err; + /* if (F_ADDR | F_NODE), *pval is the address' node */ + if (flags & MPOL_F_NODE) { + /* if we failed to fetch, that's likely an EFAULT */ + if (address_node < 0) + 
return address_node; + pval =3D address_node; + } else + pval =3D args.mode | args.mode_flags; + } else { + /* if not F_ADDR and addr !=3D null, EINVAL */ + if (addr) + return -EINVAL; + + err =3D do_get_task_mempolicy(&args, &pol_node); + if (err) + return err; + /* + * if F_NODE was set and mode was MPOL_INTERLEAVE + * *pval is equal to next interleave node. + * + * if pol_node < 0, this means the mode did not have a + * a compatible policy. This presently emulates the + * original behavior of (F_NODE) & (!MPOL_INTERLEAVE) + * producing -EINVAL + */ + if (flags & MPOL_F_NODE) { + if (pol_node < 0) + return pol_node; + pval =3D pol_node; + } else + pval =3D args.mode | args.mode_flags; + } =20 if (err) return err; --=20 2.39.1 From nobody Sat Dec 27 03:14:25 2025 Received: from mail-io1-f65.google.com (mail-io1-f65.google.com [209.85.166.65]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 25E5F224CD; Sat, 23 Dec 2023 18:11:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="UhdtfXD0" Received: by mail-io1-f65.google.com with SMTP id ca18e2360f4ac-7ba7c845e1aso138922739f.2; Sat, 23 Dec 2023 10:11:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1703355101; x=1703959901; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=BxtEsEea8M2tXpOIXXydNAgz6njKrExc/jYMN+DERr8=; b=UhdtfXD0jAMMk6t9viMWp3sX2/dziWeXszsa2T/JJRdVfJDJtAC/WNU5kVMkxkgAnV bSBjtBOwgNyIq5qoiwgiwowJYwNs3fVKQ8mYFs0OnkcetUMBNVqWM/4+Sk5r6eA8TxSJ 
From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com Subject: [PATCH v5 06/11] mm/mempolicy: allow home_node to be set by mpol_new Date: Sat, 23 Dec 2023 13:10:56 -0500 Message-Id: <20231223181101.1954-7-gregory.price@memverge.com> In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com> References: <20231223181101.1954-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This patch adds the plumbing into mpol_new() to allow the argument structure's home_node field to be set during mempolicy creation. The syscall sys_set_mempolicy_home_node was added to allow a home node to be registered for a vma. For set_mempolicy2 and mbind2 syscalls, it would be useful to add this as an extension to allow the user to submit a fully formed mempolicy configuration in a single call, rather than require multiple calls to configure a mempolicy.
This will become particularly useful if/when pidfd interfaces to change process mempolicies from outside the task appear, as each call to change the mempolicy does an atomic swap of that policy in the task, rather than mutate the policy. Signed-off-by: Gregory Price Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Hyeongtak Ji Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Michal Hocko Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Vinicius Tavares Petrucci Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- mm/mempolicy.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index da84dc33a645..35a0f8630ead 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -306,7 +306,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args= *args) atomic_set(&policy->refcnt, 1); policy->mode =3D mode; policy->flags =3D flags; - policy->home_node =3D NUMA_NO_NODE; + policy->home_node =3D args->home_node; policy->wil.cur_weight =3D 0; =20 return policy; @@ -1623,6 +1623,7 @@ static long kernel_set_mempolicy(int mode, const unsi= gned long __user *nmask, args.mode =3D lmode; args.mode_flags =3D mode_flags; args.policy_nodes =3D &nodes; + args.home_node =3D NUMA_NO_NODE; =20 return do_set_mempolicy(&args); } @@ -2984,6 +2985,8 @@ void mpol_shared_policy_init(struct shared_policy *sp= , struct mempolicy *mpol) margs.mode =3D mpol->mode; margs.mode_flags =3D mpol->flags; margs.policy_nodes =3D &mpol->w.user_nodemask; + margs.home_node =3D NUMA_NO_NODE; + /* contextualize the tmpfs mount point mempolicy to this file */ npol =3D mpol_new(&margs); if (IS_ERR(npol)) @@ -3142,6 +3145,7 @@ void __init numa_policy_init(void) memset(&args, 0, sizeof(args)); args.mode =3D MPOL_INTERLEAVE; args.policy_nodes =3D 
&interleave_nodes; + args.home_node =3D NUMA_NO_NODE; =20 if (do_set_mempolicy(&args)) pr_err("%s: interleaving failed\n", __func__); @@ -3156,6 +3160,7 @@ void numa_default_policy(void) =20 memset(&args, 0, sizeof(args)); args.mode =3D MPOL_DEFAULT; + args.home_node =3D NUMA_NO_NODE; =20 do_set_mempolicy(&args); } @@ -3278,6 +3283,8 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) margs.mode =3D mode; margs.mode_flags =3D mode_flags; margs.policy_nodes =3D &nodes; + margs.home_node =3D NUMA_NO_NODE; + new =3D mpol_new(&margs); if (IS_ERR(new)) goto out; --=20 2.39.1 From nobody Sat Dec 27 03:14:25 2025
From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, Frank van der Linden Subject: [PATCH v5 07/11] mm/mempolicy: add userland mempolicy arg structure Date: Sat, 23 Dec 2023 13:10:57 -0500 Message-Id: <20231223181101.1954-8-gregory.price@memverge.com> In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com> References: <20231223181101.1954-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This patch adds the new user-api argument structure intended for set_mempolicy2 and mbind2. struct mpol_args { __u16 mode; __u16 mode_flags; __s32 home_node; /* mbind2: policy home node */ __u64 pol_maxnodes; __aligned_u64 pol_nodes; }; This structure is intended to be extensible as new mempolicy extensions are added. For example, set_mempolicy_home_node was added to allow vma mempolicies to have a preferred/home node assigned.
This structure allows the home node to be set at the time the mempolicy is set, rather than requiring an additional syscall. Full breakdown of arguments as of this patch: mode: Mempolicy mode (e.g. MPOL_DEFAULT, MPOL_INTERLEAVE) mode_flags: Flags previously or'd into mode in set_mempolicy (e.g.: MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES) home_node: for mbind2. Allows a policy's home node to be set with the use of MPOL_MF_HOME_NODE pol_maxnodes: Max number of nodes in the policy nodemask pol_nodes: Policy nodemask Suggested-by: Frank van der Linden Suggested-by: Vinicius Tavares Petrucci Suggested-by: Hasan Al Maruf Signed-off-by: Gregory Price Co-developed-by: Vinicius Tavares Petrucci Signed-off-by: Vinicius Tavares Petrucci Suggested-by: Dan Williams Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Hyeongtak Ji Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Michal Hocko Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- .../admin-guide/mm/numa_memory_policy.rst | 17 +++++++++++++++++ include/linux/syscalls.h | 1 + include/uapi/linux/mempolicy.h | 8 ++++++++ 3 files changed, 26 insertions(+) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Document= ation/admin-guide/mm/numa_memory_policy.rst index d2c8e712785b..5ee047b0d981 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -482,6 +482,23 @@ closest to which page allocation will come from. Speci= fying the home node overri the default allocation policy to allocate memory close to the local node f= or an executing CPU.
=20 +Extended Mempolicy Arguments:: + + struct mpol_args { + __u16 mode; + __u16 mode_flags; + __s32 home_node; /* mbind2: set home node */ + __u64 pol_maxnodes; + __aligned_u64 pol_nodes; /* nodemask pointer */ + }; + +The extended mempolicy argument structure is defined to allow the mempolicy +interfaces future extensibility without the need for additional system cal= ls. + +The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to +all interfaces relative to their non-extended counterparts. Each additional +field may only apply to specific extended interfaces. See the respective +extended interface man page for more details. =20 Memory Policy Command Line Interface =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index fd9d12de7e92..a52395ca3f00 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -74,6 +74,7 @@ struct landlock_ruleset_attr; enum landlock_rule_type; struct cachestat_range; struct cachestat; +struct mpol_args; =20 #include #include diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 1f9bb10d1a47..4dd2d2e0d2ed 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -27,6 +27,14 @@ enum { MPOL_MAX, /* always last member of enum */ }; =20 +struct mpol_args { + __u16 mode; + __u16 mode_flags; + __s32 home_node; /* mbind2: policy home node */ + __u64 pol_maxnodes; + __aligned_u64 pol_nodes; +}; + /* Flags for set_mempolicy */ #define MPOL_F_STATIC_NODES (1 << 15) #define MPOL_F_RELATIVE_NODES (1 << 14) --=20 2.39.1 From nobody Sat Dec 27 03:14:25 2025
From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, Michal Hocko Subject: [PATCH v5 08/11] mm/mempolicy: add set_mempolicy2 syscall Date: Sat, 23 Dec 2023 13:10:58 -0500 Message-Id: <20231223181101.1954-9-gregory.price@memverge.com> In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com> References: <20231223181101.1954-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" set_mempolicy2 is an extensible set_mempolicy interface which allows a user to set the per-task memory policy.
Defined as: set_mempolicy2(struct mpol_args *args, size_t usize, unsigned long flags); Relevant mpol_args fields include the following: mode: The MPOL_* policy (DEFAULT, INTERLEAVE, etc.) mode_flags: The MPOL_F_* flags that were previously or'd into the mode. These were split out to leave additional mode/flag space for future extensions. home_node: ignored (see note below) pol_nodes: The nodemask to apply for the memory policy pol_maxnodes: The max number of nodes described by pol_nodes The usize arg is intended for the user to pass in sizeof(struct mpol_args) to allow forward/backward compatibility whenever possible. The flags argument is intended to future-proof the syscall against extensions which may require interpreting the arguments in the structure differently. Semantics of `set_mempolicy2` are otherwise the same as `set_mempolicy` as of this patch. As of this patch, setting the home node of a task policy is not supported, as this functionality was not supported by set_mempolicy. Additional research should be done to determine whether adding this functionality is safe, but doing so would only require setting MPOL_MF_HOME_NODE and providing a valid home node value.
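For illustration only, a minimal userland sketch of how a caller might populate the argument structure described above for an interleave policy. It is not part of the patch. The names struct mpol_args_sketch, sketch_fill_interleave and the SKETCH_* macros are hypothetical; the field layout mirrors struct mpol_args as added by this series, and the actual syscall invocation is left as a comment because the syscall number assigned here applies only to this posting.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of the struct mpol_args proposed in this series.
 * The "_sketch" names are hypothetical, chosen so they cannot clash
 * with real uapi headers; only the field layout follows the patch.
 */
struct mpol_args_sketch {
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;	/* ignored by set_mempolicy2 as of this patch */
	uint64_t pol_maxnodes;
	uint64_t pol_nodes __attribute__((aligned(8)));	/* nodemask pointer */
};

#define SKETCH_MPOL_INTERLEAVE	3	/* value of MPOL_INTERLEAVE in the uapi enum */
#define SKETCH_NO_NODE		(-1)	/* assumption: mirrors NUMA_NO_NODE */
#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Fill *args for an interleave policy over the given node ids.
 * mask/nwords is caller-provided nodemask storage.
 * Returns 0 on success, -1 if a node id does not fit in the mask.
 */
static int sketch_fill_interleave(struct mpol_args_sketch *args,
				  unsigned long *mask, size_t nwords,
				  const int *nodes, size_t nnodes)
{
	size_t capacity = nwords * SKETCH_BITS_PER_LONG;
	size_t i;

	/* Zero everything so unused fields read as zero in the kernel. */
	memset(args, 0, sizeof(*args));
	memset(mask, 0, nwords * sizeof(*mask));

	for (i = 0; i < nnodes; i++) {
		size_t node;

		if (nodes[i] < 0 || (size_t)nodes[i] >= capacity)
			return -1;
		node = (size_t)nodes[i];
		mask[node / SKETCH_BITS_PER_LONG] |=
			1UL << (node % SKETCH_BITS_PER_LONG);
	}

	args->mode = SKETCH_MPOL_INTERLEAVE;
	args->mode_flags = 0;		/* MPOL_F_* flags go here, not in mode */
	args->home_node = SKETCH_NO_NODE;
	args->pol_maxnodes = capacity;
	args->pol_nodes = (uint64_t)(uintptr_t)mask;
	return 0;
}

/*
 * With headers from a kernel carrying this series, the call would then
 * look roughly like:
 *
 *	syscall(__NR_set_mempolicy2, &args, sizeof(args), 0UL);
 *
 * Do not hard-code the number from this posting on other kernels.
 */
```

Passing sizeof(args) as usize is what lets the kernel side (copy_struct_from_user) accept both older, shorter structures and newer, longer ones whose extra bytes are zero.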
Suggested-by: Michal Hocko Signed-off-by: Gregory Price Acked-by: Geert Uytterhoeven Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Hyeongtak Ji Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Vinicius Tavares Petrucci Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- .../admin-guide/mm/numa_memory_policy.rst | 10 ++++++ arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 ++ arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 2 ++ include/uapi/asm-generic/unistd.h | 4 ++- kernel/sys_ni.c | 1 + mm/mempolicy.c | 36 +++++++++++++++++++ .../arch/mips/entry/syscalls/syscall_n64.tbl | 1 + .../arch/powerpc/entry/syscalls/syscall.tbl | 1 + .../perf/arch/s390/entry/syscalls/syscall.tbl | 1 + .../arch/x86/entry/syscalls/syscall_64.tbl | 1 + 25 files changed, 73 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Document= ation/admin-guide/mm/numa_memory_policy.rst index 5ee047b0d981..4720978ab1c2 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -432,6 +432,8 @@ Set [Task] Memory Policy:: =20 
long set_mempolicy(int mode, const unsigned long *nmask, unsigned long maxnode); + long set_mempolicy2(struct mpol_args *args, size_t size, + unsigned long flags); =20 Set's the calling task's "task/process memory policy" to mode specified by the 'mode' argument and the set of nodes defined by @@ -440,6 +442,12 @@ specified by the 'mode' argument and the set of nodes = defined by 'mode' argument with the flag (for example: MPOL_INTERLEAVE | MPOL_F_STATIC_NODES). =20 +set_mempolicy2() is an extended version of set_mempolicy() capable +of setting a mempolicy which requires more information than can be +passed via set_mempolicy(). For example, weighted interleave with +task-local weights requires a weight array to be passed via the +'mpol_args->il_weights' argument in the 'struct mpol_args' arg. + See the set_mempolicy(2) man page for more details =20 =20 @@ -495,6 +503,8 @@ Extended Mempolicy Arguments:: The extended mempolicy argument structure is defined to allow the mempolicy interfaces future extensibility without the need for additional system cal= ls. =20 +Extended interfaces (set_mempolicy2) use this argument structure. + The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to all interfaces relative to their non-extended counterparts. Each additional field may only apply to specific extended interfaces. See the respective
See the respective diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/sys= calls/syscall.tbl index 18c842ca6c32..0dc288a1118a 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -496,3 +496,4 @@ 564 common futex_wake sys_futex_wake 565 common futex_wait sys_futex_wait 566 common futex_requeue sys_futex_requeue +567 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 584f9528c996..50172ec0e1f5 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -470,3 +470,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unist= d.h index 531effca5f1f..298313d2e0af 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -39,7 +39,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) =20 -#define __NR_compat_syscalls 457 +#define __NR_compat_syscalls 458 #endif =20 #define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/uni= std32.h index 9f7c1bf99526..cee8d669c342 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -919,6 +919,8 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake) __SYSCALL(__NR_futex_wait, sys_futex_wait) #define __NR_futex_requeue 456 __SYSCALL(__NR_futex_requeue, sys_futex_requeue) +#define __NR_set_mempolicy2 457 +__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2) =20 /* * Please add new compat syscalls above this comment and update diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/sysca= lls/syscall.tbl index 7a4b780e82cb..839d90c535f2 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl 
@@ -456,3 +456,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/= kernel/syscalls/syscall.tbl index 5b6a0b02b7de..567c8b883735 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -462,3 +462,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/s= yscalls/syscall_n32.tbl index a842b41c8e06..cc0640e16f2f 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -395,3 +395,4 @@ 454 n32 futex_wake sys_futex_wake 455 n32 futex_wait sys_futex_wait 456 n32 futex_requeue sys_futex_requeue +457 n32 set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/s= yscalls/syscall_o32.tbl index 525cc54bc63b..f7262fde98d9 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -444,3 +444,4 @@ 454 o32 futex_wake sys_futex_wake 455 o32 futex_wait sys_futex_wait 456 o32 futex_requeue sys_futex_requeue +457 o32 set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/s= yscalls/syscall.tbl index a47798fed54e..e10f0e8bd064 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -455,3 +455,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel= /syscalls/syscall.tbl index 7fab411378f2..4f03f5f42b78 100644 --- 
a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -543,3 +543,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/sysca= lls/syscall.tbl index 86fec9b080f6..f98dadc2e9df 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -459,3 +459,4 @@ 454 common futex_wake sys_futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/= syscall.tbl index 363fae0fe9bf..f47ba9f2d05d 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -459,3 +459,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/sys= calls/syscall.tbl index 7bcaa3d5ea44..53fb16616728 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -502,3 +502,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscal= ls/syscall_32.tbl index c8fac5205803..4b4dc41b24ee 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -461,3 +461,4 @@ 454 i386 futex_wake sys_futex_wake 455 i386 futex_wait sys_futex_wait 456 i386 futex_requeue sys_futex_requeue +457 i386 set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscal= ls/syscall_64.tbl index 8cb8bf68721c..1bc2190bec27 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -378,6 +378,7 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 =20 # # Due to a historical design error, certain syscalls are numbered differen= tly diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/s= yscalls/syscall.tbl index 06eefa9c1458..e26dc89399eb 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -427,3 +427,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a52395ca3f00..451f0089601f 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -823,6 +823,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy, unsigned long addr, unsigned long flags); asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nm= ask, unsigned long maxnode); +asmlinkage long sys_set_mempolicy2(struct mpol_args __user *args, size_t s= ize, + unsigned long flags); asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode, const unsigned long __user *from, const unsigned long __user *to); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/u= nistd.h index 756b013fb832..55486aba099f 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake) __SYSCALL(__NR_futex_wait, sys_futex_wait) #define __NR_futex_requeue 456 __SYSCALL(__NR_futex_requeue, sys_futex_requeue) +#define __NR_set_mempolicy2 457 +__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2) =20 #undef __NR_syscalls -#define 
__NR_syscalls 457 +#define __NR_syscalls 458 =20 /* * 32 bit systems traditionally used different diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 9a846439b36a..fa1373c8bff8 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -189,6 +189,7 @@ COND_SYSCALL(remap_file_pages); COND_SYSCALL(mbind); COND_SYSCALL(get_mempolicy); COND_SYSCALL(set_mempolicy); +COND_SYSCALL(set_mempolicy2); COND_SYSCALL(migrate_pages); COND_SYSCALL(move_pages); COND_SYSCALL(set_mempolicy_home_node); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 35a0f8630ead..d1abb1fc5a53 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1634,6 +1634,42 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsi= gned long __user *, nmask, return kernel_set_mempolicy(mode, nmask, maxnode); } =20 +SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, = usize, + unsigned long, flags) +{ + struct mpol_args kargs; + struct mempolicy_args margs; + int err; + nodemask_t policy_nodemask; + unsigned long __user *nodes_ptr; + + if (flags) + return -EINVAL; + + err =3D copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize); + if (err) + return err; + + err =3D validate_mpol_flags(kargs.mode, &kargs.mode_flags); + if (err) + return err; + + memset(&margs, 0, sizeof(margs)); + margs.mode =3D kargs.mode; + margs.mode_flags =3D kargs.mode_flags; + if (kargs.pol_nodes) { + nodes_ptr =3D u64_to_user_ptr(kargs.pol_nodes); + err =3D get_nodes(&policy_nodemask, nodes_ptr, + kargs.pol_maxnodes); + if (err) + return err; + margs.policy_nodes =3D &policy_nodemask; + } else + margs.policy_nodes =3D NULL; + + return do_set_mempolicy(&margs); +} + static int kernel_migrate_pages(pid_t pid, unsigned long maxnode, const unsigned long __user *old_nodes, const unsigned long __user *new_nodes) diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/pe= rf/arch/mips/entry/syscalls/syscall_n64.tbl index 116ff501bf92..bb1351df51d9 100644 --- 
a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl +++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl @@ -371,3 +371,4 @@ 454 n64 futex_wake sys_futex_wake 455 n64 futex_wait sys_futex_wait 456 n64 futex_requeue sys_futex_requeue +457 n64 set_mempolicy2 sys_set_mempolicy2 diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/per= f/arch/powerpc/entry/syscalls/syscall.tbl index 7fab411378f2..4f03f5f42b78 100644 --- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl +++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl @@ -543,3 +543,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/a= rch/s390/entry/syscalls/syscall.tbl index 86fec9b080f6..f98dadc2e9df 100644 --- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl +++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl @@ -459,3 +459,4 @@ 454 common futex_wake sys_futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 sys_set_mempolicy2 diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf= /arch/x86/entry/syscalls/syscall_64.tbl index 8cb8bf68721c..21f2579679d4 100644 --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl @@ -378,6 +378,7 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 =20 # # Due to a historical design error, certain syscalls are numbered differen= tly --=20 2.39.1 From nobody Sat Dec 27 03:14:25 2025 Received: from mail-pl1-f193.google.com (mail-pl1-f193.google.com [209.85.214.193]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No 
client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E14CE22EFE; Sat, 23 Dec 2023 18:11:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="j+juM5Aw" Received: by mail-pl1-f193.google.com with SMTP id d9443c01a7336-1d3dee5f534so26703115ad.1; Sat, 23 Dec 2023 10:11:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1703355115; x=1703959915; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=ArIQYfO4++0rzVrZsk0Fr3dtoB9jeFenlWgS51ZU9V0=; b=j+juM5AwqbgVcJrfefb+dmb1hG7MdZxQjAie55rOfOMI1FmZlwc5ws5nEihKk5KA13 dLdytLFnntHYz1tcD58QxoqNy0DkfqKYsD1KXQoMrnDX+wit28at1egsZFSoe8+oxklK oglfTRMpuDLU50KBACZvR4dn0Hp7XRDuZA6J9KgXH9ZnZpb4QPU3DK0PQ5MYtuh6LnVB 4dByUjTTSCeuSgaRb1xPDFW9+WWWjqItTVThRMrZ3FjsTt0MoH+G1jGdKNMCoN5tJB8t PtVQ8CTg/vhMLjRM5vdwAv697ayXvMBG3eAYFut7jIppDO2lCkMGMaGFvEqa9iT8ug4Q b2MA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703355115; x=1703959915; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=ArIQYfO4++0rzVrZsk0Fr3dtoB9jeFenlWgS51ZU9V0=; b=bdswLjfNqoQzBZBn+yPYVqlJlAeS/m7t+DZ3qtIdF5GfU5uQT9LXnip9NoVHYKSLvn OOi73jhzrzIfr+QQLtnaeIAOGVmPqtrZo1Zegvsejh9o7hmJaqXtpJuwNjSpEg9UuA9u uSdfz5d/iOHiwe3MPjYnp5uycPSr/gku4tJc2YdVUqKmXnTTwl/NNxl64C9h6DIi94C8 qsXxhvi97a8t8bQZm0L8stj2Qr5eZ0n0IzQiuZGyx7a4aNBgxqY7tdJLhZ1U+g/er8IQ 3B2ioM6yb+rEseljZ1pMCElLapJ3HOxLh1YLL5pNS7MGmkndQHE/s7u5Pn6mkahAJ0Lk owlg== X-Gm-Message-State: 
AOJu0YwyuIkquu3AWIj7R1cDdBAJsF86qFGR6n2iCv9NO1mxpjnnrhFk UqKSKNetEkwXiKwujxcRkw== X-Google-Smtp-Source: AGHT+IHF4hMScju+QRunQy8QbhXdxHXyk3RWjYpe0nuyKSDJ1AhHIa4eGUDHwqAEfAaVj8MzseYqVg== X-Received: by 2002:a17:902:db0b:b0:1d3:f1b7:b467 with SMTP id m11-20020a170902db0b00b001d3f1b7b467mr4458527plx.14.1703355115090; Sat, 23 Dec 2023 10:11:55 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id t6-20020a170902a5c600b001d3bfd30886sm4316396plq.37.2023.12.23.10.11.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 23 Dec 2023 10:11:54 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, Michal Hocko Subject: [PATCH v5 09/11] mm/mempolicy: add get_mempolicy2 syscall Date: Sat, 23 Dec 2023 13:10:59 -0500 Message-Id: <20231223181101.1954-10-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com> References: <20231223181101.1954-1-gregory.price@memverge.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" get_mempolicy2 is an extensible get_mempolicy interface which allows a user to 
retrieve the memory policy for a task or address.

Defined as:

get_mempolicy2(struct mpol_args *args, size_t size,
               unsigned long addr, unsigned long flags)

Top level input values:

mpol_args: The structure which collects information about the
           mempolicy and is returned to userspace.
addr: if MPOL_F_ADDR is passed in `flags`, the mempolicy details of
      the vma containing this address are returned
flags: if MPOL_F_ADDR, return the mempolicy info of the vma containing
       addr; otherwise, return the task mempolicy information

Input values include the following fields of mpol_args:

pol_nodes: if set, the nodemask of the policy is returned here
pol_maxnodes: if pol_nodes is set, must describe the max number of
              nodes to be copied to pol_nodes

Output values include the following fields of mpol_args:

mode: mempolicy mode
mode_flags: mempolicy mode flags
home_node: the policy home node is returned here, or -1 if no home
           node is set
pol_nodes: if set, the nodemask for the mempolicy
policy_node: if the policy has extended node information, it is
             placed here. For example, MPOL_INTERLEAVE returns the
             next node which will be used for allocation.

MPOL_F_NODE has been dropped from get_mempolicy2 (EINVAL).
MPOL_F_MEMS_ALLOWED has been dropped from get_mempolicy2 (EINVAL).
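
The flag rules described above can be sketched in userspace as follows.
This is an illustration only: it models the check rather than invoking
the syscall, because get_mempolicy2 is only proposed by this series. The
three flag values mirror include/uapi/linux/mempolicy.h (written here as
unsigned long constants); the enum and the helper
get_mempolicy2_classify_flags() are hypothetical names made up for this
sketch and are not kernel symbols.

```c
#include <errno.h>

/* Flag bits as in include/uapi/linux/mempolicy.h (UL-typed for this sketch). */
#define MPOL_F_NODE         (1UL << 0)
#define MPOL_F_ADDR         (1UL << 1)
#define MPOL_F_MEMS_ALLOWED (1UL << 2)

/* Hypothetical: which policy a get_mempolicy2 call would look up. */
enum mpol_lookup {
	MPOL_LOOKUP_TASK,	/* no flags: mempolicy of the calling task */
	MPOL_LOOKUP_VMA,	/* MPOL_F_ADDR: mempolicy of the vma containing addr */
};

/*
 * Hypothetical userspace model of the flag check described above.
 * Returns 0 and stores the lookup kind in *out on success, or -EINVAL
 * if any flag other than MPOL_F_ADDR is set (this covers the dropped
 * MPOL_F_NODE and MPOL_F_MEMS_ALLOWED flags).
 */
static int get_mempolicy2_classify_flags(unsigned long flags,
					  enum mpol_lookup *out)
{
	if (flags & ~MPOL_F_ADDR)
		return -EINVAL;

	*out = (flags & MPOL_F_ADDR) ? MPOL_LOOKUP_VMA : MPOL_LOOKUP_TASK;
	return 0;
}
```

A caller that relied on MPOL_F_NODE or MPOL_F_MEMS_ALLOWED would keep
using get_mempolicy(); under this model every combination containing
either flag is rejected before any policy is looked up.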
Suggested-by: Michal Hocko Signed-off-by: Gregory Price Acked-by: Geert Uytterhoeven Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Hyeongtak Ji Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Vinicius Tavares Petrucci Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- .../admin-guide/mm/numa_memory_policy.rst | 11 ++++- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 4 +- kernel/sys_ni.c | 1 + mm/mempolicy.c | 42 +++++++++++++++++++ .../arch/mips/entry/syscalls/syscall_n64.tbl | 1 + .../arch/powerpc/entry/syscalls/syscall.tbl | 1 + .../perf/arch/s390/entry/syscalls/syscall.tbl | 1 + .../arch/x86/entry/syscalls/syscall_64.tbl | 1 + 25 files changed, 79 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Document= ation/admin-guide/mm/numa_memory_policy.rst index 4720978ab1c2..f50b7f7ddbf9 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -456,11 +456,20 @@ Get [Task] Memory Policy or 
Related Information::
 	long get_mempolicy(int *mode,
 			   const unsigned long *nmask, unsigned long maxnode,
 			   void *addr, int flags);
+	long get_mempolicy2(struct mpol_args *args, size_t size,
+			    unsigned long addr, unsigned long flags);
 
 Queries the "task/process memory policy" of the calling task, or the
 policy or location of a specified virtual address, depending on the
 'flags' argument.
 
+get_mempolicy2() is an extended version of get_mempolicy() capable of
+acquiring extended information about a mempolicy, including those
+that can only be set via set_mempolicy2() or mbind2().
+
+MPOL_F_NODE functionality has been removed from get_mempolicy2(),
+but can still be accessed via get_mempolicy().
+
 See the get_mempolicy(2) man page for more details
 
 
@@ -503,7 +512,7 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
-Extended interfaces (set_mempolicy2) use this argument structure.
+Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.
 
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts.
Each additional
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 0dc288a1118a..0301a8b0a262 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -497,3 +497,4 @@
 565 common futex_wait sys_futex_wait
 566 common futex_requeue sys_futex_requeue
 567 common set_mempolicy2 sys_set_mempolicy2
+568 common get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 50172ec0e1f5..771a33446e8e 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -471,3 +471,4 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 298313d2e0af..b63f870debaf 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls 458
+#define __NR_compat_syscalls 459
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index cee8d669c342..f8d01007aee0 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -921,6 +921,8 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 #define __NR_set_mempolicy2 457
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 458
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 839d90c535f2..048a409e684c 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -457,3 +457,4 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 567c8b883735..327b01bd6793 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -463,3 +463,4 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index cc0640e16f2f..921d58e1da23 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -396,3 +396,4 @@
 455 n32 futex_wait sys_futex_wait
 456 n32 futex_requeue sys_futex_requeue
 457 n32 set_mempolicy2 sys_set_mempolicy2
+458 n32 get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index f7262fde98d9..9271c83c9993 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -445,3 +445,4 @@
 455 o32 futex_wait sys_futex_wait
 456 o32 futex_requeue sys_futex_requeue
 457 o32 set_mempolicy2 sys_set_mempolicy2
+458 o32 get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index e10f0e8bd064..0654f3f89fc7 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4f03f5f42b78..ac11d2064e7a 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -544,3 +544,4 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index f98dadc2e9df..1cdcafe1ccca 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455 common futex_wait sys_futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index f47ba9f2d05d..f71742024c29 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 53fb16616728..2fbf5dbe0620 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -503,3 +503,4 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4b4dc41b24ee..0af813b9a118 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -462,3 +462,4 @@
 455 i386 futex_wait sys_futex_wait
 456 i386 futex_requeue sys_futex_requeue
 457 i386 set_mempolicy2 sys_set_mempolicy2
+458 i386 get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1bc2190bec27..0b777876fc15 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -379,6 +379,7 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index e26dc89399eb..4536c9a4227d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -428,3 +428,4 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 451f0089601f..f696855cbe8c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -821,6 +821,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned long addr, unsigned long flags);
+asmlinkage long sys_get_mempolicy2(struct mpol_args __user *args, size_t size,
+				   unsigned long addr, unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
 				unsigned long maxnode);
 asmlinkage long sys_set_mempolicy2(struct mpol_args __user *args, size_t size,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 55486aba099f..719accc731db 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -830,9 +830,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 #define __NR_set_mempolicy2 457
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 458
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 458
+#define __NR_syscalls 459
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index fa1373c8bff8..6afbd3a41319 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -188,6 +188,7 @@ COND_SYSCALL(process_mrelease);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
 COND_SYSCALL(get_mempolicy);
+COND_SYSCALL(get_mempolicy2);
 COND_SYSCALL(set_mempolicy);
 COND_SYSCALL(set_mempolicy2);
 COND_SYSCALL(migrate_pages);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d1abb1fc5a53..f2c12a8ff7b8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1862,6 +1862,48 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
 	return kernel_get_mempolicy(policy, nmask, maxnode, addr, flags);
 }
 
+SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
+		unsigned long, addr, unsigned long, flags)
+{
+	struct mpol_args kargs;
+	struct mempolicy_args margs;
+	int err;
+	nodemask_t policy_nodemask;
+	unsigned long __user *nodes_ptr;
+
+	if (flags & ~(MPOL_F_ADDR))
+		return -EINVAL;
+
+	/* initialize any memory liable to be copied to userland */
+	memset(&margs, 0, sizeof(margs));
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return -EINVAL;
+
+	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
+	if (flags & MPOL_F_ADDR)
+		err = do_get_vma_mempolicy(untagged_addr(addr), NULL, &margs);
+	else
+		err = do_get_task_mempolicy(&margs, NULL);
+
+	if (err)
+		return err;
+
+	kargs.mode = margs.mode;
+	kargs.mode_flags = margs.mode_flags;
+	kargs.home_node = margs.home_node;
+	if (kargs.pol_nodes) {
+		nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
+		err = copy_nodes_to_user(nodes_ptr, kargs.pol_maxnodes,
+					 margs.policy_nodes);
+		if (err)
+			return err;
+	}
+
+	return copy_to_user(uargs, &kargs, min(usize, sizeof(kargs))) ? -EFAULT : 0;
+}
+
 bool vma_migratable(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index bb1351df51d9..c34c6877379e 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -372,3 +372,4 @@
 455 n64 futex_wait sys_futex_wait
 456 n64 futex_requeue sys_futex_requeue
 457 n64 set_mempolicy2 sys_set_mempolicy2
+458 n64 get_mempolicy2 sys_get_mempolicy2
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 4f03f5f42b78..ac11d2064e7a 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -544,3 +544,4 @@
 455 common futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
+458 common get_mempolicy2 sys_get_mempolicy2
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index f98dadc2e9df..1cdcafe1ccca 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455 common futex_wait sys_futex_wait sys_futex_wait
 456 common futex_requeue sys_futex_requeue sys_futex_requeue
 457 common set_mempolicy2
sys_set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 sys_get_mempolicy2 diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf= /arch/x86/entry/syscalls/syscall_64.tbl index 21f2579679d4..edf338f32645 100644 --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl @@ -379,6 +379,7 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 =20 # # Due to a historical design error, certain syscalls are numbered differen= tly --=20 2.39.1 From nobody Sat Dec 27 03:14:25 2025 Received: from mail-pl1-f194.google.com (mail-pl1-f194.google.com [209.85.214.194]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E6C90249E7; Sat, 23 Dec 2023 18:12:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="KBqTUbnz" Received: by mail-pl1-f194.google.com with SMTP id d9443c01a7336-1d2e6e14865so17396265ad.0; Sat, 23 Dec 2023 10:12:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1703355121; x=1703959921; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=/LsRssi/HSNpVhoocvXNcyoNsBKk58VGuCuI9zxl5aw=; b=KBqTUbnzmvyfL5940oVQ6DuwE0pC2C1xOEqre63h0mNLkz3fP5BloyZK+xT8hjMIby S7BW2nJpxYSoo3R2nZmWPDRO+WQUaUAGg4J3XwE5xWoetZpRTj16IRTNM+HD7yfFHypC X6BiH50EWPDkUMsN6M8QJ0+r9gvS8tGSUtJrXcSSikQtkRkKjruqGqkG5FcNg7Bo1KMV 
09rLoHC7rppNGx6g96JeTZGqt9ZXE+muaD4l7TciyX7j0oWVDf2pa8ux3Uv1n4Rm8Uld 7okigYMolmT5So5layXzUSAv+UUuMnw4DhTNyEYtk0ipLO8P4TigzzMVgqPuY931+8zd Tt5Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703355121; x=1703959921; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=/LsRssi/HSNpVhoocvXNcyoNsBKk58VGuCuI9zxl5aw=; b=NdFD7JEVaRVI1kL1DDZg8tcSFqzkBlz/gq3l8qFbqLoSZz0NC3LPn4P7GmROsFLoPV m6qnwL88kkdLwvPvwTlKyU7EtBEgY/HvT7Ea/fjjd/lKeQhw3nhUX4W33uamAbuImRjx YBEgMk3W1QMPyJozme9yWRZ7QrJemQE8stihbZ/DA51jEpSDpVka70apENMOjMq9iy/+ Z4UqzoqJMvluz9yPEpWsCEXed2ZfCqvlTT9e3ffHckNjVBbCVjCVbANr9Y+ynMaukjJq t5wVECdjSfPOZmNNJv4+3wGXixWH5UeNqW8sm3nZk9YOz3fHmE9oorZ1Z5n1+DnlbtCF wrtg== X-Gm-Message-State: AOJu0Yxs5tkrSWwnrfnurF44Z82RBiq6j07s5/OkpEiordjssnh+OC23 WiCrWjikjsT1dHD9HksXNQ== X-Google-Smtp-Source: AGHT+IErco+Uh7FU0/PHHoYj7I5juZyAeqtTbY6N2ahJshl7Q6N+HJTBpQfVRZBcD+YunQSSOP9IvA== X-Received: by 2002:a17:902:e5ce:b0:1d4:3b76:be90 with SMTP id u14-20020a170902e5ce00b001d43b76be90mr444933plf.39.1703355121229; Sat, 23 Dec 2023 10:12:01 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. 
[173.79.56.208]) by smtp.gmail.com with ESMTPSA id t6-20020a170902a5c600b001d3bfd30886sm4316396plq.37.2023.12.23.10.11.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 23 Dec 2023 10:12:00 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, Michal Hocko , Frank van der Linden Subject: [PATCH v5 10/11] mm/mempolicy: add the mbind2 syscall Date: Sat, 23 Dec 2023 13:11:00 -0500 Message-Id: <20231223181101.1954-11-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com> References: <20231223181101.1954-1-gregory.price@memverge.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" mbind2 is an extensible mbind interface which allows a user to set the mempolicy for one or more address ranges. Defined as: mbind2(unsigned long addr, unsigned long len, struct mpol_args *args, size_t size, unsigned long flags) addr: address of the memory range to operate on len: length of the memory range flags: MPOL_MF_HOME_NODE + original mbind() flags Input values include the following fields of mpol_args: mode: The MPOL_* policy (DEFAULT, INTERLEAVE, etc.) 
mode_flags: The MPOL_F_* flags that were previously OR'd into the
            mode argument. They are split out to leave room for
            future mode and flag extensions.
home_node: if (flags & MPOL_MF_HOME_NODE), the home node of the policy
           is set to this value; otherwise it is ignored.
pol_maxnodes: The max number of nodes described by pol_nodes
pol_nodes: the nodemask to apply for the memory policy

The semantics are otherwise the same as mbind(), except that the home_node
can be set.

Suggested-by: Michal Hocko
Suggested-by: Frank van der Linden
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Rakie Kim
Suggested-by: Hyeongtak Ji
Suggested-by: Honggyu Kim
Signed-off-by: Gregory Price
Co-developed-by: Vinicius Tavares Petrucci
Acked-by: Geert Uytterhoeven
Suggested-by: Dan Williams
Suggested-by: Gregory Price
Suggested-by: Hao Wang
Suggested-by: Hasan Al Maruf
Suggested-by: Johannes Weiner
Suggested-by: John Groves
Suggested-by: Jonathan Cameron
Suggested-by: Ravi Jonnalagadda
Suggested-by: Srinivasulu Thanneeru
Suggested-by: Ying Huang
Suggested-by: Zhongkun He
Suggested-by: tj
---
 .../admin-guide/mm/numa_memory_policy.rst | 12 +++++-
 arch/alpha/kernel/syscalls/syscall.tbl | 1 +
 arch/arm/tools/syscall.tbl | 1 +
 arch/arm64/include/asm/unistd.h | 2 +-
 arch/arm64/include/asm/unistd32.h | 2 +
 arch/m68k/kernel/syscalls/syscall.tbl | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
 arch/s390/kernel/syscalls/syscall.tbl | 1 +
 arch/sh/kernel/syscalls/syscall.tbl | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
 include/linux/syscalls.h | 3 ++
 include/uapi/asm-generic/unistd.h | 4 +-
 include/uapi/linux/mempolicy.h | 5 ++-
 kernel/sys_ni.c | 1 +
 mm/mempolicy.c | 43
+++++++++++++++++++
 .../arch/mips/entry/syscalls/syscall_n64.tbl | 1 +
 .../arch/powerpc/entry/syscalls/syscall.tbl | 1 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl | 1 +
 .../arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 26 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index f50b7f7ddbf9..7edee775cd2f 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -478,12 +478,18 @@ Install VMA/Shared Policy for a Range of Task's Address Space::
 	long mbind(void *start, unsigned long len, int mode,
 		   const unsigned long *nmask, unsigned long maxnode,
 		   unsigned flags);
+	long mbind2(void *start, unsigned long len, struct mpol_args *args,
+		    size_t size, unsigned long flags);
 
 mbind() installs the policy specified by (mode, nmask, maxnodes) as a
 VMA policy for the range of the calling task's address space specified
 by the 'start' and 'len' arguments. Additional actions may be
 requested via the 'flags' argument.
 
+mbind2() is an extended version of mbind() capable of setting extended
+mempolicy features. For example, one can set the home node for the memory
+policy without an additional call to set_mempolicy_home_node().
+
 See the mbind(2) man page for more details.
 
 Set home node for a Range of Task's Address Spacec::
@@ -499,6 +505,9 @@ closest to which page allocation will come from. Specifying the home node overri
 the default allocation policy to allocate memory close to the local node for an
 executing CPU.
 
+mbind2() also provides a way for the home node to be set at the time the
+mempolicy is set. See the mbind(2) man page for more details.
+
 Extended Mempolicy Arguments::
 
 	struct mpol_args {
@@ -512,7 +521,8 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
-Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.
+Extended interfaces (set_mempolicy2, get_mempolicy2, and mbind2) use this
+argument structure.
 
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 0301a8b0a262..e8239293c35a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -498,3 +498,4 @@
 566 common futex_requeue sys_futex_requeue
 567 common set_mempolicy2 sys_set_mempolicy2
 568 common get_mempolicy2 sys_get_mempolicy2
+569 common mbind2 sys_mbind2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 771a33446e8e..a3f39750257a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -472,3 +472,4 @@
 456 common futex_requeue sys_futex_requeue
 457 common set_mempolicy2 sys_set_mempolicy2
 458 common get_mempolicy2 sys_get_mempolicy2
+459 common mbind2 sys_mbind2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index b63f870debaf..abe10a833fcd 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls 459
+#define __NR_compat_syscalls 460
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index f8d01007aee0..446b7f034332 100644
--- a/arch/arm64/include/asm/unistd32.h
+++
b/arch/arm64/include/asm/unistd32.h @@ -923,6 +923,8 @@ __SYSCALL(__NR_futex_requeue, sys_futex_requeue) __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2) #define __NR_get_mempolicy2 458 __SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2) +#define __NR_mbind2 459 +__SYSCALL(__NR_mbind2, sys_mbind2) =20 /* * Please add new compat syscalls above this comment and update diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/sysca= lls/syscall.tbl index 048a409e684c..9a12dface18e 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -458,3 +458,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/= kernel/syscalls/syscall.tbl index 327b01bd6793..6cb740123137 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -464,3 +464,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/s= yscalls/syscall_n32.tbl index 921d58e1da23..52cf720f8ae2 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -397,3 +397,4 @@ 456 n32 futex_requeue sys_futex_requeue 457 n32 set_mempolicy2 sys_set_mempolicy2 458 n32 get_mempolicy2 sys_get_mempolicy2 +459 n32 mbind2 sys_mbind2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/s= yscalls/syscall_o32.tbl index 9271c83c9993..fd37c5301a48 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -446,3 +446,4 @@ 456 o32 futex_requeue sys_futex_requeue 457 o32 set_mempolicy2 sys_set_mempolicy2 458 o32 get_mempolicy2 sys_get_mempolicy2 +459 o32 
mbind2 sys_mbind2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/s= yscalls/syscall.tbl index 0654f3f89fc7..fcd67bc405b1 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -457,3 +457,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel= /syscalls/syscall.tbl index ac11d2064e7a..89715417014c 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -545,3 +545,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/sysca= lls/syscall.tbl index 1cdcafe1ccca..c8304e0d0aa7 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -461,3 +461,4 @@ 456 common futex_requeue sys_futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 sys_mbind2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/= syscall.tbl index f71742024c29..e5c51b6c367f 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -461,3 +461,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/sys= calls/syscall.tbl index 2fbf5dbe0620..74527f585500 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -504,3 +504,4 @@ 456 common futex_requeue 
sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscal= ls/syscall_32.tbl index 0af813b9a118..be2e2aa17dd8 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -463,3 +463,4 @@ 456 i386 futex_requeue sys_futex_requeue 457 i386 set_mempolicy2 sys_set_mempolicy2 458 i386 get_mempolicy2 sys_get_mempolicy2 +459 i386 mbind2 sys_mbind2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscal= ls/syscall_64.tbl index 0b777876fc15..6e2347eb8773 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -380,6 +380,7 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 =20 # # Due to a historical design error, certain syscalls are numbered differen= tly diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/s= yscalls/syscall.tbl index 4536c9a4227d..f00a21317dc0 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -429,3 +429,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index f696855cbe8c..b42622ea9ed9 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -817,6 +817,9 @@ asmlinkage long sys_mbind(unsigned long start, unsigned= long len, const unsigned long __user *nmask, unsigned long maxnode, unsigned flags); +asmlinkage long sys_mbind2(unsigned long start, unsigned long len, + const struct mpol_args __user *uargs, size_t usize, + unsigned long flags); asmlinkage long sys_get_mempolicy(int __user *policy, unsigned long __user 
*nmask, unsigned long maxnode, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/u= nistd.h index 719accc731db..cd31599bb9cc 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -832,9 +832,11 @@ __SYSCALL(__NR_futex_requeue, sys_futex_requeue) __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2) #define __NR_get_mempolicy2 458 __SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2) +#define __NR_mbind2 459 +__SYSCALL(__NR_mbind2, sys_mbind2) =20 #undef __NR_syscalls -#define __NR_syscalls 459 +#define __NR_syscalls 460 =20 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 4dd2d2e0d2ed..8880b753a446 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -52,13 +52,14 @@ struct mpol_args { #define MPOL_F_ADDR (1<<1) /* look up vma using address */ #define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */ =20 -/* Flags for mbind */ +/* Flags for mbind/mbind2 */ #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to policy */ #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */ #define MPOL_MF_LAZY (1<<3) /* UNSUPPORTED FLAG: Lazy migrate on fault */ -#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */ +#define MPOL_MF_HOME_NODE (1<<4) /* mbind2: set home node */ +#define MPOL_MF_INTERNAL (1<<5) /* Internal flags start here */ =20 #define MPOL_MF_VALID (MPOL_MF_STRICT | \ MPOL_MF_MOVE | \ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 6afbd3a41319..2483b5afa99f 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -187,6 +187,7 @@ COND_SYSCALL(process_madvise); COND_SYSCALL(process_mrelease); COND_SYSCALL(remap_file_pages); COND_SYSCALL(mbind); +COND_SYSCALL(mbind2); COND_SYSCALL(get_mempolicy); COND_SYSCALL(get_mempolicy2); COND_SYSCALL(set_mempolicy); diff --git 
a/mm/mempolicy.c b/mm/mempolicy.c index f2c12a8ff7b8..b5aca779249a 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1601,6 +1601,49 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigne= d long, len, return kernel_mbind(start, len, mode, nmask, maxnode, flags); } =20 +SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len, + const struct mpol_args __user *, uargs, size_t, usize, + unsigned long, flags) +{ + struct mpol_args kargs; + struct mempolicy_args margs; + nodemask_t policy_nodes; + unsigned long __user *nodes_ptr; + int err; + + if (!start || !len) + return -EINVAL; + + err =3D copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize); + if (err) + return -EINVAL; + + err =3D validate_mpol_flags(kargs.mode, &kargs.mode_flags); + if (err) + return err; + + margs.mode =3D kargs.mode; + margs.mode_flags =3D kargs.mode_flags; + + /* if home node given, validate it is online */ + if (flags & MPOL_MF_HOME_NODE) { + if ((kargs.home_node >=3D MAX_NUMNODES) || + !node_online(kargs.home_node)) + return -EINVAL; + margs.home_node =3D kargs.home_node; + } else + margs.home_node =3D NUMA_NO_NODE; + flags &=3D ~MPOL_MF_HOME_NODE; + + nodes_ptr =3D u64_to_user_ptr(kargs.pol_nodes); + err =3D get_nodes(&policy_nodes, nodes_ptr, kargs.pol_maxnodes); + if (err) + return err; + margs.policy_nodes =3D &policy_nodes; + + return do_mbind(untagged_addr(start), len, &margs, flags); +} + /* Set the process memory policy */ static long kernel_set_mempolicy(int mode, const unsigned long __user *nma= sk, unsigned long maxnode) diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/pe= rf/arch/mips/entry/syscalls/syscall_n64.tbl index c34c6877379e..4fd9f742d903 100644 --- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl +++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl @@ -373,3 +373,4 @@ 456 n64 futex_requeue sys_futex_requeue 457 n64 set_mempolicy2 sys_set_mempolicy2 458 n64 get_mempolicy2 sys_get_mempolicy2 +459 n64 mbind2 sys_mbind2 
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/per= f/arch/powerpc/entry/syscalls/syscall.tbl index ac11d2064e7a..89715417014c 100644 --- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl +++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl @@ -545,3 +545,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/a= rch/s390/entry/syscalls/syscall.tbl index 1cdcafe1ccca..c8304e0d0aa7 100644 --- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl +++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl @@ -461,3 +461,4 @@ 456 common futex_requeue sys_futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 sys_mbind2 diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf= /arch/x86/entry/syscalls/syscall_64.tbl index edf338f32645..3fc74241da5d 100644 --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl @@ -380,6 +380,7 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 =20 # # Due to a historical design error, certain syscalls are numbered differen= tly --=20 2.39.1 From nobody Sat Dec 27 03:14:25 2025 Received: from mail-pj1-f67.google.com (mail-pj1-f67.google.com [209.85.216.67]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 265A52511A; Sat, 23 Dec 2023 18:12:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; 
spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="IiT4g2Qj" Received: by mail-pj1-f67.google.com with SMTP id 98e67ed59e1d1-28bd09e35e8so1416693a91.0; Sat, 23 Dec 2023 10:12:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1703355126; x=1703959926; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=ktKBvOizNMtREUaFHrOesptiJBwbxl5fyFxNMUuf/cI=; b=IiT4g2QjWrw6yQ2fUmybuPgrGxk7Szq55IZ2rtYwVEghxyJgCCzGtQ6It+rYrFvz1Y FvQhm4y2exIDnaTEtEINCIiAChVEAGgOKoDTd+qDIeMB7+d4iLCwHesXDqJo99208o0C fLKbyd9l9tD/vtpodMqMlyYfdnzCoSE76ZtJ4l+tMHa0JivSL6E5VxIe1vS5h9lDPQ58 Qnzq1WZ/kCVTvVzKoCwSI/N6A9qiMkIY3Of69dZyTKbN3mu/u6UntgT+H51Vc1TPhf2f mDz9h1ffm0UZxLTZ7S9KplrJLFvf6jgOxat0cCDEQrihdHomTTcv0SZ2K4pKCYU5wrLx OHdw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703355126; x=1703959926; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=ktKBvOizNMtREUaFHrOesptiJBwbxl5fyFxNMUuf/cI=; b=muYFUQrtRsquTwnGbnVttxDhGFD31pXCmEsCepEuZR9AQZ+jr3aJ52tue7A9fXluVK DIfLskVarRNkBg/9DAo++5vUuK20377mQmq4nO91epqTU3D3sAUaQ01ZlHR41Ff5iwOa tyUrkizBqHwuqWSi4Qn4a5tyXe+pbk8XsoBOEmEKG5m1n0LHIRf7ZDrNduUTqvtvPr12 vrdi6n3KhrWsWMtq6VJZ4d5vBvsuWg8m8mDZYAB17j7/Lo+KCPdx9StcXdyBUvCYl1pU dQgdRM7Drdz8hYp+gbzamLbW9d32rQv3UUTArvpPk0tbBjXicUAEB41f7tgSe8Edp2bu TD7Q== X-Gm-Message-State: AOJu0YxznKLHTJ7LV/nwHNHzorhwYVa7CgQKqSjtoVBJG5x97raWmpbX NEIJwCt9HuglTZKlvdYavw== X-Google-Smtp-Source: AGHT+IEQEAyj53TNXOFLfTA8VfsUgMr1mgc3b7wM67YewA2cpaI7zBE5vpRRyH+qhXrMuXL7MsCifQ== X-Received: by 2002:a05:6a20:745:b0:190:d1d2:570d with SMTP id l5-20020a056a20074500b00190d1d2570dmr1437969pzl.29.1703355126387; Sat, 23 Dec 
2023 10:12:06 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id t6-20020a170902a5c600b001d3bfd30886sm4316396plq.37.2023.12.23.10.12.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 23 Dec 2023 10:12:06 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, x86@kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com Subject: [PATCH v5 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Date: Sat, 23 Dec 2023 13:11:01 -0500 Message-Id: <20231223181101.1954-12-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com> References: <20231223181101.1954-1-gregory.price@memverge.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Extend set_mempolicy2 and mbind2 to support weighted interleave, and demonstrate the extensibility of the mpol_args structure. 
To support weighted interleave, we add interleave weight fields to the
following structures:

Kernel Internal: (include/linux/mempolicy.h)
    struct mempolicy {
        /* task-local weights to apply to weighted interleave */
        unsigned char weights[MAX_NUMNODES];
    }
    struct mempolicy_args {
        /* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
        unsigned char *il_weights; /* of size MAX_NUMNODES */
    }

UAPI: (include/uapi/linux/mempolicy.h)
    struct mpol_args {
        /* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
        __aligned_u64 il_weights; /* u8 buffer of size pol_maxnodes */
    }

The task-local weights are a single, one-dimensional array of weights
that apply to all possible nodes on the system. If a node is set in
the mempolicy nodemask, its weight in `il_weights` must be >= 1,
otherwise set_mempolicy2() will return -EINVAL. If a node is not set
in pol_nodes, its weight will default to `1` in the task policy.

The default value of `1` is required to handle the situation where a
task migrates to a set of nodes for which weights were not set (up to
and including the local NUMA node). For example, a migrated task whose
nodemask changes entirely will have all of its weights defaulted back
to `1`. If the nodemask changes to include a mix of nodes, some of
which had no weight set, the weighted interleave may be suboptimal.

If migrations are expected, a task should prefer not to use task-local
interleave weights, and instead utilize the global settings for natural
re-weighting on migration.

To support global vs local weighting, we add the kernel-internal flag:

    MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */

This flag is set when il_weights is omitted by set_mempolicy2() or
mbind2(), or when MPOL_WEIGHTED_INTERLEAVE is set by set_mempolicy()
or mbind().
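The validation and defaulting rules for task-local weights described above can be sketched in userspace C. This is only a model of what mpol_new() does in this patch, not kernel code: DEMO_MAX_NODES, the 32-bit nodemask, and build_task_weights() are hypothetical names introduced for illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_NODES 8	/* hypothetical stand-in for MAX_NUMNODES */

/*
 * Build the task-local weight array for a weighted-interleave policy.
 * Bit n of 'nodemask' is set when node n is part of the policy.
 * Returns 0 on success, or -EINVAL if the nodemask is empty or a node
 * in the nodemask has a weight of 0. 'out' is only written on success.
 */
static int build_task_weights(uint32_t nodemask,
			      const unsigned char *il_weights,
			      unsigned char out[DEMO_MAX_NODES])
{
	int node;

	if (!(nodemask & ((1u << DEMO_MAX_NODES) - 1)))
		return -EINVAL;

	/* Every node in the nodemask needs a weight of at least 1. */
	for (node = 0; node < DEMO_MAX_NODES; node++) {
		if ((nodemask & (1u << node)) && il_weights[node] == 0)
			return -EINVAL;
	}

	/* Nodes outside the nodemask default to a weight of 1. */
	memset(out, 1, DEMO_MAX_NODES);
	for (node = 0; node < DEMO_MAX_NODES; node++) {
		if (nodemask & (1u << node))
			out[node] = il_weights[node];
	}
	return 0;
}
```

With nodes 0 and 1 in the nodemask and input weights {5, 2, 0, ...}, the call succeeds and every other node ends up with weight 1. Adding node 2 to the nodemask with the same input is rejected, because node 2 would have a weight of 0.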
The internal MPOL_F_GWEIGHT mode flag dictates whether global weights
or task-local weights are utilized by the various weighted interleave
functions:

* weighted_interleave_nodes
* weighted_interleave_nid
* alloc_pages_bulk_array_weighted_interleave

    if (pol->flags & MPOL_F_GWEIGHT)
        pol_weights = iw_table;
    else
        pol_weights = pol->wil.weights;

To simplify creation and duplication of mempolicies, the weights are
embedded directly within the mempolicy structure. This allows the
existing logic in __mpol_dup to copy the weights without additional
allocations:

    if (old == current->mempolicy) {
        task_lock(current);
        *new = *old;
        task_unlock(current);
    } else
        *new = *old;

Suggested-by: Rakie Kim
Suggested-by: Hyeongtak Ji
Suggested-by: Honggyu Kim
Suggested-by: Vinicius Tavares Petrucci
Signed-off-by: Gregory Price
Co-developed-by: Rakie Kim
Signed-off-by: Rakie Kim
Co-developed-by: Hyeongtak Ji
Signed-off-by: Hyeongtak Ji
Co-developed-by: Honggyu Kim
Signed-off-by: Honggyu Kim
Co-developed-by: Vinicius Tavares Petrucci
Signed-off-by: Vinicius Tavares Petrucci
Suggested-by: Dan Williams
Suggested-by: Frank van der Linden
Suggested-by: Gregory Price
Suggested-by: Hao Wang
Suggested-by: Hasan Al Maruf
Suggested-by: Johannes Weiner
Suggested-by: John Groves
Suggested-by: Jonathan Cameron
Suggested-by: Michal Hocko
Suggested-by: Ravi Jonnalagadda
Suggested-by: Srinivasulu Thanneeru
Suggested-by: Ying Huang
Suggested-by: Zhongkun He
Suggested-by: tj
---
 .../admin-guide/mm/numa_memory_policy.rst | 10 ++
 include/linux/mempolicy.h | 2 +
 include/uapi/linux/mempolicy.h | 2 +
 mm/mempolicy.c | 129 +++++++++++++++++-
 4 files changed, 139 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 7edee775cd2f..4f52a9108576 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -254,6 +254,8 @@ MPOL_WEIGHTED_INTERLEAVE
 This
mode operates the same as MPOL_INTERLEAVE, except that interleaving behavior is executed based on weights set in /sys/kernel/mm/mempolicy/weighted_interleave/ + when configured to utilize global weights, or based on task-local + weights configured with set_mempolicy2(2) or mbind2(2). =20 Weighted interleave allocations pages on nodes according to their weight. For example if nodes [0,1] are weighted [5,2] @@ -261,6 +263,13 @@ MPOL_WEIGHTED_INTERLEAVE 2 pages allocated on node1. This can better distribute data according to bandwidth on heterogeneous memory systems. =20 + When utilizing task-local weights, weights are not rebalanced + in the event of a task migration. If a weight has not been + explicitly set for a node set in the new nodemask, the + value of that weight defaults to "1". For this reason, if + migrations are expected or possible, users should consider + utilizing global interleave weights. + NUMA memory policy supports the following optional mode flags: =20 MPOL_F_STATIC_NODES @@ -516,6 +525,7 @@ Extended Mempolicy Arguments:: __s32 home_node; /* mbind2: set home node */ __u64 pol_maxnodes; __aligned_u64 pol_nodes; /* nodemask pointer */ + __aligned_u64 il_weights; /* u8 buf of size pol_maxnodes */ }; =20 The extended mempolicy argument structure is defined to allow the mempolicy diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 0f1c85527626..06ec3a3b0f22 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -58,6 +58,7 @@ struct mempolicy { /* Weighted interleave settings */ struct { unsigned char cur_weight; + unsigned char weights[MAX_NUMNODES]; } wil; }; =20 @@ -70,6 +71,7 @@ struct mempolicy_args { unsigned short mode_flags; /* policy mode flags */ int home_node; /* mbind: use MPOL_MF_HOME_NODE */ nodemask_t *policy_nodes; /* get/set/mbind */ + unsigned char *il_weights; /* for mode MPOL_WEIGHTED_INTERLEAVE */ }; =20 /* diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 
8880b753a446..16ee2359ef55 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -33,6 +33,7 @@ struct mpol_args { __s32 home_node; /* mbind2: policy home node */ __u64 pol_maxnodes; __aligned_u64 pol_nodes; + __aligned_u64 il_weights; /* size: pol_maxnodes * sizeof(char) */ }; =20 /* Flags for set_mempolicy */ @@ -73,6 +74,7 @@ struct mpol_args { #define MPOL_F_SHARED (1 << 0) /* identify shared policies */ #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ #define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */ +#define MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */ =20 /* * These bit locations are exposed in the vm.zone_reclaim_mode sysctl diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b5aca779249a..6bed4151e0c2 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -271,6 +271,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args= *args) unsigned short mode =3D args->mode; unsigned short flags =3D args->mode_flags; nodemask_t *nodes =3D args->policy_nodes; + int node; =20 if (mode =3D=3D MPOL_DEFAULT) { if (nodes && !nodes_empty(*nodes)) @@ -297,6 +298,19 @@ static struct mempolicy *mpol_new(struct mempolicy_arg= s *args) (flags & MPOL_F_STATIC_NODES) || (flags & MPOL_F_RELATIVE_NODES)) return ERR_PTR(-EINVAL); + } else if (mode =3D=3D MPOL_WEIGHTED_INTERLEAVE) { + /* weighted interleave requires a nodemask and weights > 0 */ + if (nodes_empty(*nodes)) + return ERR_PTR(-EINVAL); + if (args->il_weights) { + node =3D first_node(*nodes); + while (node !=3D MAX_NUMNODES) { + if (!args->il_weights[node]) + return ERR_PTR(-EINVAL); + node =3D next_node(node, *nodes); + } + } else if (!(args->mode_flags & MPOL_F_GWEIGHT)) + return ERR_PTR(-EINVAL); } else if (nodes_empty(*nodes)) return ERR_PTR(-EINVAL); =20 @@ -309,6 +323,17 @@ static struct mempolicy *mpol_new(struct mempolicy_arg= s *args) policy->home_node =3D args->home_node; policy->wil.cur_weight =3D 0; =20 + if (policy->mode =3D=3D 
MPOL_WEIGHTED_INTERLEAVE && args->il_weights) { + policy->wil.cur_weight =3D 0; + /* Minimum weight value is always 1 */ + memset(policy->wil.weights, 1, MAX_NUMNODES); + node =3D first_node(*nodes); + while (node !=3D MAX_NUMNODES) { + policy->wil.weights[node] =3D args->il_weights[node]; + node =3D next_node(node, *nodes); + } + } + return policy; } =20 @@ -937,6 +962,17 @@ static void do_get_mempolicy_nodemask(struct mempolicy= *pol, nodemask_t *nmask) } } =20 +static void do_get_mempolicy_il_weights(struct mempolicy *pol, + unsigned char weights[MAX_NUMNODES]) +{ + if (pol->mode !=3D MPOL_WEIGHTED_INTERLEAVE) + memset(weights, 0, MAX_NUMNODES); + else if (pol->flags & MPOL_F_GWEIGHT) + memcpy(weights, iw_table, MAX_NUMNODES); + else + memcpy(weights, pol->wil.weights, MAX_NUMNODES); +} + /* Retrieve NUMA policy for a VMA assocated with a given address */ static long do_get_vma_mempolicy(unsigned long addr, int *addr_node, struct mempolicy_args *args) @@ -970,6 +1006,9 @@ static long do_get_vma_mempolicy(unsigned long addr, i= nt *addr_node, if (args->policy_nodes) do_get_mempolicy_nodemask(pol, args->policy_nodes); =20 + if (args->il_weights) + do_get_mempolicy_il_weights(pol, args->il_weights); + if (pol !=3D &default_policy) { mpol_put(pol); mpol_cond_put(pol); @@ -997,6 +1036,9 @@ static long do_get_task_mempolicy(struct mempolicy_arg= s *args, int *pol_node) if (args->policy_nodes) do_get_mempolicy_nodemask(pol, args->policy_nodes); =20 + if (args->il_weights) + do_get_mempolicy_il_weights(pol, args->il_weights); + return 0; } =20 @@ -1519,6 +1561,9 @@ static long kernel_mbind(unsigned long start, unsigne= d long len, if (err) return err; =20 + if (mode & MPOL_WEIGHTED_INTERLEAVE) + mode_flags |=3D MPOL_F_GWEIGHT; + memset(&margs, 0, sizeof(margs)); margs.mode =3D lmode; margs.mode_flags =3D mode_flags; @@ -1609,6 +1654,8 @@ SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigne= d long, len, struct mempolicy_args margs; nodemask_t policy_nodes; unsigned 
long __user *nodes_ptr; + unsigned char weights[MAX_NUMNODES]; + unsigned char __user *weights_ptr; int err; =20 if (!start || !len) @@ -1641,6 +1688,23 @@ SYSCALL_DEFINE5(mbind2, unsigned long, start, unsign= ed long, len, return err; margs.policy_nodes =3D &policy_nodes; =20 + if (kargs.mode =3D=3D MPOL_WEIGHTED_INTERLEAVE) { + weights_ptr =3D u64_to_user_ptr(kargs.il_weights); + if (weights_ptr) { + err =3D copy_struct_from_user(weights, + sizeof(weights), + weights_ptr, + kargs.pol_maxnodes); + if (err) + return err; + margs.il_weights =3D weights; + } else { + margs.il_weights =3D NULL; + margs.mode_flags |=3D MPOL_F_GWEIGHT; + } + } else + margs.il_weights =3D NULL; + return do_mbind(untagged_addr(start), len, &margs, flags); } =20 @@ -1662,6 +1726,9 @@ static long kernel_set_mempolicy(int mode, const unsi= gned long __user *nmask, if (err) return err; =20 + if (mode & MPOL_WEIGHTED_INTERLEAVE) + mode_flags |=3D MPOL_F_GWEIGHT; + memset(&args, 0, sizeof(args)); args.mode =3D lmode; args.mode_flags =3D mode_flags; @@ -1685,6 +1752,8 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __us= er *, uargs, size_t, usize, int err; nodemask_t policy_nodemask; unsigned long __user *nodes_ptr; + unsigned char weights[MAX_NUMNODES]; + unsigned char __user *weights_ptr; =20 if (flags) return -EINVAL; @@ -1710,6 +1779,20 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __u= ser *, uargs, size_t, usize, } else margs.policy_nodes =3D NULL; =20 + if (kargs.mode =3D=3D MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) { + weights_ptr =3D u64_to_user_ptr(kargs.il_weights); + err =3D copy_struct_from_user(weights, + sizeof(weights), + weights_ptr, + kargs.pol_maxnodes); + if (err) + return err; + margs.il_weights =3D weights; + } else { + margs.il_weights =3D NULL; + margs.mode_flags |=3D MPOL_F_GWEIGHT; + } + return do_set_mempolicy(&margs); } =20 @@ -1913,17 +1996,25 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __= user *, uargs, size_t, usize, int err; nodemask_t 
policy_nodemask; unsigned long __user *nodes_ptr; + unsigned char __user *weights_ptr; + unsigned char weights[MAX_NUMNODES]; =20 if (flags & ~(MPOL_F_ADDR)) return -EINVAL; =20 /* initialize any memory liable to be copied to userland */ memset(&margs, 0, sizeof(margs)); + memset(weights, 0, sizeof(weights)); =20 err =3D copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize); if (err) return -EINVAL; =20 + if (kargs.il_weights) + margs.il_weights =3D weights; + else + margs.il_weights =3D NULL; + margs.policy_nodes =3D kargs.pol_nodes ? &policy_nodemask : NULL; if (flags & MPOL_F_ADDR) err =3D do_get_vma_mempolicy(untagged_addr(addr), NULL, &margs); @@ -1944,6 +2035,13 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __u= ser *, uargs, size_t, usize, return err; } =20 + if (kargs.mode =3D=3D MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) { + weights_ptr =3D u64_to_user_ptr(kargs.il_weights); + err =3D copy_to_user(weights_ptr, weights, kargs.pol_maxnodes); + if (err) + return err; + } + return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0; } =20 @@ -2060,13 +2158,18 @@ static unsigned int weighted_interleave_nodes(struc= t mempolicy *policy) { unsigned int next; struct task_struct *me =3D current; + unsigned char next_weight; =20 next =3D next_node_in(me->il_prev, policy->nodes); if (next =3D=3D MAX_NUMNODES) return next; =20 - if (!policy->wil.cur_weight) - policy->wil.cur_weight =3D iw_table[next]; + if (!policy->wil.cur_weight) { + next_weight =3D (policy->flags & MPOL_F_GWEIGHT) ? + iw_table[next] : + policy->wil.weights[next]; + policy->wil.cur_weight =3D next_weight ? 
next_weight : 1; + } =20 policy->wil.cur_weight--; if (!policy->wil.cur_weight) @@ -2140,6 +2243,7 @@ static unsigned int weighted_interleave_nid(struct me= mpolicy *pol, pgoff_t ilx) nodemask_t nodemask =3D pol->nodes; unsigned int target, weight_total =3D 0; int nid; + unsigned char *pol_weights; unsigned char weights[MAX_NUMNODES]; unsigned char weight; =20 @@ -2151,8 +2255,13 @@ static unsigned int weighted_interleave_nid(struct m= empolicy *pol, pgoff_t ilx) return nid; =20 /* Then collect weights on stack and calculate totals */ + if (pol->flags & MPOL_F_GWEIGHT) + pol_weights =3D iw_table; + else + pol_weights =3D pol->wil.weights; + for_each_node_mask(nid, nodemask) { - weight =3D iw_table[nid]; + weight =3D pol_weights[nid]; weight_total +=3D weight; weights[nid] =3D weight; } @@ -2550,6 +2659,7 @@ static unsigned long alloc_pages_bulk_array_weighted_= interleave(gfp_t gfp, unsigned long nr_allocated; unsigned long rounds; unsigned long node_pages, delta; + unsigned char *pol_weights; unsigned char weight; unsigned char weights[MAX_NUMNODES]; unsigned int weight_total =3D 0; @@ -2563,9 +2673,14 @@ static unsigned long alloc_pages_bulk_array_weighted= _interleave(gfp_t gfp, =20 nnodes =3D nodes_weight(nodes); =20 + if (pol->flags & MPOL_F_GWEIGHT) + pol_weights =3D iw_table; + else + pol_weights =3D pol->wil.weights; + /* Collect weights and save them on stack so they don't change */ for_each_node_mask(node, nodes) { - weight =3D iw_table[node]; + weight =3D pol_weights[node]; weight_total +=3D weight; weights[node] =3D weight; } @@ -3090,6 +3205,7 @@ void mpol_shared_policy_init(struct shared_policy *sp= , struct mempolicy *mpol) { int ret; struct mempolicy_args margs; + unsigned char weights[MAX_NUMNODES]; =20 sp->root =3D RB_ROOT; /* empty tree =3D=3D default mempolicy */ rwlock_init(&sp->lock); @@ -3107,6 +3223,11 @@ void mpol_shared_policy_init(struct shared_policy *s= p, struct mempolicy *mpol) margs.mode_flags =3D mpol->flags; margs.policy_nodes =3D 
&mpol->w.user_nodemask; margs.home_node =3D NUMA_NO_NODE; + if (margs.mode =3D=3D MPOL_WEIGHTED_INTERLEAVE && + !(margs.mode_flags & MPOL_F_GWEIGHT)) { + memcpy(weights, mpol->wil.weights, sizeof(weights)); + margs.il_weights =3D weights; + } =20 /* contextualize the tmpfs mount point mempolicy to this file */ npol =3D mpol_new(&margs); --=20 2.39.1
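The weight lookup that this patch adds to weighted_interleave_nodes() can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel implementation: every demo_* name and the DEMO_* constants are hypothetical, the nodemask is reduced to a 32-bit mask, and il_prev is kept in the policy rather than in the task.

```c
#include <stdint.h>

#define DEMO_MAX_NODES 4		/* hypothetical stand-in for MAX_NUMNODES */
#define DEMO_F_GWEIGHT (1u << 5)	/* same bit as MPOL_F_GWEIGHT in the patch */

/* Hypothetical global weight table; models iw_table, all weights 1 here. */
static unsigned char demo_iw_table[DEMO_MAX_NODES] = { 1, 1, 1, 1 };

/* Hypothetical, simplified stand-in for struct mempolicy plus il_prev. */
struct demo_policy {
	uint32_t nodemask;			/* bit n set: node n in policy */
	unsigned int flags;
	unsigned char cur_weight;		/* pages left on current node */
	unsigned char weights[DEMO_MAX_NODES];	/* task-local weights */
	int il_prev;				/* 0..DEMO_MAX_NODES-1 */
};

/* Next node after 'prev' in 'mask', wrapping; DEMO_MAX_NODES if empty. */
static int demo_next_node_in(int prev, uint32_t mask)
{
	for (int i = 1; i <= DEMO_MAX_NODES; i++) {
		int n = (prev + i) % DEMO_MAX_NODES;

		if (mask & (1u << n))
			return n;
	}
	return DEMO_MAX_NODES;
}

/* Pick the node for the next page, following the logic in the patch. */
static int demo_weighted_interleave_node(struct demo_policy *pol)
{
	int next = demo_next_node_in(pol->il_prev, pol->nodemask);
	unsigned char w;

	if (next == DEMO_MAX_NODES)
		return next;

	if (!pol->cur_weight) {
		/* Global table when the flag is set, task-local otherwise. */
		w = (pol->flags & DEMO_F_GWEIGHT) ? demo_iw_table[next]
						  : pol->weights[next];
		pol->cur_weight = w ? w : 1;	/* a zero weight is treated as 1 */
	}

	if (--pol->cur_weight == 0)
		pol->il_prev = next;		/* weight used up: advance */
	return next;
}
```

Starting with il_prev set to DEMO_MAX_NODES - 1, nodes 0 and 1 in the nodemask, and task-local weights {5, 2}, successive calls return node 0 five times, node 1 twice, and then wrap back to node 0. With DEMO_F_GWEIGHT set, the same policy ignores its own weights and alternates between nodes 0 and 1, because the demo global table holds 1 for every node.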