This patch adds the new user-api argument structure intended for
set_mempolicy2 and mbind2.
struct mpol_param {
__u16 mode;
__u16 mode_flags;
__s32 home_node; /* mbind2: policy home node */
__u16 pol_maxnodes;
__u8 resv[6];
__aligned_u64 *pol_nodes;
};
This structure is intended to be extensible as new mempolicy extensions
are added.
For example, set_mempolicy_home_node was added to allow vma mempolicies
to have a preferred/home node assigned. This structure allows the user
to set the home node at the time mempolicy is created, rather than
requiring an additional syscall.
Full breakdown of arguments as of this patch:
mode: Mempolicy mode (MPOL_DEFAULT, MPOL_INTERLEAVE)
mode_flags: Flags previously or'd into mode in set_mempolicy
(e.g.: MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES)
home_node: for mbind2. Allows the setting of a policy's home
with the use of MPOL_MF_HOME_NODE
pol_maxnodes: Max number of nodes in the policy nodemask
pol_nodes: Policy nodemask
The reserved field accounts explicitly for a potential memory hole
in the structure.
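The effect of the reserved bytes can be checked from userspace. The
following is only an illustrative sketch and not part of the patch:
struct mpol_param_sketch and struct mpol_param_nopad are hypothetical
mirrors that use <stdint.h> types in place of __u16/__s32/__aligned_u64,
and mpol_hole_bytes() is a made-up helper. It assumes the usual ABI
where a __s32 is 4-byte aligned and the nodemask field is 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the uapi struct in this patch. */
struct mpol_param_sketch {
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint16_t pol_maxnodes;
	uint8_t  resv[6];               /* explicit filler for the hole */
	_Alignas(8) uint64_t pol_nodes; /* stands in for __aligned_u64 */
};

/* Same fields without resv: the compiler pads the gap implicitly. */
struct mpol_param_nopad {
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint16_t pol_maxnodes;
	_Alignas(8) uint64_t pol_nodes;
};

/* With resv present, every byte of the struct is a named field. */
_Static_assert(offsetof(struct mpol_param_sketch, resv) == 10, "resv offset");
_Static_assert(offsetof(struct mpol_param_sketch, pol_nodes) == 16, "pol_nodes offset");
_Static_assert(sizeof(struct mpol_param_sketch) == 24, "struct size");

/* Size of the implicit hole that resv[] replaces. */
size_t mpol_hole_bytes(void)
{
	size_t end_of_maxnodes =
		offsetof(struct mpol_param_nopad, pol_maxnodes) + sizeof(uint16_t);
	size_t hole = offsetof(struct mpol_param_nopad, pol_nodes) - end_of_maxnodes;

	/* The hole should be exactly as large as the resv array. */
	assert(hole == sizeof(((struct mpol_param_sketch *)0)->resv));
	return hole;
}
```

Under these assumptions both variants have the same size and the same
pol_nodes offset; the difference is only whether the gap is named, so
that it can be zeroed and validated, or left as compiler padding.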
Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
.../admin-guide/mm/numa_memory_policy.rst | 17 +++++++++++++++++
include/linux/syscalls.h | 1 +
include/uapi/linux/mempolicy.h | 9 +++++++++
3 files changed, 27 insertions(+)
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index a70f20ce1ffb..cbfc5f65ed77 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -480,6 +480,23 @@ closest to which page allocation will come from. Specifying the home node overri
the default allocation policy to allocate memory close to the local node for an
executing CPU.
+Extended Mempolicy Arguments::
+
+ struct mpol_param {
+ __u16 mode;
+ __u16 mode_flags;
+ __s32 home_node; /* mbind2: set home node */
+ __u64 pol_maxnodes;
+ __aligned_u64 pol_nodes; /* nodemask pointer */
+ };
+
+The extended mempolicy argument structure is defined to allow the mempolicy
+interfaces future extensibility without the need for additional system calls.
+
+The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
+all interfaces relative to their non-extended counterparts. Each additional
+field may only apply to specific extended interfaces. See the respective
+extended interface man page for more details.
Memory Policy Command Line Interface
====================================
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..fb0b4b2b9bea 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -74,6 +74,7 @@ struct landlock_ruleset_attr;
enum landlock_rule_type;
struct cachestat_range;
struct cachestat;
+struct mpol_param;
#include <linux/types.h>
#include <linux/aio_abi.h>
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 1f9bb10d1a47..109788c8be92 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -27,6 +27,15 @@ enum {
MPOL_MAX, /* always last member of enum */
};
+struct mpol_param {
+ __u16 mode;
+ __u16 mode_flags;
+ __s32 home_node; /* mbind2: policy home node */
+ __u16 pol_maxnodes;
+ __u8 resv[6];
+ __aligned_u64 pol_nodes;
+};
+
/* Flags for set_mempolicy */
#define MPOL_F_STATIC_NODES (1 << 15)
#define MPOL_F_RELATIVE_NODES (1 << 14)
--
2.39.1
Hi,
On 1/3/24 14:42, Gregory Price wrote:
> This patch adds the new user-api argument structure intended for
> set_mempolicy2 and mbind2.
>
> struct mpol_param {
> __u16 mode;
> __u16 mode_flags;
> __s32 home_node; /* mbind2: policy home node */
> __u16 pol_maxnodes;
> __u8 resv[6];
> __aligned_u64 *pol_nodes;
> };
>
> This structure is intended to be extensible as new mempolicy extensions
> are added.
>
> For example, set_mempolicy_home_node was added to allow vma mempolicies
> to have a preferred/home node assigned. This structure allows the user
> to set the home node at the time mempolicy is created, rather than
> requiring an additional syscall.
>
> Full breakdown of arguments as of this patch:
> mode: Mempolicy mode (MPOL_DEFAULT, MPOL_INTERLEAVE)
>
> mode_flags: Flags previously or'd into mode in set_mempolicy
> (e.g.: MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES)
>
> home_node: for mbind2. Allows the setting of a policy's home
> with the use of MPOL_MF_HOME_NODE
>
> pol_maxnodes: Max number of nodes in the policy nodemask
>
> pol_nodes: Policy nodemask
>
> The reserved field accounts explicitly for a potential memory hole
> in the structure.
>
> Suggested-by: Frank van der Linden <fvdl@google.com>
> Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> ---
> .../admin-guide/mm/numa_memory_policy.rst | 17 +++++++++++++++++
> include/linux/syscalls.h | 1 +
> include/uapi/linux/mempolicy.h | 9 +++++++++
> 3 files changed, 27 insertions(+)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index a70f20ce1ffb..cbfc5f65ed77 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -480,6 +480,23 @@ closest to which page allocation will come from. Specifying the home node overri
> the default allocation policy to allocate memory close to the local node for an
> executing CPU.
>
> +Extended Mempolicy Arguments::
> +
> + struct mpol_param {
> + __u16 mode;
> + __u16 mode_flags;
> + __s32 home_node; /* mbind2: set home node */
> + __u64 pol_maxnodes;
> + __aligned_u64 pol_nodes; /* nodemask pointer */
> + };
>
Can you make the above documentation struct agree with the
struct in the header below, please?
(just a difference in the size of pol_maxnodes and the
'resv' bytes)
> +The extended mempolicy argument structure is defined to allow the mempolicy
> +interfaces future extensibility without the need for additional system calls.
> +
> +The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
> +all interfaces relative to their non-extended counterparts. Each additional
> +field may only apply to specific extended interfaces. See the respective
> +extended interface man page for more details.
>
> Memory Policy Command Line Interface
> ====================================
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 1f9bb10d1a47..109788c8be92 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -27,6 +27,15 @@ enum {
> MPOL_MAX, /* always last member of enum */
> };
>
> +struct mpol_param {
> + __u16 mode;
> + __u16 mode_flags;
> + __s32 home_node; /* mbind2: policy home node */
> + __u16 pol_maxnodes;
> + __u8 resv[6];
> + __aligned_u64 pol_nodes;
> +};
> +
> /* Flags for set_mempolicy */
> #define MPOL_F_STATIC_NODES (1 << 15)
> #define MPOL_F_RELATIVE_NODES (1 << 14)
--
#Randy
On Wed, Jan 03, 2024 at 04:19:03PM -0800, Randy Dunlap wrote:
> Hi,
>
> On 1/3/24 14:42, Gregory Price wrote:
> > This patch adds the new user-api argument structure intended for
> > set_mempolicy2 and mbind2.
> >
> > struct mpol_param {
> > __u16 mode;
> > __u16 mode_flags;
> > __s32 home_node; /* mbind2: policy home node */
> > __u16 pol_maxnodes;
> > __u8 resv[6];
> > __aligned_u64 *pol_nodes;
> > };
> >
> >
> > +Extended Mempolicy Arguments::
> > +
> > + struct mpol_param {
> > + __u16 mode;
> > + __u16 mode_flags;
> > + __s32 home_node; /* mbind2: set home node */
> > + __u64 pol_maxnodes;
> > + __aligned_u64 pol_nodes; /* nodemask pointer */
> > + };
> >
>
> Can you make the above documentation struct agree with the
> struct in the header below, please?
> (just a difference in the size of pol_maxnodes and the
> 'resv' bytes)
>
>
*facepalm* made a note to double check this, and then still didn't.
Thank you for reviewing. Will fix in the next pass of feedback.
~Gregory
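
For reference, the mismatch discussed above can be made concrete with a
field-by-field comparison. This is a rough sketch, not code from the
series: struct mpol_param_hdr and struct mpol_param_doc are hypothetical
userspace mirrors of the header and documentation variants, and the
FIELD_* macros and the two helper functions are invented for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the struct in include/uapi/linux/mempolicy.h. */
struct mpol_param_hdr {
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint16_t pol_maxnodes;
	uint8_t  resv[6];
	_Alignas(8) uint64_t pol_nodes;
};

/* Hypothetical mirror of the struct shown in numa_memory_policy.rst. */
struct mpol_param_doc {
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint64_t pol_maxnodes;
	_Alignas(8) uint64_t pol_nodes;
};

/* Size of a member without needing an object of the type. */
#define FIELD_SIZE(type, field) sizeof(((type *)0)->field)

/* True when a field has the same offset and size in both variants. */
#define FIELD_MATCHES(field)                                      \
	(offsetof(struct mpol_param_hdr, field) ==                \
	 offsetof(struct mpol_param_doc, field) &&                \
	 FIELD_SIZE(struct mpol_param_hdr, field) ==              \
	 FIELD_SIZE(struct mpol_param_doc, field))

/* pol_maxnodes: same offset, but 2 bytes in the header vs 8 in the doc. */
int mpol_pol_maxnodes_matches(void)
{
	return FIELD_MATCHES(pol_maxnodes);
}

/* pol_nodes: same offset and size in both variants. */
int mpol_pol_nodes_matches(void)
{
	return FIELD_MATCHES(pol_nodes);
}
```

In this sketch the two variants happen to have the same total size and
the same pol_nodes offset, so the disagreement is limited to the width
of pol_maxnodes and the missing resv bytes, as noted in the review.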
On Wed, Jan 10, 2024 at 02:11:39PM -0500, Andi Kleen wrote:
> > Weighted interleave is a new interleave policy intended to make use
> > of heterogeneous memory environments appearing with CXL.
> >
> > To implement weighted interleave with task-local weights, we need
> > new syscalls capable of passing a weight array. This is the
> > justification for mempolicy2/mbind2 - which are designed to be
> > extensible to capture future policies as well.
>
> I might be late to the party here, but it's not clear to me you really
> need the redesigned system calls. set_mempolicy has one argument left
> so it can be enhanced with a new pointer depending on a bit in mode.
> For mbind() it already uses all arguments, but it has a flags argument.
>
> But it's unclear to me if a fully flexible weight array is really
> needed here anyways. Can some common combinations be encoded in flags instead?
> I assume it's mainly about some nodes getting preference depending on
> some attribute
>

(apologies for the re-send, I had a delivery failure on the list, want
to make sure it gets captured).

This is actually something I haven't written out in the RFC that I
probably should have:

I'm also trying to make it so that a mempolicy can be described in its
entirety with a single syscall. This cannot be done with the existing
interfaces. (see: the existence of set_mempolicy_home_node).

Likewise you cannot fetch the entire mempolicy configuration with a
single get_mempolicy() call. Certainly if task-local weights exist,
there's no room left to add that on either.

You'd really like to know that the policy you set is not changed
between calls to multiple interfaces. Right now, if you want to twiddle
bits in the mempolicy (like home_node), the syscall does:
(*new = *old)+change. That's... not great.

So I did consider extending set_mempolicy() to allow you to twiddle the
weight of a given node, but I was considering another proposal in the
process: process_set_mempolicy and process_mbind.

These interfaces were proposed to allow mempolicy to be changed by an
external task. (This is presently not possible).

That's basically how we got to this current iteration.

Re: fully flexible weight array in the task

At the end of the day, this is really about bandwidth distribution.
For a reasonable set of workloads, the system-global settings should
be sufficient. However, it was at least recommended that we also
explore task-local weights along the way - so here we are.

I'm certainly open to changing this, or even just dropping the
task-local weight requirement entirely, but I did want to consider the
other issues (above) in the process so we don't design ourselves into
a corner if we have to go there anyway.

> So if you add such an attribute, perhaps configurable in sysfs, and
> then have flags like give weight + 1 on attribute, give weight + 2 on
> attribute give weight + 4 on attribute. If more are needed there are more bits.
> That would be a much more compact and simpler interface.
>
> For set_mempolicy either add a flags argument or encode in mode.
>
> It would also shrink the whole patchkit dramatically and be less risky.

I'm certainly not against this idea. It is less flexible, but it does
make it simpler. Another limitation, however, is that you have to make
multiple syscalls if you want to change the weights of multiple nodes.
I wanted to avoid that kind of ambiguity.

If we don't think the external changing interfaces are realistic, then
this is all moot, I'm down for whatever design is feasible.

>
> You perhaps underestimate the cost and risk of designing complex
> kernel interfaces, it all has to be hardened audited fuzzed deployed etc.
>

Definitely not under any crazy impression that something like this is
a quick process. Just iterating as I go and gathering feedback (I think
we're on version 4 or 5 of this idea, and v7 of this patch line :]).

I fully expect the initial chunk of (MPOL_WEIGHTED_INTERLEAVE + sysfs)
will be a merge candidate long before the task-local weights will be,
if only because, as you said, it's a much simpler extension and less
risky.

I appreciate the feedback, happy to keep hacking away,
~Gregory
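
To illustrate what a per-node weight array buys in terms of bandwidth
distribution, here is a minimal sketch of one way page placement could
follow such weights. It is an assumption-laden toy, not the kernel's
implementation and not an interface from this series: the function
weighted_interleave_node() and its parameters are hypothetical, and it
simply treats each weight as the number of consecutive pages a node
receives per round.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the node for the page at index page_idx, given per-node weights.
 * Illustrative only. Returns -1 if every weight is zero.
 */
int weighted_interleave_node(const unsigned char *weights, size_t nr_nodes,
			     unsigned long page_idx)
{
	unsigned long total = 0;
	unsigned long slot;
	size_t node;

	assert(weights != NULL);

	/* One full round places 'total' pages across all nodes. */
	for (node = 0; node < nr_nodes; node++)
		total += weights[node];
	if (total == 0)
		return -1;

	/* Position of this page within the current round. */
	slot = page_idx % total;

	/* Walk the nodes until the slot falls inside a node's share. */
	for (node = 0; node < nr_nodes; node++) {
		if (slot < weights[node])
			return (int)node;
		slot -= weights[node];
	}
	return -1; /* not reached when total > 0 */
}
```

With weights {3, 1}, pages 0 through 2 of each round land on node 0 and
page 3 lands on node 1, so node 0 receives three quarters of the pages.
Passing the whole array in one call is what lets several node weights
change together, which is the property argued for above.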