From: Gregory Price
To: linux-mm@kvack.org
Subject: [PATCH v2 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
Date: Sat, 9 Dec 2023 01:59:21 -0500
Message-Id: <20231209065931.3458-2-gregory.price@memverge.com>
In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com>
References: <20231209065931.3458-1-gregory.price@memverge.com>

From: Rakie Kim

This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN/weight

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/ [1]
  ├── possible_nodes [2]
  └── weighted_interleave [3]
      ├── node0 [4]
      │   └── weight [5]
      └── node1
          └── weight

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for the mempolicy subsystem

[2] possible_nodes: list of possible nodes

    An informational interface which may be used across multiple memory
    policy configurations. Lists the `possible` nodes for which
    configurations may be required. A `possible` node is one which has
    been reserved by the kernel at boot, but may or may not be online.
    For example, the weighted_interleave policy generates a nodeN/
    folder for each possible node N.

[3] weighted_interleave/: configuration interface for the weighted
    interleave policy

[4] weighted_interleave/nodeN/: possible node configurations

[5] weighted_interleave/nodeN/weight: weight for nodeN
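For illustration only (not part of this patch), a minimal userspace
sketch that reads and then updates a node weight through the files
described above. The path and the 1-255 weight range come from this
series' ABI documentation; the chosen weight value is arbitrary, and
the program must run with privileges sufficient to write the sysfs file.

#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/kernel/mm/mempolicy/weighted_interleave/node0/weight";
	FILE *f;
	int weight;

	f = fopen(path, "r");
	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fscanf(f, "%d", &weight) == 1)
		printf("node0 weight: %d\n", weight);
	fclose(f);

	f = fopen(path, "w");
	if (!f) {
		perror("fopen for write");
		return 1;
	}
	fprintf(f, "4\n");	/* valid weights are 1-255 per the ABI doc */
	fclose(f);
	return 0;
}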
Signed-off-by: Rakie Kim
Signed-off-by: Honggyu Kim
Co-developed-by: Gregory Price
Signed-off-by: Gregory Price
Co-developed-by: Hyeongtak Ji
Signed-off-by: Hyeongtak Ji
Suggested-by: Dan Williams
Suggested-by: Frank van der Linden
Suggested-by: Gregory Price
Suggested-by: Hao Wang
Suggested-by: Hasan Al Maruf
Suggested-by: Johannes Weiner
Suggested-by: John Groves
Suggested-by: Jonathan Cameron
Suggested-by: Michal Hocko
Suggested-by: Ravi Jonnalagadda
Suggested-by: Srinivasulu Thanneeru
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Ying Huang
Suggested-by: Zhongkun He
Suggested-by: tj
---
 .../ABI/testing/sysfs-kernel-mm-mempolicy     |  18 ++
 ...fs-kernel-mm-mempolicy-weighted-interleave |  21 +++
 mm/mempolicy.c                                | 169 ++++++++++++++++++
 3 files changed, 208 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..445377dfd232
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,18 @@
+What:		/sys/kernel/mm/mempolicy/
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Interface for Mempolicy
+
+What:		/sys/kernel/mm/mempolicy/possible_nodes
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	The numa nodes which are possible to come online
+
+		A possible numa node is one which has been reserved by the
+		system at boot, but may or may not be online at runtime.
+
+		Example output:
+
+		=========	========================================
+		"0,1,2,3"	nodes 0-3 are possibly online or offline
+		=========	========================================
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..7c19a606725f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,21 @@
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Configuration Interface for the Weighted Interleave policy
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/
+		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/weight
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Weight configuration interface for nodeN
+
+		The interleave weight for a memory node (N). These weights are
+		utilized by processes which have set their mempolicy to
+		MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
+		omitting a task-local weight array.
+
+		These weights only affect new allocations, and changes at runtime
+		will not cause migrations on already allocated pages.
+ + Minimum weight: 1 + Maximum weight: 255 diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 10a590ee1c89..28dfae195beb 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -131,6 +131,8 @@ static struct mempolicy default_policy =3D { =20 static struct mempolicy preferred_node_policy[MAX_NUMNODES]; =20 +static char iw_table[MAX_NUMNODES]; + /** * numa_nearest_node - Find nearest node by state * @node: Node id to start the search @@ -3067,3 +3069,170 @@ void mpol_to_str(char *buffer, int maxlen, struct m= empolicy *pol) p +=3D scnprintf(p, buffer + maxlen - p, ":%*pbl", nodemask_pr_args(&nodes)); } + +struct iw_node_info { + struct kobject kobj; + int nid; +}; + +static ssize_t node_weight_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct iw_node_info *node_info =3D container_of(kobj, struct iw_node_info, + kobj); + return sysfs_emit(buf, "%d\n", iw_table[node_info->nid]); +} + +static ssize_t node_weight_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned char weight =3D 0; + struct iw_node_info *node_info =3D NULL; + + node_info =3D container_of(kobj, struct iw_node_info, kobj); + + if (kstrtou8(buf, 0, &weight) || !weight) + return -EINVAL; + + iw_table[node_info->nid] =3D weight; + + return count; +} + +static struct kobj_attribute node_weight =3D + __ATTR(weight, 0664, node_weight_show, node_weight_store); + +static struct attribute *dst_node_attrs[] =3D { + &node_weight.attr, + NULL, +}; + +static struct attribute_group dst_node_attr_group =3D { + .attrs =3D dst_node_attrs, +}; + +static const struct attribute_group *dst_node_attr_groups[] =3D { + &dst_node_attr_group, + NULL, +}; + +static const struct kobj_type dst_node_kobj_ktype =3D { + .sysfs_ops =3D &kobj_sysfs_ops, + .default_groups =3D dst_node_attr_groups, +}; + +static int add_weight_node(int nid, struct kobject *src_kobj) +{ + struct iw_node_info *node_info =3D NULL; + int ret; + + node_info =3D kzalloc(sizeof(struct iw_node_info), GFP_KERNEL); + if (!node_info) + return -ENOMEM; + node_info->nid =3D nid; + + kobject_init(&node_info->kobj, &dst_node_kobj_ktype); + ret =3D kobject_add(&node_info->kobj, src_kobj, "node%d", nid); + if (ret) { + pr_err("kobject_add error [node%d]: %d", nid, ret); + kobject_put(&node_info->kobj); + } + return ret; +} + +static int add_weighted_interleave_group(struct kobject *root_kobj) +{ + struct kobject *wi_kobj; + int nid, err; + + wi_kobj =3D kobject_create_and_add("weighted_interleave", root_kobj); + if (!wi_kobj) { + pr_err("failed to create node kobject\n"); + return -ENOMEM; + } + + for_each_node_state(nid, N_POSSIBLE) { + err =3D add_weight_node(nid, wi_kobj); + if (err) { + pr_err("failed to add sysfs [node%d]\n", nid); + break; + } + } + if (err) + kobject_put(wi_kobj); + return 0; + +} + +static ssize_t possible_nodes_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + int nid, next_nid; + int len =3D 0; + + for_each_node_state(nid, N_POSSIBLE) { + len +=3D sysfs_emit_at(buf, len, "%d", nid); + next_nid =3D next_node(nid, node_states[N_POSSIBLE]); + if (next_nid < MAX_NUMNODES) + len +=3D sysfs_emit_at(buf, len, ","); + } + len +=3D sysfs_emit_at(buf, len, "\n"); + + return len; +} + +static struct kobj_attribute possible_nodes_attr =3D __ATTR_RO(possible_no= des); + +static struct attribute *mempolicy_attrs[] =3D { + &possible_nodes_attr.attr, + NULL, +}; + +static const struct attribute_group mempolicy_attr_group =3D { + .attrs =3D mempolicy_attrs, + NULL, +}; + +static void 
mempolicy_kobj_release(struct kobject *kobj)
+{
+	kfree(kobj);
+}
+
+static const struct kobj_type mempolicy_kobj_ktype = {
+	.release = mempolicy_kobj_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+static int __init mempolicy_sysfs_init(void)
+{
+	int err;
+	struct kobject *root_kobj;
+
+	memset(&iw_table, 1, sizeof(iw_table));
+
+	root_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+	if (!root_kobj)
+		return -ENOMEM;
+
+	kobject_init(root_kobj, &mempolicy_kobj_ktype);
+	err = kobject_add(root_kobj, mm_kobj, "mempolicy");
+	if (err) {
+		pr_err("failed to add kobject to the system\n");
+		goto fail_obj;
+	}
+
+	err = sysfs_create_group(root_kobj, &mempolicy_attr_group);
+	if (err) {
+		pr_err("failed to register mempolicy group\n");
+		goto fail_obj;
+	}
+
+	err = add_weighted_interleave_group(root_kobj);
+fail_obj:
+	if (err)
+		kobject_put(root_kobj);
+	return err;
+
+}
+late_initcall(mempolicy_sysfs_init);
-- 
2.39.1

From: Gregory Price
To: linux-mm@kvack.org
Subject: [PATCH v2 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
Date: Sat, 9 Dec 2023 01:59:22 -0500
Message-Id: <20231209065931.3458-3-gregory.price@memverge.com>
In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com>
References: <20231209065931.3458-1-gregory.price@memverge.com>

From: Rakie Kim

When a system has multiple NUMA nodes and it becomes bandwidth hungry,
the current MPOL_INTERLEAVE can be a wise option.

However, if those NUMA nodes consist of different types of memory, such
as local DRAM and CXL memory together, the current round-robin based
interleaving policy does not maximize the overall bandwidth because of
their different bandwidth characteristics.

Instead, interleaving can be more efficient when the allocation policy
follows each NUMA node's bandwidth weight rather than a 1:1 round-robin
allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
which enables weighted interleaving between NUMA nodes. Weighted
interleave allows for a proportional distribution of memory across
multiple NUMA nodes, preferably apportioned to match the bandwidth
capacity of each node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight array exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the set
weights. For example, if the weights are (2,1), then 2 pages will be
allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

There are 3 integration points:

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node. When the weight reaches 0, switch to
    the next node. Applied by `mempolicy_slab_node()` and
    `policy_nodemask()`.

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Applied by `policy_nodemask()` and `mpol_misplaced()`.

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round"). Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight (pol->cur_weight) will be allocated first, before
    the remaining bulk calculation is done. This simplifies the
    calculation at the cost of an additional allocation call.

One piece of complexity is the interaction with a recent refactor which
split the logic that acquires the "ilx" (interleave index) of an
allocation from the actual application of the interleave. The
calculation of the `interleave index` is done by `get_vma_policy()`,
while the actual selection of the node will later be applied by the
relevant weighted_interleave function.
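To make the usage concrete, here is a minimal userspace sketch (not part
of this series) that opts a task into the new mode over nodes 0-1 and
relies on the global sysfs weights. The MPOL_WEIGHTED_INTERLEAVE value
(6) follows the uapi enum ordering added by this patch; the node numbers
and the (2,1) weights are assumptions for illustration.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6	/* per this series' uapi enum */
#endif

int main(void)
{
	unsigned long nodemask = (1UL << 0) | (1UL << 1);	/* nodes 0-1 */

	/* maxnode must cover the highest set bit in the mask */
	if (syscall(SYS_set_mempolicy, MPOL_WEIGHTED_INTERLEAVE,
		    &nodemask, 8 * sizeof(nodemask)) != 0) {
		perror("set_mempolicy");
		return 1;
	}

	/*
	 * With weights (2,1) configured in sysfs, subsequent allocations
	 * land roughly 2 pages on node0 for every 1 page on node1.
	 */
	return 0;
}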
Suggested-by: Hasan Al Maruf
Signed-off-by: Rakie Kim
Co-developed-by: Honggyu Kim
Signed-off-by: Honggyu Kim
Co-developed-by: Hyeongtak Ji
Signed-off-by: Hyeongtak Ji
Co-developed-by: Gregory Price
Signed-off-by: Gregory Price
Co-developed-by: Srinivasulu Thanneeru
Signed-off-by: Srinivasulu Thanneeru
Co-developed-by: Ravi Jonnalagadda
Signed-off-by: Ravi Jonnalagadda
Suggested-by: Dan Williams
Suggested-by: Frank van der Linden
Suggested-by: Gregory Price
Suggested-by: Hao Wang
Suggested-by: Hasan Al Maruf
Suggested-by: Johannes Weiner
Suggested-by: John Groves
Suggested-by: Jonathan Cameron
Suggested-by: Michal Hocko
Suggested-by: Ravi Jonnalagadda
Suggested-by: Srinivasulu Thanneeru
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Ying Huang
Suggested-by: Zhongkun He
Suggested-by: tj
---
 .../admin-guide/mm/numa_memory_policy.rst     |  11 +
 include/linux/mempolicy.h                     |   5 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 197 +++++++++++++++++-
 4 files changed, 211 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..d2c8e712785b 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,17 @@ MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes. This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
 
+MPOL_WEIGHTED_INTERLEAVE
+	This mode operates the same as MPOL_INTERLEAVE, except that
+	interleaving behavior is executed based on weights set in
+	/sys/kernel/mm/mempolicy/weighted_interleave/
+
+	Weighted interleave allocates pages on nodes according to
+	their weight. For example, if nodes [0,1] are weighted [5,2]
+	respectively, 5 pages will be allocated on node0 for every
+	2 pages allocated on node1. This can better distribute data
+	according to bandwidth on heterogeneous memory systems.
+ NUMA memory policy supports the following optional mode flags: =20 MPOL_F_STATIC_NODES diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 931b118336f4..ba09167e80f7 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -54,6 +54,11 @@ struct mempolicy { nodemask_t cpuset_mems_allowed; /* relative to these nodes */ nodemask_t user_nodemask; /* nodemask passed by user */ } w; + + /* Weighted interleave settings */ + struct { + unsigned char cur_weight; + } wil; }; =20 /* diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index a8963f7ef4c2..1f9bb10d1a47 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -23,6 +23,7 @@ enum { MPOL_INTERLEAVE, MPOL_LOCAL, MPOL_PREFERRED_MANY, + MPOL_WEIGHTED_INTERLEAVE, MPOL_MAX, /* always last member of enum */ }; =20 diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 28dfae195beb..b4d94646e6a2 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -305,6 +305,7 @@ static struct mempolicy *mpol_new(unsigned short mode, = unsigned short flags, policy->mode =3D mode; policy->flags =3D flags; policy->home_node =3D NUMA_NO_NODE; + policy->wil.cur_weight =3D 0; =20 return policy; } @@ -417,6 +418,10 @@ static const struct mempolicy_operations mpol_ops[MPOL= _MAX] =3D { .create =3D mpol_new_nodemask, .rebind =3D mpol_rebind_preferred, }, + [MPOL_WEIGHTED_INTERLEAVE] =3D { + .create =3D mpol_new_nodemask, + .rebind =3D mpol_rebind_nodemask, + }, }; =20 static bool migrate_folio_add(struct folio *folio, struct list_head *folio= list, @@ -838,7 +843,8 @@ static long do_set_mempolicy(unsigned short mode, unsig= ned short flags, =20 old =3D current->mempolicy; current->mempolicy =3D new; - if (new && new->mode =3D=3D MPOL_INTERLEAVE) + if (new && (new->mode =3D=3D MPOL_INTERLEAVE || + new->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE)) current->il_prev =3D MAX_NUMNODES-1; task_unlock(current); mpol_put(old); @@ -864,6 +870,7 @@ static void get_policy_nodemask(struct mempolicy *pol, = nodemask_t *nodes) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: *nodes =3D pol->nodes; break; case MPOL_LOCAL: @@ -948,6 +955,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *= nmask, } else if (pol =3D=3D current->mempolicy && pol->mode =3D=3D MPOL_INTERLEAVE) { *policy =3D next_node_in(current->il_prev, pol->nodes); + } else if (pol =3D=3D current->mempolicy && + (pol->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE)) { + if (pol->wil.cur_weight) + *policy =3D current->il_prev; + else + *policy =3D next_node_in(current->il_prev, + pol->nodes); } else { err =3D -EINVAL; goto out; @@ -1777,7 +1791,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struc= t *vma, pol =3D __get_vma_policy(vma, addr, ilx); if (!pol) pol =3D get_task_policy(current); - if (pol->mode =3D=3D MPOL_INTERLEAVE) { + if (pol->mode =3D=3D MPOL_INTERLEAVE || + pol->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE) { *ilx +=3D vma->vm_pgoff >> order; *ilx +=3D (addr - vma->vm_start) >> (PAGE_SHIFT + order); } @@ -1827,6 +1842,24 @@ bool apply_policy_zone(struct mempolicy *policy, enu= m zone_type zone) return zone >=3D dynamic_policy_zone; } =20 +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) +{ + unsigned int next; + struct task_struct *me =3D current; + + next =3D next_node_in(me->il_prev, policy->nodes); + if (next =3D=3D MAX_NUMNODES) + return next; + + if (!policy->wil.cur_weight) + policy->wil.cur_weight =3D iw_table[next]; + + policy->wil.cur_weight--; 
+ if (!policy->wil.cur_weight) + me->il_prev =3D next; + return next; +} + /* Do dynamic interleaving for a process */ static unsigned int interleave_nodes(struct mempolicy *policy) { @@ -1861,6 +1894,9 @@ unsigned int mempolicy_slab_node(void) case MPOL_INTERLEAVE: return interleave_nodes(policy); =20 + case MPOL_WEIGHTED_INTERLEAVE: + return weighted_interleave_nodes(policy); + case MPOL_BIND: case MPOL_PREFERRED_MANY: { @@ -1885,6 +1921,41 @@ unsigned int mempolicy_slab_node(void) } } =20 +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t= ilx) +{ + nodemask_t nodemask =3D pol->nodes; + unsigned int target, weight_total =3D 0; + int nid; + unsigned char weights[MAX_NUMNODES]; + unsigned char weight; + + barrier(); + + /* first ensure we have a valid nodemask */ + nid =3D first_node(nodemask); + if (nid =3D=3D MAX_NUMNODES) + return nid; + + /* Then collect weights on stack and calculate totals */ + for_each_node_mask(nid, nodemask) { + weight =3D iw_table[nid]; + weight_total +=3D weight; + weights[nid] =3D weight; + } + + /* Finally, calculate the node offset based on totals */ + target =3D (unsigned int)ilx % weight_total; + nid =3D first_node(nodemask); + while (target) { + weight =3D weights[nid]; + if (target < weight) + break; + target -=3D weight; + nid =3D next_node_in(nid, nodemask); + } + return nid; +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=3D0), wrapping around if ilx @@ -1953,6 +2024,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct= mempolicy *pol, *nid =3D (ilx =3D=3D NO_INTERLEAVE_INDEX) ? interleave_nodes(pol) : interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + *nid =3D (ilx =3D=3D NO_INTERLEAVE_INDEX) ? + weighted_interleave_nodes(pol) : + weighted_interleave_nid(pol, ilx); + break; } =20 return nodemask; @@ -2014,6 +2090,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: *mask =3D mempolicy->nodes; break; =20 @@ -2113,7 +2190,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int= order, * If the policy is interleave or does not allow the current * node in its nodemask, we allocate the standard way. 
*/ - if (pol->mode !=3D MPOL_INTERLEAVE && + if ((pol->mode !=3D MPOL_INTERLEAVE && + pol->mode !=3D MPOL_WEIGHTED_INTERLEAVE) && (!nodemask || node_isset(nid, *nodemask))) { /* * First, try to allocate THP only on local node, but @@ -2249,6 +2327,106 @@ static unsigned long alloc_pages_bulk_array_interle= ave(gfp_t gfp, return total_allocated; } =20 +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, + struct mempolicy *pol, unsigned long nr_pages, + struct page **page_array) +{ + struct task_struct *me =3D current; + unsigned long total_allocated =3D 0; + unsigned long nr_allocated; + unsigned long rounds; + unsigned long node_pages, delta; + unsigned char weight; + unsigned char weights[MAX_NUMNODES]; + unsigned int weight_total; + unsigned long rem_pages =3D nr_pages; + nodemask_t nodes =3D pol->nodes; + int nnodes, node, prev_node; + int i; + + /* Stabilize the nodemask on the stack */ + barrier(); + + nnodes =3D nodes_weight(nodes); + + /* Collect weights and save them on stack so they don't change */ + for_each_node_mask(node, nodes) { + weight =3D iw_table[node]; + weight_total +=3D weight; + weights[node] =3D weight; + } + + /* Continue allocating from most recent node and adjust the nr_pages */ + if (pol->wil.cur_weight) { + node =3D next_node_in(me->il_prev, nodes); + node_pages =3D pol->wil.cur_weight; + if (node_pages > rem_pages) + node_pages =3D rem_pages; + nr_allocated =3D __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array +=3D nr_allocated; + total_allocated +=3D nr_allocated; + /* if that's all the pages, no need to interleave */ + if (rem_pages <=3D pol->wil.cur_weight) { + pol->wil.cur_weight -=3D rem_pages; + return total_allocated; + } + /* Otherwise we adjust nr_pages down, and continue from there */ + rem_pages -=3D pol->wil.cur_weight; + pol->wil.cur_weight =3D 0; + prev_node =3D node; + } + + /* Now we can continue allocating as if from 0 instead of an offset */ + rounds =3D rem_pages / weight_total; + delta =3D rem_pages % weight_total; + for (i =3D 0; i < nnodes; i++) { + node =3D next_node_in(prev_node, nodes); + weight =3D weights[node]; + node_pages =3D weight * rounds; + if (delta) { + if (delta > weight) { + node_pages +=3D weight; + delta -=3D weight; + } else { + node_pages +=3D delta; + delta =3D 0; + } + } + /* We may not make it all the way around */ + if (!node_pages) + break; + /* If an over-allocation would occur, floor it */ + if (node_pages + total_allocated > nr_pages) { + node_pages =3D nr_pages - total_allocated; + delta =3D 0; + } + nr_allocated =3D __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array +=3D nr_allocated; + total_allocated +=3D nr_allocated; + prev_node =3D node; + } + + /* + * Finally, we need to update me->il_prev and pol->wil.cur_weight + * if there were overflow pages, but not equivalent to the node + * weight, set the cur_weight to node_weight - delta and the + * me->il_prev to the previous node. 
Otherwise if it was perfect + * we can simply set il_prev to node and cur_weight to 0 + */ + if (node_pages) { + me->il_prev =3D prev_node; + node_pages %=3D weight; + pol->wil.cur_weight =3D weight - node_pages; + } else { + me->il_prev =3D node; + pol->wil.cur_weight =3D 0; + } + + return total_allocated; +} + static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int = nid, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) @@ -2289,6 +2467,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t= gfp, return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages, page_array); =20 + if (pol->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE) + return alloc_pages_bulk_array_weighted_interleave(gfp, pol, + nr_pages, + page_array); + if (pol->mode =3D=3D MPOL_PREFERRED_MANY) return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array); @@ -2364,6 +2547,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempoli= cy *b) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: return !!nodes_equal(a->nodes, b->nodes); case MPOL_LOCAL: return true; @@ -2500,6 +2684,10 @@ int mpol_misplaced(struct folio *folio, struct vm_ar= ea_struct *vma, polnid =3D interleave_nid(pol, ilx); break; =20 + case MPOL_WEIGHTED_INTERLEAVE: + polnid =3D weighted_interleave_nid(pol, ilx); + break; + case MPOL_PREFERRED: if (node_isset(curnid, pol->nodes)) goto out; @@ -2874,6 +3062,7 @@ static const char * const policy_modes[] =3D [MPOL_PREFERRED] =3D "prefer", [MPOL_BIND] =3D "bind", [MPOL_INTERLEAVE] =3D "interleave", + [MPOL_WEIGHTED_INTERLEAVE] =3D "weighted interleave", [MPOL_LOCAL] =3D "local", [MPOL_PREFERRED_MANY] =3D "prefer (many)", }; @@ -2933,6 +3122,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) } break; case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: /* * Default to online nodes with memory if no nodelist */ @@ -3043,6 +3233,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mem= policy *pol) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: nodes =3D pol->nodes; break; default: --=20 2.39.1 From nobody Fri Dec 19 11:34:02 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CA4A1C10F05 for ; Sat, 9 Dec 2023 06:59:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234354AbjLIG7t (ORCPT ); Sat, 9 Dec 2023 01:59:49 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45010 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234366AbjLIG7f (ORCPT ); Sat, 9 Dec 2023 01:59:35 -0500 Received: from mail-yw1-x1142.google.com (mail-yw1-x1142.google.com [IPv6:2607:f8b0:4864:20::1142]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A478510D0; Fri, 8 Dec 2023 22:59:41 -0800 (PST) Received: by mail-yw1-x1142.google.com with SMTP id 00721157ae682-5d77a1163faso21704207b3.0; Fri, 08 Dec 2023 22:59:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1702105181; x=1702709981; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=i7HyyIuO4SKyKyBceL5BuQSQ5m+etdZ0u+23B3P6Igo=; 
From: Gregory Price
To: linux-mm@kvack.org
Subject: [PATCH v2 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
Date: Sat, 9 Dec 2023 01:59:23 -0500
Message-Id: <20231209065931.3458-4-gregory.price@memverge.com>
In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com>
References: <20231209065931.3458-1-gregory.price@memverge.com>

Split sanitize_mpol_flags into sanitize and validate.

Sanitize is used by set_mempolicy to split (int mode) into mode and
mode_flags, and then validates them. Validate validates already-split
flags. Validate will be reused for new syscalls that accept
already-split mode and mode_flags.
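As a standalone illustration of the split described above (this is not
the kernel code itself): set_mempolicy(2) packs the mode flags into the
high bits of its int argument; "sanitize" separates them, and "validate"
then checks the already-split pair. The flag values below mirror
include/uapi/linux/mempolicy.h; everything else is a stand-in.

#include <stdio.h>

#define MPOL_BIND		2
#define MPOL_F_STATIC_NODES	(1 << 15)
#define MPOL_F_RELATIVE_NODES	(1 << 14)
#define MPOL_F_NUMA_BALANCING	(1 << 13)
#define MPOL_MODE_FLAGS \
	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)

int main(void)
{
	int packed = MPOL_BIND | MPOL_F_STATIC_NODES;	/* what userspace passes */
	unsigned short flags = packed & MPOL_MODE_FLAGS;	/* "sanitize" */
	unsigned short mode  = packed & ~MPOL_MODE_FLAGS;

	/* "validate" would now check mode < MPOL_MAX and flag combinations */
	printf("mode=%u flags=0x%x\n", mode, flags);
	return 0;
}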
Signed-off-by: Gregory Price Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Michal Hocko Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Vinicius Tavares Petrucci Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- mm/mempolicy.c | 29 ++++++++++++++++++++++------- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b4d94646e6a2..65d023720e83 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1463,24 +1463,39 @@ static int copy_nodes_to_user(unsigned long __user = *mask, unsigned long maxnode, return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0; } =20 -/* Basic parameter sanity check used by both mbind() and set_mempolicy() */ -static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) +/* + * Basic parameter sanity check used by mbind/set_mempolicy + * May modify flags to include internal flags (e.g. MPOL_F_MOF/F_MORON) + */ +static inline int validate_mpol_flags(unsigned short mode, unsigned short = *flags) { - *flags =3D *mode & MPOL_MODE_FLAGS; - *mode &=3D ~MPOL_MODE_FLAGS; - - if ((unsigned int)(*mode) >=3D MPOL_MAX) + if ((unsigned int)(mode) >=3D MPOL_MAX) return -EINVAL; if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) return -EINVAL; if (*flags & MPOL_F_NUMA_BALANCING) { - if (*mode !=3D MPOL_BIND) + if (mode !=3D MPOL_BIND) return -EINVAL; *flags |=3D (MPOL_F_MOF | MPOL_F_MORON); } return 0; } =20 +/* + * Used by mbind/set_memplicy to split and validate mode/flags + * set_mempolicy combines (mode | flags), split them out into separate + * fields and return just the mode in mode_arg and flags in flags. 
+ */
+static inline int sanitize_mpol_flags(int *mode_arg, unsigned short *flags)
+{
+	unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS);
+
+	*flags = *mode_arg & MPOL_MODE_FLAGS;
+	*mode_arg = mode;
+
+	return validate_mpol_flags(mode, flags);
+}
+
 static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
-- 
2.39.1

From: Gregory Price
To: linux-mm@kvack.org
Subject: [PATCH v2 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies
Date: Sat, 9 Dec 2023 01:59:24 -0500
Message-Id: <20231209065931.3458-5-gregory.price@memverge.com>
In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com>
References: <20231209065931.3458-1-gregory.price@memverge.com>

This patch adds a new kernel structure `struct mempolicy_args`,
intended to be used for an extensible get/set_mempolicy interface.
This implements the fields required to support the existing syscall
interfaces, but does not expose any user-facing arg structure.

mpol_new is refactored to take the argument structure so that future
mempolicy extensions can all be managed in the mempolicy constructor.

The get_mempolicy and mbind syscalls are refactored to utilize the
new argument structure, as are all the callers of mpol_new() and
do_set_mempolicy.

Signed-off-by: Gregory Price
Suggested-by: Dan Williams
Suggested-by: Frank van der Linden
Suggested-by: Gregory Price
Suggested-by: Hao Wang
Suggested-by: Hasan Al Maruf
Suggested-by: Johannes Weiner
Suggested-by: John Groves
Suggested-by: Jonathan Cameron
Suggested-by: Michal Hocko
Suggested-by: Ravi Jonnalagadda
Suggested-by: Srinivasulu Thanneeru
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Ying Huang
Suggested-by: Zhongkun He
Suggested-by: tj
---
 include/linux/mempolicy.h | 14 ++++++++
 mm/mempolicy.c            | 69 +++++++++++++++++++++++++++++----------
 2 files changed, 65 insertions(+), 18 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ba09167e80f7..117c5395c6eb 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -61,6 +61,20 @@ struct mempolicy {
 	} wil;
 };
 
+/*
+ * Describes settings of a mempolicy during set/get syscalls and
+ * kernel internal calls to do_set_mempolicy()
+ */
+struct mempolicy_args {
+	unsigned short mode;		/* policy mode */
+	unsigned short mode_flags;	/* policy mode flags */
+	nodemask_t *policy_nodes;	/* get/set/mbind */
+	int policy_node;		/* get: policy node information */
+	unsigned long addr;		/* get: vma address */
+	int addr_node;			/* get: node the address belongs to */
+	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
+};
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 65d023720e83..324dbf1782df 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -265,10 +265,12 @@ static int mpol_set_nodemask(struct mempolicy *pol, * This function just creates a new policy, does some check and simple * initialization. You must invoke mpol_set_nodemask() to set nodes. */ -static struct mempolicy *mpol_new(unsigned short mode, unsigned short flag= s, - nodemask_t *nodes) +static struct mempolicy *mpol_new(struct mempolicy_args *args) { struct mempolicy *policy; + unsigned short mode =3D args->mode; + unsigned short flags =3D args->mode_flags; + nodemask_t *nodes =3D args->policy_nodes; =20 if (mode =3D=3D MPOL_DEFAULT) { if (nodes && !nodes_empty(*nodes)) @@ -817,8 +819,7 @@ static int mbind_range(struct vma_iterator *vmi, struct= vm_area_struct *vma, } =20 /* Set the process memory policy */ -static long do_set_mempolicy(unsigned short mode, unsigned short flags, - nodemask_t *nodes) +static long do_set_mempolicy(struct mempolicy_args *args) { struct mempolicy *new, *old; NODEMASK_SCRATCH(scratch); @@ -827,14 +828,14 @@ static long do_set_mempolicy(unsigned short mode, uns= igned short flags, if (!scratch) return -ENOMEM; =20 - new =3D mpol_new(mode, flags, nodes); + new =3D mpol_new(args); if (IS_ERR(new)) { ret =3D PTR_ERR(new); goto out; } =20 task_lock(current); - ret =3D mpol_set_nodemask(new, nodes, scratch); + ret =3D mpol_set_nodemask(new, args->policy_nodes, scratch); if (ret) { task_unlock(current); mpol_put(new); @@ -1232,8 +1233,7 @@ static struct folio *alloc_migration_target_by_mpol(s= truct folio *src, #endif =20 static long do_mbind(unsigned long start, unsigned long len, - unsigned short mode, unsigned short mode_flags, - nodemask_t *nmask, unsigned long flags) + struct mempolicy_args *margs, unsigned long flags) { struct mm_struct *mm =3D current->mm; struct vm_area_struct *vma, *prev; @@ -1253,7 +1253,7 @@ static long do_mbind(unsigned long start, unsigned lo= ng len, if (start & ~PAGE_MASK) return -EINVAL; =20 - if (mode =3D=3D MPOL_DEFAULT) + if (margs->mode =3D=3D MPOL_DEFAULT) flags &=3D ~MPOL_MF_STRICT; =20 len =3D PAGE_ALIGN(len); @@ -1264,7 +1264,7 @@ static long do_mbind(unsigned long start, unsigned lo= ng len, if (end =3D=3D start) return 0; =20 - new =3D mpol_new(mode, mode_flags, nmask); + new =3D mpol_new(margs); if (IS_ERR(new)) return PTR_ERR(new); =20 @@ -1281,7 +1281,8 @@ static long do_mbind(unsigned long start, unsigned lo= ng len, NODEMASK_SCRATCH(scratch); if (scratch) { mmap_write_lock(mm); - err =3D mpol_set_nodemask(new, nmask, scratch); + err =3D mpol_set_nodemask(new, margs->policy_nodes, + scratch); if (err) mmap_write_unlock(mm); } else @@ -1295,7 +1296,7 @@ static long do_mbind(unsigned long start, unsigned lo= ng len, * Lock the VMAs before scanning for pages to migrate, * to ensure we don't miss a concurrently inserted page. 
*/ - nr_failed =3D queue_pages_range(mm, start, end, nmask, + nr_failed =3D queue_pages_range(mm, start, end, margs->policy_nodes, flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist); =20 if (nr_failed < 0) { @@ -1500,6 +1501,7 @@ static long kernel_mbind(unsigned long start, unsigne= d long len, unsigned long mode, const unsigned long __user *nmask, unsigned long maxnode, unsigned int flags) { + struct mempolicy_args margs; unsigned short mode_flags; nodemask_t nodes; int lmode =3D mode; @@ -1514,7 +1516,12 @@ static long kernel_mbind(unsigned long start, unsign= ed long len, if (err) return err; =20 - return do_mbind(start, len, lmode, mode_flags, &nodes, flags); + memset(&margs, 0, sizeof(margs)); + margs.mode =3D lmode; + margs.mode_flags =3D mode_flags; + margs.policy_nodes =3D &nodes; + + return do_mbind(start, len, &margs, flags); } =20 SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned lo= ng, len, @@ -1595,6 +1602,7 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned= long, len, static long kernel_set_mempolicy(int mode, const unsigned long __user *nma= sk, unsigned long maxnode) { + struct mempolicy_args args; unsigned short mode_flags; nodemask_t nodes; int lmode =3D mode; @@ -1608,7 +1616,12 @@ static long kernel_set_mempolicy(int mode, const uns= igned long __user *nmask, if (err) return err; =20 - return do_set_mempolicy(lmode, mode_flags, &nodes); + memset(&args, 0, sizeof(args)); + args.mode =3D lmode; + args.mode_flags =3D mode_flags; + args.policy_nodes =3D &nodes; + + return do_set_mempolicy(&args); } =20 SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nm= ask, @@ -2890,6 +2903,7 @@ static int shared_policy_replace(struct shared_policy= *sp, pgoff_t start, void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *m= pol) { int ret; + struct mempolicy_args margs; =20 sp->root =3D RB_ROOT; /* empty tree =3D=3D default mempolicy */ rwlock_init(&sp->lock); @@ -2902,8 +2916,12 @@ void mpol_shared_policy_init(struct shared_policy *s= p, struct mempolicy *mpol) if (!scratch) goto put_mpol; =20 + memset(&margs, 0, sizeof(margs)); + margs.mode =3D mpol->mode; + margs.mode_flags =3D mpol->flags; + margs.policy_nodes =3D &mpol->w.user_nodemask; /* contextualize the tmpfs mount point mempolicy to this file */ - npol =3D mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask); + npol =3D mpol_new(&margs); if (IS_ERR(npol)) goto free_scratch; /* no valid nodemask intersection */ =20 @@ -3011,6 +3029,7 @@ static inline void __init check_numabalancing_enable(= void) =20 void __init numa_policy_init(void) { + struct mempolicy_args args; nodemask_t interleave_nodes; unsigned long largest =3D 0; int nid, prefer =3D 0; @@ -3056,7 +3075,11 @@ void __init numa_policy_init(void) if (unlikely(nodes_empty(interleave_nodes))) node_set(prefer, interleave_nodes); =20 - if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes)) + memset(&args, 0, sizeof(args)); + args.mode =3D MPOL_INTERLEAVE; + args.policy_nodes =3D &interleave_nodes; + + if (do_set_mempolicy(&args)) pr_err("%s: interleaving failed\n", __func__); =20 check_numabalancing_enable(); @@ -3065,7 +3088,12 @@ void __init numa_policy_init(void) /* Reset policy of current process to default */ void numa_default_policy(void) { - do_set_mempolicy(MPOL_DEFAULT, 0, NULL); + struct mempolicy_args args; + + memset(&args, 0, sizeof(args)); + args.mode =3D MPOL_DEFAULT; + + do_set_mempolicy(&args); } =20 /* @@ -3095,6 +3123,7 @@ static const char * const policy_modes[] =3D */ int 
mpol_parse_str(char *str, struct mempolicy **mpol)
 {
+	struct mempolicy_args margs;
 	struct mempolicy *new = NULL;
 	unsigned short mode_flags;
 	nodemask_t nodes;
@@ -3181,7 +3210,11 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		goto out;
 	}
 
-	new = mpol_new(mode, mode_flags, &nodes);
+	memset(&margs, 0, sizeof(margs));
+	margs.mode = mode;
+	margs.mode_flags = mode_flags;
+	margs.policy_nodes = &nodes;
+	new = mpol_new(&margs);
 	if (IS_ERR(new))
 		goto out;
 
-- 
2.39.1

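A standalone sketch of the calling convention the mempolicy_args
refactor above moves to (this is not the kernel code): callers fill a
single args struct instead of passing a growing list of positional
parameters, so new fields can be added without touching every call
site. The types and the receiving function below are simplified
stand-ins for illustration.

#include <stdio.h>
#include <string.h>

struct mempolicy_args {
	unsigned short mode;
	unsigned short mode_flags;
	unsigned long *policy_nodes;	/* stand-in for nodemask_t * */
	int home_node;
};

static int do_set_mempolicy(const struct mempolicy_args *args)
{
	printf("mode=%u flags=0x%x home_node=%d\n",
	       args->mode, args->mode_flags, args->home_node);
	return 0;
}

int main(void)
{
	unsigned long nodes = 0x3;	/* nodes 0-1 */
	struct mempolicy_args args;

	memset(&args, 0, sizeof(args));	/* unused fields default to 0 */
	args.mode = 3;			/* MPOL_INTERLEAVE in the uapi enum */
	args.policy_nodes = &nodes;

	return do_set_mempolicy(&args);
}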
From: Gregory Price
To: linux-mm@kvack.org
Subject: [PATCH v2 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use
Date: Sat, 9 Dec 2023 01:59:25 -0500
Message-Id: <20231209065931.3458-6-gregory.price@memverge.com>
In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com>
References: <20231209065931.3458-1-gregory.price@memverge.com>

Pull operation flag checking from inside do_get_mempolicy out to
kernel_get_mempolicy. This allows us to flatten the internal code and
break it into separate functions for future syscalls (get_mempolicy2,
process_get_mempolicy) to re-use the code, even after additional
extensions are made.

The primary change is that the flag is treated as the multiplexer that
it actually is. For get_mempolicy, the flags argument represents 3
different primary operations:

	if (flags & MPOL_F_MEMS_ALLOWED)
		return task->mems_allowed
	else if (flags & MPOL_F_ADDR)
		return vma mempolicy information
	else
		return task mempolicy information

Plus the behavior-modifying flag:

	if (flags & MPOL_F_NODE)
		change the return value of (int __user *policy)
		based on whether MPOL_F_ADDR was set

The original behavior of get_mempolicy is retained, but we utilize the
new mempolicy_args structure to pass the operations down the stack.
This will allow us to extend the internal functions without affecting
the legacy behavior of get_mempolicy.
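For reference, a minimal userspace sketch (not part of this patch) of
the three get_mempolicy(2) operations described above. The flag values
mirror include/uapi/linux/mempolicy.h; error handling is omitted for
brevity.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define MPOL_F_NODE		(1 << 0)
#define MPOL_F_ADDR		(1 << 1)
#define MPOL_F_MEMS_ALLOWED	(1 << 2)

int main(void)
{
	unsigned long mask = 0;
	int mode = 0, node = 0;
	void *buf = malloc(4096);

	memset(buf, 0, 4096);	/* make sure the page is actually present */

	/* 1) which nodes the task may allocate from */
	syscall(SYS_get_mempolicy, NULL, &mask, 8 * sizeof(mask), NULL,
		MPOL_F_MEMS_ALLOWED);
	printf("mems_allowed mask: 0x%lx\n", mask);

	/* 2) the policy of the VMA backing an address */
	syscall(SYS_get_mempolicy, &mode, NULL, 0, buf, MPOL_F_ADDR);
	printf("vma policy mode: %d\n", mode);

	/* 3) with MPOL_F_NODE: the node the address was allocated on */
	syscall(SYS_get_mempolicy, &node, NULL, 0, buf,
		MPOL_F_ADDR | MPOL_F_NODE);
	printf("backing node: %d\n", node);

	free(buf);
	return 0;
}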
Signed-off-by: Gregory Price Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Michal Hocko Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Vinicius Tavares Petrucci Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- mm/mempolicy.c | 240 ++++++++++++++++++++++++++++++------------------- 1 file changed, 150 insertions(+), 90 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 324dbf1782df..ce5b7963e9b5 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -895,106 +895,107 @@ static int lookup_node(struct mm_struct *mm, unsign= ed long addr) return ret; } =20 -/* Retrieve NUMA policy */ -static long do_get_mempolicy(int *policy, nodemask_t *nmask, - unsigned long addr, unsigned long flags) +/* Retrieve the mems_allowed for current task */ +static inline long do_get_mems_allowed(nodemask_t *nmask) { - int err; - struct mm_struct *mm =3D current->mm; - struct vm_area_struct *vma =3D NULL; - struct mempolicy *pol =3D current->mempolicy, *pol_refcount =3D NULL; + task_lock(current); + *nmask =3D cpuset_current_mems_allowed; + task_unlock(current); + return 0; +} =20 - if (flags & - ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED)) - return -EINVAL; +/* If the policy has additional node information to retrieve, return it */ +static long do_get_policy_node(struct mempolicy *pol) +{ + /* + * For MPOL_INTERLEAVE, the extended node information is the next + * node that will be selected for interleave. For weighted interleave + * we return the next node based on the current weight. + */ + if (pol =3D=3D current->mempolicy && pol->mode =3D=3D MPOL_INTERLEAVE) + return next_node_in(current->il_prev, pol->nodes); =20 - if (flags & MPOL_F_MEMS_ALLOWED) { - if (flags & (MPOL_F_NODE|MPOL_F_ADDR)) - return -EINVAL; - *policy =3D 0; /* just so it's initialized */ + if (pol =3D=3D current->mempolicy && + pol->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE) { + if (pol->wil.cur_weight) + return current->il_prev; + else + return next_node_in(current->il_prev, pol->nodes); + } + return -EINVAL; +} + +/* Handle user_nodemask condition when fetching nodemask for userspace */ +static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *n= mask) +{ + if (mpol_store_user_nodemask(pol)) { + *nmask =3D pol->w.user_nodemask; + } else { task_lock(current); - *nmask =3D cpuset_current_mems_allowed; + get_policy_nodemask(pol, nmask); task_unlock(current); - return 0; } +} =20 - if (flags & MPOL_F_ADDR) { - pgoff_t ilx; /* ignored here */ - /* - * Do NOT fall back to task policy if the - * vma/shared policy at addr is NULL. We - * want to return MPOL_DEFAULT in this case. 
- */ - mmap_read_lock(mm); - vma =3D vma_lookup(mm, addr); - if (!vma) { - mmap_read_unlock(mm); - return -EFAULT; - } - pol =3D __get_vma_policy(vma, addr, &ilx); - } else if (addr) - return -EINVAL; +/* Retrieve NUMA policy for a VMA assocated with a given address */ +static long do_get_vma_mempolicy(struct mempolicy_args *args) +{ + pgoff_t ilx; + struct mm_struct *mm =3D current->mm; + struct vm_area_struct *vma =3D NULL; + struct mempolicy *pol =3D NULL; =20 + mmap_read_lock(mm); + vma =3D vma_lookup(mm, args->addr); + if (!vma) { + mmap_read_unlock(mm); + return -EFAULT; + } + pol =3D __get_vma_policy(vma, args->addr, &ilx); if (!pol) - pol =3D &default_policy; /* indicates default behavior */ + pol =3D &default_policy; + /* this may cause a double-reference, resolved by a put+cond_put */ + mpol_get(pol); + mmap_read_unlock(mm); =20 - if (flags & MPOL_F_NODE) { - if (flags & MPOL_F_ADDR) { - /* - * Take a refcount on the mpol, because we are about to - * drop the mmap_lock, after which only "pol" remains - * valid, "vma" is stale. - */ - pol_refcount =3D pol; - vma =3D NULL; - mpol_get(pol); - mmap_read_unlock(mm); - err =3D lookup_node(mm, addr); - if (err < 0) - goto out; - *policy =3D err; - } else if (pol =3D=3D current->mempolicy && - pol->mode =3D=3D MPOL_INTERLEAVE) { - *policy =3D next_node_in(current->il_prev, pol->nodes); - } else if (pol =3D=3D current->mempolicy && - (pol->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE)) { - if (pol->wil.cur_weight) - *policy =3D current->il_prev; - else - *policy =3D next_node_in(current->il_prev, - pol->nodes); - } else { - err =3D -EINVAL; - goto out; - } - } else { - *policy =3D pol =3D=3D &default_policy ? MPOL_DEFAULT : - pol->mode; - /* - * Internal mempolicy flags must be masked off before exposing - * the policy to userspace. - */ - *policy |=3D (pol->flags & MPOL_MODE_FLAGS); - } + /* Fetch the node for the given address */ + args->addr_node =3D lookup_node(mm, args->addr); =20 - err =3D 0; - if (nmask) { - if (mpol_store_user_nodemask(pol)) { - *nmask =3D pol->w.user_nodemask; - } else { - task_lock(current); - get_policy_nodemask(pol, nmask); - task_unlock(current); - } + args->mode =3D pol =3D=3D &default_policy ? MPOL_DEFAULT : pol->mode; + args->mode_flags =3D (pol->flags & MPOL_MODE_FLAGS); + + /* If this policy has extra node info, fetch that */ + args->policy_node =3D do_get_policy_node(pol); + + if (args->policy_nodes) + do_get_mempolicy_nodemask(pol, args->policy_nodes); + + if (pol !=3D &default_policy) { + mpol_put(pol); + mpol_cond_put(pol); } =20 - out: - mpol_cond_put(pol); - if (vma) - mmap_read_unlock(mm); - if (pol_refcount) - mpol_put(pol_refcount); - return err; + return 0; +} + +/* Retrieve NUMA policy for the current task */ +static long do_get_task_mempolicy(struct mempolicy_args *args) +{ + struct mempolicy *pol =3D current->mempolicy; + + if (!pol) + pol =3D &default_policy; /* indicates default behavior */ + + args->mode =3D pol =3D=3D &default_policy ? 
MPOL_DEFAULT : pol->mode; + /* Internal flags must be masked off before exposing to userspace */ + args->mode_flags =3D (pol->flags & MPOL_MODE_FLAGS); + + args->policy_node =3D do_get_policy_node(pol); + + if (args->policy_nodes) + do_get_mempolicy_nodemask(pol, args->policy_nodes); + + return 0; } =20 #ifdef CONFIG_MIGRATION @@ -1731,16 +1732,75 @@ static int kernel_get_mempolicy(int __user *policy, unsigned long addr, unsigned long flags) { + struct mempolicy_args args; int err; - int pval; + int pval =3D 0; nodemask_t nodes; =20 if (nmask !=3D NULL && maxnode < nr_node_ids) return -EINVAL; =20 - addr =3D untagged_addr(addr); + if (flags & + ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED)) + return -EINVAL; =20 - err =3D do_get_mempolicy(&pval, &nodes, addr, flags); + /* Ensure any data that may be copied to userland is initialized */ + memset(&args, 0, sizeof(args)); + args.policy_nodes =3D &nodes; + args.addr =3D untagged_addr(addr); + + /* + * set_mempolicy was originally multiplexed based on 3 flags: + * MPOL_F_MEMS_ALLOWED: fetch task->mems_allowed + * MPOL_F_ADDR : operate on vma->mempolicy + * MPOL_F_NODE : change return value of *policy + * + * Split this behavior out here, rather than internal functions, + * so that the internal functions can be re-used by future + * get_mempolicy2 interfaces and the arg structure made extensible + */ + if (flags & MPOL_F_MEMS_ALLOWED) { + if (flags & (MPOL_F_NODE|MPOL_F_ADDR)) + return -EINVAL; + pval =3D 0; /* just so it's initialized */ + err =3D do_get_mems_allowed(&nodes); + } else if (flags & MPOL_F_ADDR) { + /* If F_ADDR, we operation on a vma policy (or default) */ + err =3D do_get_vma_mempolicy(&args); + if (err) + return err; + /* if (F_ADDR | F_NODE), *pval is the address' node */ + if (flags & MPOL_F_NODE) { + /* if we failed to fetch, that's likely an EFAULT */ + if (args.addr_node < 0) + return args.addr_node; + pval =3D args.addr_node; + } else + pval =3D args.mode | args.mode_flags; + } else { + /* if not F_ADDR and addr !=3D null, EINVAL */ + if (addr) + return -EINVAL; + + err =3D do_get_task_mempolicy(&args); + if (err) + return err; + /* + * if F_NODE was set and mode was MPOL_INTERLEAVE + * *pval is equal to next interleave node. + * + * if args.policy_node < 0, this means the mode did + * not have a policy. 
This presently emulates the + * original behavior of (F_NODE) & (!MPOL_INTERLEAVE) + * producing -EINVAL + */ + if (flags & MPOL_F_NODE) { + if (args.policy_node < 0) + return args.policy_node; + pval =3D args.policy_node; + } else + pval =3D args.mode | args.mode_flags; + } =20 if (err) return err; --=20 2.39.1 From nobody Fri Dec 19 11:34:02 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 99552C4167B for ; Sat, 9 Dec 2023 07:00:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229492AbjLIHAL (ORCPT ); Sat, 9 Dec 2023 02:00:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43688 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234441AbjLIG7q (ORCPT ); Sat, 9 Dec 2023 01:59:46 -0500 Received: from mail-yb1-xb43.google.com (mail-yb1-xb43.google.com [IPv6:2607:f8b0:4864:20::b43]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C37FB1994; Fri, 8 Dec 2023 22:59:47 -0800 (PST) Received: by mail-yb1-xb43.google.com with SMTP id 3f1490d57ef6-d9caf5cc948so2900852276.0; Fri, 08 Dec 2023 22:59:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1702105187; x=1702709987; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=5fJWjKBdWhZssADuPucbwh57xIICZA8yt0Ti8FZ2CXY=; b=JkGhhDNEe7KJuLH8UcKhvnje9xzLzaxjvm4D8RBuCCbsKQU+BFvnoWOG7DJoKCc1oG iU2u/NjEiRP7gU5SLHjx8PBLtwE6BC1S6Vwt8Aqas1Iqhz6EG4sN3V18HZ7F7mc5RmF7 oKeEHBJEPuh2x+3Q9qGMgP55UZHpGHMtPWidqQxIlWyOE84hxQXOMsIOYKmXO/cj1Exn ZCgAmlXMne/M4czLx5UtWOXTfZbPifLxaZGUXiSZtTCrtJwb7ZXCk4KjhAp3gnqCr2Dp bTIGx0Y2NRl25jm6+XD6xSURKKlo+gBsbVVXIqAM99azbOuvzc3uiN+B/VY9cM8ysO3v juoQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702105187; x=1702709987; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=5fJWjKBdWhZssADuPucbwh57xIICZA8yt0Ti8FZ2CXY=; b=xNW3moEE3NmusDVyZPSb4ympSqzl15Il3p5wbaa76HgskamWKndxnuoomGqcZz0xIA omsdTab1SL1RWhs8sXT4oT9P9prRrfjYbkOqhVRb8kbOvMRmOJYXDu6/G/mGKD3ml0NC DDY5ThptZalj7/Y4qFvHNWZhpeeMAc2GDYD0463LSuyGcKYHw8liQy/axYFjrtpJ07cX M0ZI0SILYk9SprQwe/bItikgTlCwSMYSvoTgySouLkeZdPw+4iSV6bm3CUq6zHy0m32/ U1loRiicYGwzEPdFujjvoGKHT1z7aTxENbsvu+Rp83Eunf97gaqFqN7Z4pzQoSiSdfmb F8SA== X-Gm-Message-State: AOJu0Yy5/TZEDS3eZO/90akBJ/3CSIQE9QDYAUwlOSMmjioJOu5tXCFI WAOF2rSgXI4v540eC5xCCQ== X-Google-Smtp-Source: AGHT+IHF8XUqVMPhCWme+bTJUc1wnYluDhMFCId+PO64vQhrQCMsy2CrIazKg3YTeo7a1cynsK6z/A== X-Received: by 2002:a0d:d48c:0:b0:5d3:e835:bd67 with SMTP id w134-20020a0dd48c000000b005d3e835bd67mr948296ywd.41.1702105186764; Fri, 08 Dec 2023 22:59:46 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. 
[173.79.56.208]) by smtp.gmail.com with ESMTPSA id x8-20020a81b048000000b005df5d592244sm326530ywk.78.2023.12.08.22.59.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 08 Dec 2023 22:59:46 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com Subject: [PATCH v2 06/11] mm/mempolicy: allow home_node to be set by mpol_new Date: Sat, 9 Dec 2023 01:59:26 -0500 Message-Id: <20231209065931.3458-7-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com> References: <20231209065931.3458-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" This patch adds the plumbing into mpol_new() to allow the argument structure's home_node field to set mempolicy home node. The syscall sys_set_mempolicy_home_node was added to allow a home node to be registered for a vma. For set_mempolicy2 and mbind2 syscalls, it would be useful to add this as an extension to allow the user to submit a fully formed mempolicy configuration in a single call, rather than require multiple calls to configure a mempolicy. This will become particularly useful if/when pidfd interfaces to change process mempolicies from outside the task appear, as each call to change the mempolicy does an atomic swap of that policy in the task, rather than mutate the policy. 
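For context only (not part of this patch), the two-call sequence that a home_node-aware argument structure would fold into a single call looks roughly like this from userspace; a sketch assuming kernel headers new enough to define __NR_set_mempolicy_home_node:

    #define _GNU_SOURCE
    #include <numaif.h>        /* mbind(), MPOL_BIND */
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Today: bind a range, then register its home node with a second syscall. */
    static long bind_range_with_home_node(void *start, unsigned long len,
                                          unsigned long *nodemask,
                                          unsigned long maxnode, int home_node)
    {
        if (mbind(start, len, MPOL_BIND, nodemask, maxnode, 0))
            return -1;
        return syscall(__NR_set_mempolicy_home_node,
                       (unsigned long)start, len,
                       (unsigned long)home_node, 0);
    }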
Signed-off-by: Gregory Price Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Michal Hocko Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Vinicius Tavares Petrucci Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- mm/mempolicy.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index ce5b7963e9b5..446167dcebdc 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -308,6 +308,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args= *args) policy->flags =3D flags; policy->home_node =3D NUMA_NO_NODE; policy->wil.cur_weight =3D 0; + policy->home_node =3D args->home_node; =20 return policy; } @@ -1621,6 +1622,7 @@ static long kernel_set_mempolicy(int mode, const unsi= gned long __user *nmask, args.mode =3D lmode; args.mode_flags =3D mode_flags; args.policy_nodes =3D &nodes; + args.home_node =3D NUMA_NO_NODE; =20 return do_set_mempolicy(&args); } @@ -2980,6 +2982,8 @@ void mpol_shared_policy_init(struct shared_policy *sp= , struct mempolicy *mpol) margs.mode =3D mpol->mode; margs.mode_flags =3D mpol->flags; margs.policy_nodes =3D &mpol->w.user_nodemask; + margs.home_node =3D NUMA_NO_NODE; + /* contextualize the tmpfs mount point mempolicy to this file */ npol =3D mpol_new(&margs); if (IS_ERR(npol)) @@ -3138,6 +3142,7 @@ void __init numa_policy_init(void) memset(&args, 0, sizeof(args)); args.mode =3D MPOL_INTERLEAVE; args.policy_nodes =3D &interleave_nodes; + args.home_node =3D NUMA_NO_NODE; =20 if (do_set_mempolicy(&args)) pr_err("%s: interleaving failed\n", __func__); @@ -3152,6 +3157,7 @@ void numa_default_policy(void) =20 memset(&args, 0, sizeof(args)); args.mode =3D MPOL_DEFAULT; + args.home_node =3D NUMA_NO_NODE; =20 do_set_mempolicy(&args); } @@ -3274,6 +3280,8 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) margs.mode =3D mode; margs.mode_flags =3D mode_flags; margs.policy_nodes =3D &nodes; + margs.home_node =3D NUMA_NO_NODE; + new =3D mpol_new(&margs); if (IS_ERR(new)) goto out; --=20 2.39.1 From nobody Fri Dec 19 11:34:02 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3FE1FC4167B for ; Sat, 9 Dec 2023 07:00:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234522AbjLIHAb (ORCPT ); Sat, 9 Dec 2023 02:00:31 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43828 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234493AbjLIHAD (ORCPT ); Sat, 9 Dec 2023 02:00:03 -0500 Received: from mail-qv1-xf41.google.com (mail-qv1-xf41.google.com [IPv6:2607:f8b0:4864:20::f41]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 08C9719AC; Fri, 8 Dec 2023 22:59:56 -0800 (PST) Received: by mail-qv1-xf41.google.com with SMTP id 6a1803df08f44-67adac40221so18310316d6.2; Fri, 08 Dec 2023 22:59:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1702105195; x=1702709995; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=/jpJNV10Y4Aq3La+SspO/hMcopvAOMbrsn8P/kGKMEU=; 
b=nEJuVWm9ToQyex+z6jFKMTLatqoMtbPQIl5prMit2E/NVm8EWHIUytJNBKYY+McwLX XVd3B4QVCDPkNXyjVkcS2ihPPV5Lg4tsAj4608p7blqgIQ1V0WgfqCYqRsjVbVvzSAnx BMJqglHYiFkZPXsIrYUqTrVdjspIl6aHoo7xfY77QMRGekkFsXvdh5EPtFZ50LavXaFt tvUqvZrqEg1trihThRi1e+SGCFPAXWw+1LeujEn+nmR0vXYxzIjxf1Cj6r2G513ioJ50 xUdNFnOzauGfbS8U2vQH5LJkp8hEmuLpBPfEsNvXwSdCINig3+Pkt+r2uc+HvIw4EBWv +bDg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702105195; x=1702709995; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=/jpJNV10Y4Aq3La+SspO/hMcopvAOMbrsn8P/kGKMEU=; b=RN49pTpSrhrFqB8DMYT3PJnWUIEYOTIEHvsOiRp2VnQhcK96hFnB1Wi45QpEwvHODD 0U0OwGXWgPd/ILAmb7QnLFYwVS683k5nMHytsfBvsNEfuZN5NUsoC+bx6/IVSH++7cIW 4E4pt38nMO4tm0obVfn4S/tJ3n5K87OPjq//4ihgA+s4LqulZNu/O8/AQFytZeVx2O/I 37yvBE+vL+kjIGmz1t5sO91DMwjlhM32TB1/MMY1TlmOBLKU0PXCP5xgSSpfTxnHMR2i wd7GEmctYZsOFAkSLYv/9ne0WynF3PJErZoUzn76wYk/qI2/iDS1YfPDSSXoWqypvGU2 AvQQ== X-Gm-Message-State: AOJu0YzwC/Ekalt84X4tRs4Z/hZtWeXSUzM3L+NoSxrO9+bP95Lmj3h3 2qOjFEowPPw9eU0tF0mtxw== X-Google-Smtp-Source: AGHT+IGq7B5WE+Xyw4YblcWAaIb3zQDtffG9M6pUeoRlky0GQTbpT4Cg4FD2j38+EVMPor4kTHolGw== X-Received: by 2002:a05:620a:2a10:b0:77f:3161:9147 with SMTP id o16-20020a05620a2a1000b0077f31619147mr1748373qkp.19.1702105195576; Fri, 08 Dec 2023 22:59:55 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id x8-20020a81b048000000b005df5d592244sm326530ywk.78.2023.12.08.22.59.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 08 Dec 2023 22:59:55 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, Frank van der Linden Subject: [PATCH v2 07/11] mm/mempolicy: add userland mempolicy arg structure Date: Sat, 9 Dec 2023 01:59:27 -0500 Message-Id: <20231209065931.3458-8-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com> References: <20231209065931.3458-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" This patch adds the new user-api argument structure intended for set_mempolicy2 and mbind2. struct mpol_args { __u16 mode; __u16 mode_flags; __s32 home_node; /* mbind2: policy home node */ __aligned_u64 *pol_nodes; __u64 pol_maxnodes; __u64 addr; /* get_mempolicy: policy address */ __s32 policy_node; /* get_mempolicy: policy node info */ __s32 addr_node; /* get_mempolicy: memory range policy */ }; This structure is intended to be extensible as new mempolicy extensions are added. 
For example, set_mempolicy_home_node was added to allow vma mempolicies to have a preferred/home node assigned. This structure allows the addition of that setting at the time the mempolicy is set, rather than requiring additional calls to modify the policy. Full breakdown of arguments as of this patch: mode: Mempolicy mode (MPOL_DEFAULT, MPOL_INTERLEAVE) mode_flags: Flags previously or'd into mode in set_mempolicy (e.g.: MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES) home_node: for mbind2. Allows the setting of a policy's home with the use of MPOL_MF_HOME_NODE pol_nodes: Policy nodemask pol_maxnodes: Max number of nodes in the policy nodemask policy_node: for get_mempolicy2. Returns extended information about a policy that was previously reported by passing MPOL_F_NODE to get_mempolicy. Instead of overriding the mode value, simply add a field. addr: for get_mempolicy2. Used with MPOL_F_ADDR to run get_mempolicy against the vma the address belongs to instead of the task. addr_node: for get_mempolicy2. Returns the node the address belongs to. Previously get_mempolicy() would override the output value of (mode) if MPOL_F_ADDR and MPOL_F_NODE were set. Instead, we extend mpol_args to do this by default if MPOL_F_ADDR is set and do away with MPOL_F_NODE. Suggested-by: Frank van der Linden Suggested-by: Vinicius Tavares Petrucci Suggested-by: Hasan Al Maruf Signed-off-by: Gregory Price Co-developed-by: Vinicius Tavares Petrucci Signed-off-by: Vinicius Tavares Petrucci Suggested-by: Dan Williams Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Michal Hocko Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- .../admin-guide/mm/numa_memory_policy.rst | 20 +++++++++++++++++++ include/uapi/linux/mempolicy.h | 12 +++++++++++ 2 files changed, 32 insertions(+) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Document= ation/admin-guide/mm/numa_memory_policy.rst index d2c8e712785b..64c5804dc40f 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -482,6 +482,26 @@ closest to which page allocation will come from. Speci= fying the home node overri the default allocation policy to allocate memory close to the local node f= or an executing CPU. =20 +Extended Mempolicy Arguments:: + + struct mpol_args { + __u16 mode; + __u16 mode_flags; + __s32 home_node; /* mbind2: policy home node */ + __aligned_u64 pol_nodes; /* nodemask pointer */ + __u64 pol_maxnodes; + __u64 addr; /* get_mempolicy2: policy address */ + __s32 policy_node; /* get_mempolicy2: policy node information */ + __s32 addr_node; /* get_mempolicy2: memory range policy */ + }; + +The extended mempolicy argument structure is defined to allow the mempolicy +interfaces future extensibility without the need for additional system cal= ls. + +The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to +all interfaces relative to their non-extended counterparts. Each additional +field may only apply to specific extended interfaces. See the respective +extended interface man page for more details. 
=20 Memory Policy Command Line Interface =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 1f9bb10d1a47..00a673e30047 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -27,6 +27,18 @@ enum { MPOL_MAX, /* always last member of enum */ }; =20 +struct mpol_args { + /* Basic mempolicy settings */ + __u16 mode; + __u16 mode_flags; + __s32 home_node; /* mbind2: policy home node */ + __aligned_u64 pol_nodes; + __u64 pol_maxnodes; + __u64 addr; + __s32 policy_node; /* get_mempolicy: policy node info */ + __s32 addr_node; /* get_mempolicy: memory range policy */ +}; + /* Flags for set_mempolicy */ #define MPOL_F_STATIC_NODES (1 << 15) #define MPOL_F_RELATIVE_NODES (1 << 14) --=20 2.39.1 From nobody Fri Dec 19 11:34:02 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 63AEFC4167B for ; Sat, 9 Dec 2023 07:00:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236033AbjLIHAk (ORCPT ); Sat, 9 Dec 2023 02:00:40 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38826 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234377AbjLIHAU (ORCPT ); Sat, 9 Dec 2023 02:00:20 -0500 Received: from mail-oi1-x241.google.com (mail-oi1-x241.google.com [IPv6:2607:f8b0:4864:20::241]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D6C4E1BCF; Fri, 8 Dec 2023 23:00:00 -0800 (PST) Received: by mail-oi1-x241.google.com with SMTP id 5614622812f47-3b9e1a3e3f0so1535775b6e.1; Fri, 08 Dec 2023 23:00:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1702105200; x=1702710000; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Qh3VQET8gdSfaH63bKHpkp7oSbUAim1IJTPPsejDsgw=; b=gp8UQYH7Vn0VPLF+VbLzfNdz5SrBAI7IRLk6PIfbC8tB5kWsJbmJlbZA4zNQHRwpV/ /RiN+9OLBYJERatXDLF3g3kucWKtqxSIkqdvYEeaBWF7kGjyGEK8IXzRJpnzvVfFju8a WMCBwP3n2TavAj+QoueAR6TkC/12oo2BdLc8wMT7ToJnwqj/zDwR+KShP6vx6a02EbZJ DwTWVoMwGZKMZHLbGpFn1u61mGgvvVVnOe4BOJ9rJUZZ21+e+8Sv0r7xu4uFkkTyV0jC Wmy9gE4bbbx/pMA7ruv4/COzS1nGevGEhDljOK6+8tprvOvygsys4OrHQDHJdrjGhf62 5NFw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702105200; x=1702710000; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Qh3VQET8gdSfaH63bKHpkp7oSbUAim1IJTPPsejDsgw=; b=OkS9cDbT96ul0COD7WUXotvQSQneO/nYNlx9Jy/4kJ+3M7+cphN72H0DURXPJPN5tF q62Ecf+jWW/kNKzhTYync26uzazEI+ISehaLRtW9qX+bPZjapCu5v8/6Q90DorQShRhX HRPAQbD71bTtoWAMj4aShZ/eXzD/CClhw4PBsPASKaMsOA4KuNk2q5MHwjgNejPaAmW0 K67gOoL3ma81RMZoALZVx30Hw6lXUTqDx2ELg5kjoh3fQN6+6AvrL6p4OhYW5zBvNEnD MRhiNzkg6KFepOXYiWIBfTP467s4KgPkwEfZMWtgw5zAH3srsQ01GeMXo3PGyf12Ah0F TeqA== X-Gm-Message-State: AOJu0Yz+SCHJIEwcgUMv7sBQj750xU0hFDSYgUyWZrhhLuVuRMkLJ+bP jf+7dd7hYy87B9KCFqroBw== X-Google-Smtp-Source: AGHT+IG2mj1pOh1HHLifMkPv5gob2juFXWv2nAjPYtFYlMm5LsHYKCXmC/3wmfCJb7v1aXs04oFSqQ== X-Received: by 2002:a05:6808:114a:b0:3b8:39d9:90fd with SMTP id u10-20020a056808114a00b003b839d990fdmr1809704oiu.39.1702105199992; 
Fri, 08 Dec 2023 22:59:59 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id x8-20020a81b048000000b005df5d592244sm326530ywk.78.2023.12.08.22.59.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 08 Dec 2023 22:59:59 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, Michal Hocko Subject: [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall Date: Sat, 9 Dec 2023 01:59:28 -0500 Message-Id: <20231209065931.3458-9-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com> References: <20231209065931.3458-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" set_mempolicy2 is an extensible set_mempolicy interface which allows a user to set the per-task memory policy. Defined as: set_mempolicy2(struct mpol_args *args, size_t size, unsigned long flags); relevant mpol_args fields include the following: mode: The MPOL_* policy (DEFAULT, INTERLEAVE, etc.) mode_flags: The MPOL_F_* flags that were previously passed in or'd into the mode. This was split to hopefully allow future extensions additional mode/flag space. pol_nodes: the nodemask to apply for the memory policy pol_maxnodes: The max number of nodes described by pol_nodes The usize arg is intended for the user to pass in sizeof(mpol_args) to allow forward/backward compatibility whenever possible. The flags argument is intended to future proof the syscall against future extensions which may require interpreting the arguments in the structure differently. Semantics of `set_mempolicy` are otherwise the same as `set_mempolicy` as of this patch. 
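To make the calling convention concrete, a hedged userspace sketch (not part of this patch) of invoking set_mempolicy2 through syscall(2), assuming uapi headers built with this series applied; 457 is the number proposed here for most architectures:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <linux/mempolicy.h>      /* struct mpol_args, MPOL_INTERLEAVE (with this series) */

    #ifndef __NR_set_mempolicy2
    #define __NR_set_mempolicy2 457   /* number proposed by this patch on most architectures */
    #endif

    static long set_interleave_policy2(unsigned long *nodemask, unsigned long maxnode)
    {
        struct mpol_args args;

        memset(&args, 0, sizeof(args));
        args.mode         = MPOL_INTERLEAVE;
        args.mode_flags   = 0;                 /* mode flags are no longer or'd into mode */
        args.pol_nodes    = (uintptr_t)nodemask;
        args.pol_maxnodes = maxnode;

        /* usize = sizeof(args) allows forward/backward struct compatibility */
        return syscall(__NR_set_mempolicy2, &args, sizeof(args), 0);
    }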
Suggested-by: Michal Hocko Signed-off-by: Gregory Price Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Vinicius Tavares Petrucci Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- .../admin-guide/mm/numa_memory_policy.rst | 10 ++++++ arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 2 ++ include/uapi/asm-generic/unistd.h | 4 ++- mm/mempolicy.c | 36 +++++++++++++++++++ 18 files changed, 65 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Document= ation/admin-guide/mm/numa_memory_policy.rst index 64c5804dc40f..aabc24db92d3 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -432,6 +432,8 @@ Set [Task] Memory Policy:: =20 long set_mempolicy(int mode, const unsigned long *nmask, unsigned long maxnode); + long set_mempolicy2(struct mpol_args args, size_t size, + unsigned long flags); =20 Set's the calling task's "task/process memory policy" to mode specified by the 'mode' argument and the set of nodes defined by @@ -440,6 +442,12 @@ specified by the 'mode' argument and the set of nodes = defined by 'mode' argument with the flag (for example: MPOL_INTERLEAVE | MPOL_F_STATIC_NODES). =20 +set_mempolicy2() is an extended version of set_mempolicy() capable +of setting a mempolicy which requires more information than can be +passed via get_mempolicy(). For example, weighted interleave with +task-local weights requires a weight array to be passed via the +'mpol_args->il_weights' argument in the 'struct mpol_args' arg. + See the set_mempolicy(2) man page for more details =20 =20 @@ -498,6 +506,8 @@ Extended Mempolicy Arguments:: The extended mempolicy argument structure is defined to allow the mempolicy interfaces future extensibility without the need for additional system cal= ls. =20 +Extended interfaces (set_mempolicy2) use this argument structure. + The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to all interfaces relative to their non-extended counterparts. Each additional field may only apply to specific extended interfaces. 
See the respective diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/sys= calls/syscall.tbl index 18c842ca6c32..0dc288a1118a 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -496,3 +496,4 @@ 564 common futex_wake sys_futex_wake 565 common futex_wait sys_futex_wait 566 common futex_requeue sys_futex_requeue +567 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 584f9528c996..50172ec0e1f5 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -470,3 +470,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/sysca= lls/syscall.tbl index 7a4b780e82cb..839d90c535f2 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -456,3 +456,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/= kernel/syscalls/syscall.tbl index 5b6a0b02b7de..567c8b883735 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -462,3 +462,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/s= yscalls/syscall_n32.tbl index a842b41c8e06..cc0640e16f2f 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -395,3 +395,4 @@ 454 n32 futex_wake sys_futex_wake 455 n32 futex_wait sys_futex_wait 456 n32 futex_requeue sys_futex_requeue +457 n32 set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/s= yscalls/syscall_o32.tbl index 525cc54bc63b..f7262fde98d9 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -444,3 +444,4 @@ 454 o32 futex_wake sys_futex_wake 455 o32 futex_wait sys_futex_wait 456 o32 futex_requeue sys_futex_requeue +457 o32 set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/s= yscalls/syscall.tbl index a47798fed54e..e10f0e8bd064 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -455,3 +455,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel= /syscalls/syscall.tbl index 7fab411378f2..4f03f5f42b78 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -543,3 +543,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/sysca= lls/syscall.tbl index 86fec9b080f6..f98dadc2e9df 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -459,3 +459,4 @@ 454 common futex_wake sys_futex_wake sys_futex_wake 
455 common futex_wait sys_futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/= syscall.tbl index 363fae0fe9bf..f47ba9f2d05d 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -459,3 +459,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/sys= calls/syscall.tbl index 7bcaa3d5ea44..53fb16616728 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -502,3 +502,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscal= ls/syscall_32.tbl index c8fac5205803..4b4dc41b24ee 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -461,3 +461,4 @@ 454 i386 futex_wake sys_futex_wake 455 i386 futex_wait sys_futex_wait 456 i386 futex_requeue sys_futex_requeue +457 i386 set_mempolicy2 sys_set_mempolicy2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscal= ls/syscall_64.tbl index 8cb8bf68721c..1bc2190bec27 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -378,6 +378,7 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 =20 # # Due to a historical design error, certain syscalls are numbered differen= tly diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/s= yscalls/syscall.tbl index 06eefa9c1458..e26dc89399eb 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -427,3 +427,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common set_mempolicy2 sys_set_mempolicy2 diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index fd9d12de7e92..3244cd990858 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -822,6 +822,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy, unsigned long addr, unsigned long flags); asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nm= ask, unsigned long maxnode); +asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size, + unsigned long flags); asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode, const unsigned long __user *from, const unsigned long __user *to); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/u= nistd.h index 756b013fb832..55486aba099f 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake) __SYSCALL(__NR_futex_wait, sys_futex_wait) #define __NR_futex_requeue 456 __SYSCALL(__NR_futex_requeue, sys_futex_requeue) +#define __NR_set_mempolicy2 457 +__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2) =20 #undef __NR_syscalls -#define __NR_syscalls 457 +#define __NR_syscalls 458 =20 /* * 32 bit systems traditionally used different diff --git a/mm/mempolicy.c 
b/mm/mempolicy.c index 446167dcebdc..a56ff02f780e 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1633,6 +1633,42 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsi= gned long __user *, nmask, return kernel_set_mempolicy(mode, nmask, maxnode); } =20 +SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, = usize, + unsigned long, flags) +{ + struct mpol_args kargs; + struct mempolicy_args margs; + int err; + nodemask_t policy_nodemask; + unsigned long __user *nodes_ptr; + + if (flags) + return -EINVAL; + + err =3D copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize); + if (err) + return err; + + err =3D validate_mpol_flags(kargs.mode, &kargs.mode_flags); + if (err) + return err; + + memset(&margs, '\0', sizeof(margs)); + margs.mode =3D kargs.mode; + margs.mode_flags =3D kargs.mode_flags; + if (kargs.pol_nodes) { + nodes_ptr =3D u64_to_user_ptr(kargs.pol_nodes); + err =3D get_nodes(&policy_nodemask, nodes_ptr, + kargs.pol_maxnodes); + if (err) + return err; + margs.policy_nodes =3D &policy_nodemask; + } else + margs.policy_nodes =3D NULL; + + return do_set_mempolicy(&margs); +} + static int kernel_migrate_pages(pid_t pid, unsigned long maxnode, const unsigned long __user *old_nodes, const unsigned long __user *new_nodes) --=20 2.39.1 From nobody Fri Dec 19 11:34:02 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 644FAC4167B for ; Sat, 9 Dec 2023 07:01:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236043AbjLIHBG (ORCPT ); Sat, 9 Dec 2023 02:01:06 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42336 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234338AbjLIHAc (ORCPT ); Sat, 9 Dec 2023 02:00:32 -0500 Received: from mail-yw1-x1143.google.com (mail-yw1-x1143.google.com [IPv6:2607:f8b0:4864:20::1143]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0BCE71FE6; Fri, 8 Dec 2023 23:00:11 -0800 (PST) Received: by mail-yw1-x1143.google.com with SMTP id 00721157ae682-5d34f8f211fso27274947b3.0; Fri, 08 Dec 2023 23:00:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1702105211; x=1702710011; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=5jYajdpOXFi/B/dNFku3CaKz4MM7mFJ22/+jEV7wLMU=; b=W6j0+z2NcPqjon8XouGzXKB4+xH4cj7iaBeZvEKfXRPdLbDG/E+Fsln76VSowEZ5W9 ushC9QBXsJhzZmFfyEjES5jmYWGTeRlwXpUM/gtbKpqNrSqPidq7zmvxuoxVqwktSlK2 nwGUgGmTqpkUMKqww01mDlkAjIwBcFP/E1V3JfQWhTR9RzEfnQZEJhlRHVUICj42bxry Vyk2lLRx/aG2xww4mTC6D5CuzDPGcxsDSeIt6X0xWQf/i1f4weB4x3sODW/Osg3PXla2 XYvPAVtBFBc+Be8n995bkhr2Pl1UeEPEiBWoRhebK39TVEqk7HZcmdKYHD3SQ2Xm1lgE 62eQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702105211; x=1702710011; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=5jYajdpOXFi/B/dNFku3CaKz4MM7mFJ22/+jEV7wLMU=; b=pQYq4SqfooPGUwg/aQ54Tcpb2PkDdv2X6vfyEQgeEN7IER6V1lO6CQP/zZgHyYTUmU TEwo2ePurIE5Tagp9v3Rc/6ts9M0I9SeVgLCG1yMek/Mgs5T7wVuDbkYV9i7ZXh1BLpz sV+tS5ygkFr5ZPmDL/lv7f9lsIzHaQZ1iyF9mPT5V+3kGfdGeopBOSJAxbXG/aDCQB5Q 
2cR6EetNzN/qwFD2c90o35OQ3aLSTCSTPH0RJhkKn76ZJaNLntmwkYrqSk5O1IGsbBMd s1s6p7F+S4REvHbi+GAKeMQkHmukwIbokCDF0Z6V09A13fZZfsuvq5NZgrKdTD48kb32 nZxw== X-Gm-Message-State: AOJu0YwHXZnyDvlOQiCaJieXcPZG1Zkm7K9F6mD8CGteIpFd84kfuC1P Br/tzUrBun/UGPNk9QepWw== X-Google-Smtp-Source: AGHT+IF4HLQWZThd+U+NGJAyGUqIh0Nsy1W2x6Jmvx3yrk87lHkzFdoTufr1N2HCgYU1abpTa7t0eg== X-Received: by 2002:a0d:cc05:0:b0:5d6:aea0:2232 with SMTP id o5-20020a0dcc05000000b005d6aea02232mr1034156ywd.19.1702105210686; Fri, 08 Dec 2023 23:00:10 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id x8-20020a81b048000000b005df5d592244sm326530ywk.78.2023.12.08.23.00.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 08 Dec 2023 23:00:10 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, Michal Hocko Subject: [PATCH v2 09/11] mm/mempolicy: add get_mempolicy2 syscall Date: Sat, 9 Dec 2023 01:59:30 -0500 Message-Id: <20231209065931.3458-10-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com> References: <20231209065931.3458-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" get_mempolicy2 is an extensible get_mempolicy interface which allows a user to retrieve the memory policy for a task or address. Defined as: get_mempolicy2(struct mpol_args *args, size_t size, unsigned long flags) Input values include the following fields of mpol_args: pol_nodes: if set, the nodemask of the policy returned here pol_maxnodes: if pol_nodes is set, must describe max number of nodes to be copied to pol_nodes addr: if MPOL_F_ADDR is passed in `flags`, this address will be used to return the mempolicy details of the vma the address belongs to flags: if MPOL_F_MEMS_ALLOWED, returns mems_allowed in pol_nodes if MPOL_F_ADDR, return mempolicy info vma containing addr else, returns per-task mempolicy information Output values include the following fields of mpol_args: mode: mempolicy mode mode_flags: mempolicy mode flags pol_nodes: if set, the nodemask for the mempolicy policy_node: if the policy has extended node information, it will be placed here. For example MPOL_INTERLEAVE will return the next node which will be used for allocation addr_node: If MPOL_F_ADDR is set, the numa node that the address is located on will be returned. home_node: policy home node will be returned here, or -1 if not. MPOL_F_NODE has been dropped from get_mempolicy2 (it is ignored) in favor or returning explicit values in `policy_node` and `addr_node`. 
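To illustrate the calling convention (not part of this patch), a minimal userspace sketch of querying a VMA policy through get_mempolicy2, assuming uapi headers built with this series applied; 458 is the number proposed here for most architectures:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <linux/mempolicy.h>      /* struct mpol_args, MPOL_F_ADDR (with this series) */

    #ifndef __NR_get_mempolicy2
    #define __NR_get_mempolicy2 458   /* number proposed by this patch on most architectures */
    #endif

    /* Query the mempolicy of the VMA containing addr and the node addr resides on */
    static long query_vma_policy2(void *addr, unsigned long *nodemask,
                                  unsigned long maxnode)
    {
        struct mpol_args args;
        long err;

        memset(&args, 0, sizeof(args));
        args.pol_nodes    = (uintptr_t)nodemask;   /* optional: receive the policy nodemask */
        args.pol_maxnodes = maxnode;
        args.addr         = (uintptr_t)addr;

        err = syscall(__NR_get_mempolicy2, &args, sizeof(args), MPOL_F_ADDR);
        if (err)
            return err;

        /* mode, mode_flags, policy_node and addr_node are filled in on return */
        return args.addr_node;
    }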
Suggested-by: Michal Hocko Signed-off-by: Gregory Price Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Vinicius Tavares Petrucci Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- .../admin-guide/mm/numa_memory_policy.rst | 8 +++- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 4 +- mm/mempolicy.c | 47 +++++++++++++++++++ 18 files changed, 73 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Document= ation/admin-guide/mm/numa_memory_policy.rst index aabc24db92d3..a52624ab659a 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -456,11 +456,17 @@ Get [Task] Memory Policy or Related Information:: long get_mempolicy(int *mode, const unsigned long *nmask, unsigned long maxnode, void *addr, int flags); + long get_mempolicy2(struct mpol_args args, size_t size, + unsigned long flags); =20 Queries the "task/process memory policy" of the calling task, or the policy or location of a specified virtual address, depending on the 'flags' argument. =20 +get_mempolicy2() is an extended version of get_mempolicy() capable of +acquiring extended information about a mempolicy, including those +that can only be set via set_mempolicy2() or mbind2().. + See the get_mempolicy(2) man page for more details =20 =20 @@ -506,7 +512,7 @@ Extended Mempolicy Arguments:: The extended mempolicy argument structure is defined to allow the mempolicy interfaces future extensibility without the need for additional system cal= ls. =20 -Extended interfaces (set_mempolicy2) use this argument structure. +Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure. =20 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to all interfaces relative to their non-extended counterparts. 
Each additional diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/sys= calls/syscall.tbl index 0dc288a1118a..0301a8b0a262 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -497,3 +497,4 @@ 565 common futex_wait sys_futex_wait 566 common futex_requeue sys_futex_requeue 567 common set_mempolicy2 sys_set_mempolicy2 +568 common get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 50172ec0e1f5..771a33446e8e 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -471,3 +471,4 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/sysca= lls/syscall.tbl index 839d90c535f2..048a409e684c 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -457,3 +457,4 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/= kernel/syscalls/syscall.tbl index 567c8b883735..327b01bd6793 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -463,3 +463,4 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/s= yscalls/syscall_n32.tbl index cc0640e16f2f..921d58e1da23 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -396,3 +396,4 @@ 455 n32 futex_wait sys_futex_wait 456 n32 futex_requeue sys_futex_requeue 457 n32 set_mempolicy2 sys_set_mempolicy2 +458 n32 get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/s= yscalls/syscall_o32.tbl index f7262fde98d9..9271c83c9993 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -445,3 +445,4 @@ 455 o32 futex_wait sys_futex_wait 456 o32 futex_requeue sys_futex_requeue 457 o32 set_mempolicy2 sys_set_mempolicy2 +458 o32 get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/s= yscalls/syscall.tbl index e10f0e8bd064..0654f3f89fc7 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -456,3 +456,4 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel= /syscalls/syscall.tbl index 4f03f5f42b78..ac11d2064e7a 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -544,3 +544,4 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/sysca= lls/syscall.tbl index f98dadc2e9df..1cdcafe1ccca 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -460,3 
+460,4 @@ 455 common futex_wait sys_futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/= syscall.tbl index f47ba9f2d05d..f71742024c29 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -460,3 +460,4 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/sys= calls/syscall.tbl index 53fb16616728..2fbf5dbe0620 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -503,3 +503,4 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscal= ls/syscall_32.tbl index 4b4dc41b24ee..0af813b9a118 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -462,3 +462,4 @@ 455 i386 futex_wait sys_futex_wait 456 i386 futex_requeue sys_futex_requeue 457 i386 set_mempolicy2 sys_set_mempolicy2 +458 i386 get_mempolicy2 sys_get_mempolicy2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscal= ls/syscall_64.tbl index 1bc2190bec27..0b777876fc15 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -379,6 +379,7 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 =20 # # Due to a historical design error, certain syscalls are numbered differen= tly diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/s= yscalls/syscall.tbl index e26dc89399eb..4536c9a4227d 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -428,3 +428,4 @@ 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 +458 common get_mempolicy2 sys_get_mempolicy2 diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 3244cd990858..774512b7934e 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -820,6 +820,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy, unsigned long __user *nmask, unsigned long maxnode, unsigned long addr, unsigned long flags); +asmlinkage long sys_get_mempolicy2(struct mpol_args *args, size_t size, + unsigned long flags); asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nm= ask, unsigned long maxnode); asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/u= nistd.h index 55486aba099f..719accc731db 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -830,9 +830,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait) __SYSCALL(__NR_futex_requeue, sys_futex_requeue) #define __NR_set_mempolicy2 457 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2) +#define __NR_get_mempolicy2 458 +__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2) =20 #undef __NR_syscalls -#define __NR_syscalls 458 
+#define __NR_syscalls 459 =20 /* * 32 bit systems traditionally used different diff --git a/mm/mempolicy.c b/mm/mempolicy.c index a56ff02f780e..cfe22156ef13 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1859,6 +1859,53 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy, return kernel_get_mempolicy(policy, nmask, maxnode, addr, flags); } =20 +SYSCALL_DEFINE3(get_mempolicy2, struct mpol_args __user *, uargs, size_t, = usize, + unsigned long, flags) +{ + struct mpol_args kargs; + struct mempolicy_args margs; + int err; + nodemask_t policy_nodemask; + unsigned long __user *nodes_ptr; + + err =3D copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize); + if (err) + return -EINVAL; + + if (flags & MPOL_F_MEMS_ALLOWED) { + if (!margs.policy_nodes) + return -EINVAL; + err =3D do_get_mems_allowed(&policy_nodemask); + if (err) + return err; + nodes_ptr =3D u64_to_user_ptr(kargs.pol_nodes); + return copy_nodes_to_user(nodes_ptr, kargs.pol_maxnodes, + &policy_nodemask); + } + + margs.policy_nodes =3D kargs.pol_nodes ? &policy_nodemask : NULL; + if (flags & MPOL_F_ADDR) { + margs.addr =3D kargs.addr; + err =3D do_get_vma_mempolicy(&margs); + } else + err =3D do_get_task_mempolicy(&margs); + + if (err) + return err; + + kargs.mode =3D margs.mode; + kargs.mode_flags =3D margs.mode_flags; + kargs.policy_node =3D margs.policy_node; + kargs.addr_node =3D (flags & MPOL_F_ADDR) ? margs.addr_node : -1; + if (kargs.pol_nodes) { + nodes_ptr =3D u64_to_user_ptr(kargs.pol_nodes); + err =3D copy_nodes_to_user(nodes_ptr, kargs.pol_maxnodes, + margs.policy_nodes); + } + + return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0; +} + bool vma_migratable(struct vm_area_struct *vma) { if (vma->vm_flags & (VM_IO | VM_PFNMAP)) --=20 2.39.1 From nobody Fri Dec 19 11:34:02 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id F2CBDC4167B for ; Sat, 9 Dec 2023 07:01:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234397AbjLIHB1 (ORCPT ); Sat, 9 Dec 2023 02:01:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43922 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234330AbjLIHA6 (ORCPT ); Sat, 9 Dec 2023 02:00:58 -0500 Received: from mail-qk1-x744.google.com (mail-qk1-x744.google.com [IPv6:2607:f8b0:4864:20::744]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C38FE2136; Fri, 8 Dec 2023 23:00:21 -0800 (PST) Received: by mail-qk1-x744.google.com with SMTP id af79cd13be357-77f3183f012so139964285a.0; Fri, 08 Dec 2023 23:00:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1702105219; x=1702710019; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=5/XyqqBkg5fuo33Ip2SFhoThrNsBfzrYojUCpknb6qg=; b=lDOTJwVrqAP3j8VcOewwWbbPMbLSpkZNm5S059hEw7CTu0f+bC1HNRfF575B5zSwOF wFreCEoAp8e2De4S3NxFWiFrkjVRpf5AcWddoauLZYxSAD7eZbI5We1y1HvOn350W7wa NBJm0RZ6mNpMl2w9FGkDfFhE/1NJI9O/akPYgup5r3CSNSFsxnru79W5tZdKVELGTUcz cuUdFyg77j2ZBqty88M9WwsuAKrF4+gwA+8G6bfOjEqwfjWBnXpjIzYL5nrfIHiwEK2v bE4TlcVJtO6mQ+QHJUxYMmtREmkaYFJDStrqf3kMpHi0Y3qmrrQn87twtU6CLf5QD9R0 FmfA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702105219; x=1702710019; 
h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=5/XyqqBkg5fuo33Ip2SFhoThrNsBfzrYojUCpknb6qg=; b=I8mWKy1Q8aH4REf7z+oWBjTFjkpau16YKkhYr9HvjwWY+pGZQLm8EI1HISYT3YI6r+ PzGonL9WyZOjBqabL/bPT5I0ZGaJyVQpCaxgvorhkK+FWfZDugF9X1uyRCZ7pE6zT7qQ Fw8saKxNZTH7+tEzh5MQvmE454w1hmexVJCyeNSmwDXqBiZJ2zQHgnzo/eYZOD0Z5Fv7 yPPToq1UXrI4Fo5p/K+GYvf3BBhVqLYiQ8ybrI5ZTCIYrWykeVblYPacN8O/iRAr1nHH jbea5kW4ukLUPcmvwYq1vDy9DyGgjiGYbS74li/39F1Rklr1JsZzVRXqyDG5bjr8Dc1U hrbA== X-Gm-Message-State: AOJu0Yy5r+neIIPxWtZV6hmGdAklyDD/4Cwq7/3fvRSimdT8ADf7baDA JL6H8EobA6FAei8rJ/DfNg== X-Google-Smtp-Source: AGHT+IG+mvJKnPRrBL3r0GmZB6X9O+Z0T8mESY8jq/MrpvJQq54w3SN+fxjeQ0TUX61npkiogxX9JQ== X-Received: by 2002:a05:620a:2287:b0:77e:fba3:9383 with SMTP id o7-20020a05620a228700b0077efba39383mr1322911qkh.101.1702105219224; Fri, 08 Dec 2023 23:00:19 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id x8-20020a81b048000000b005df5d592244sm326530ywk.78.2023.12.08.23.00.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 08 Dec 2023 23:00:19 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, Michal Hocko , Frank van der Linden Subject: [PATCH v2 10/11] mm/mempolicy: add the mbind2 syscall Date: Sat, 9 Dec 2023 01:59:31 -0500 Message-Id: <20231209065931.3458-11-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com> References: <20231209065931.3458-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" mbind2 is an extensible mbind interface which allows a user to set the mempolicy for one or more address ranges. Defined as: mbind2(struct iovec *vec, size_t vlen, struct mpol_args *args, size_t size, unsigned long flags) The address ranges to operate on are passed as a separate vector rather than as fields of mpol_args: vec: the vector of (address, len) memory ranges to operate on vlen: the number of entries in vec Input values include the following fields of mpol_args: mode: The MPOL_* policy (DEFAULT, INTERLEAVE, etc.) mode_flags: The MPOL_F_* flags that were previously passed in or'd into the mode. This was split out to leave room for future mode and flag extensions. pol_nodes: the nodemask to apply for the memory policy pol_maxnodes: The max number of nodes described by pol_nodes home_node: if MPOL_MF_HOME_NODE is set in flags, set the home node of the policy to this value The semantics are otherwise the same as mbind(), except that the home_node can be set, and all address ranges defined by vec/vlen will be operated on.
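[Editor's illustration, not part of the patch.] A minimal userspace sketch of the intended calling convention follows. It assumes uapi headers from a tree with this series applied (struct mpol_args and the MPOL_* modes come from <linux/mempolicy.h> there), uses syscall number 459 from the syscall tables in this patch, and invokes the raw syscall because no libc wrapper exists; the local mbind2() wrapper, mapping sizes, and nodemask values are illustrative assumptions.

  /* Sketch: apply MPOL_INTERLEAVE across nodes 0-1 to two anonymous
   * mappings with a single mbind2() call (error handling kept minimal). */
  #define _GNU_SOURCE
  #include <linux/mempolicy.h>  /* struct mpol_args, MPOL_* (patched tree) */
  #include <stdint.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <sys/uio.h>
  #include <unistd.h>

  #ifndef __NR_mbind2
  #define __NR_mbind2 459       /* from the syscall tables in this patch */
  #endif

  static long mbind2(const struct iovec *vec, size_t vlen,
                     struct mpol_args *args, size_t size,
                     unsigned long flags)
  {
      return syscall(__NR_mbind2, vec, vlen, args, size, flags);
  }

  int main(void)
  {
      size_t len = 2UL << 20;
      void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      struct iovec vec[2] = { { a, len }, { b, len } };
      unsigned long nodemask = 0x3;   /* nodes 0 and 1 */
      struct mpol_args args;

      if (a == MAP_FAILED || b == MAP_FAILED)
          return 1;

      memset(&args, 0, sizeof(args));
      args.mode = MPOL_INTERLEAVE;
      args.pol_nodes = (uint64_t)(uintptr_t)&nodemask;
      args.pol_maxnodes = 2;
      /* home_node is only consulted when MPOL_MF_HOME_NODE is passed in
       * flags (described below); it is left at zero here. */

      return mbind2(vec, 2, &args, sizeof(args), 0) ? 1 : 0;
  }

Passing sizeof(args) as the size argument lets the kernel's copy_struct_from_user() tolerate older or newer struct layouts, which is the extensibility property the series relies on.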
Valid flags for mbind2 include the same flags as mbind, plus MPOL_MF_HOME_NODE, which informs the syscall to utilize the value of mpol_args->home_node to set the mempolicy home node. Suggested-by: Michal Hocko Suggested-by: Frank van der Linden Suggested-by: Vinicius Tavares Petrucci Suggested-by: Rakie Kim Suggested-by: Hyeongtak Ji Suggested-by: Honggyu Kim Signed-off-by: Gregory Price Co-developed-by: Vinicius Tavares Petrucci Suggested-by: Dan Williams Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- .../admin-guide/mm/numa_memory_policy.rst | 12 +++- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 3 + include/uapi/asm-generic/unistd.h | 4 +- include/uapi/linux/mempolicy.h | 5 +- mm/mempolicy.c | 68 +++++++++++++++++++ 19 files changed, 102 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Document= ation/admin-guide/mm/numa_memory_policy.rst index a52624ab659a..f1ba33de3a6e 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -475,12 +475,18 @@ Install VMA/Shared Policy for a Range of Task's Addre= ss Space:: long mbind(void *start, unsigned long len, int mode, const unsigned long *nmask, unsigned long maxnode, unsigned flags); + long mbind2(struct iovec *vec, size_t vlen, struct mpol_args args, + size_t size, unsigned long flags); =20 mbind() installs the policy specified by (mode, nmask, maxnodes) as a VMA policy for the range of the calling task's address space specified by the 'start' and 'len' arguments. Additional actions may be requested via the 'flags' argument. =20 +mbind2() is an extended version of mbind() capable of operating on multiple +memory ranges in one syscall, and which is capable of setting the home node +for the memory policy without an additional call to set_mempolicy_home_nod= e() + See the mbind(2) man page for more details. =20 Set home node for a Range of Task's Address Spacec:: @@ -496,6 +502,9 @@ closest to which page allocation will come from. Specif= ying the home node overri the default allocation policy to allocate memory close to the local node f= or an executing CPU. =20 +mbind2() also provides a way for the home node to be set at the time the +mempolicy is set. See the mbind(2) man page for more details. + Extended Mempolicy Arguments:: =20 struct mpol_args { @@ -512,7 +521,8 @@ Extended Mempolicy Arguments:: The extended mempolicy argument structure is defined to allow the mempolicy interfaces future extensibility without the need for additional system cal= ls. =20 -Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure. 
+Extended interfaces (set_mempolicy2, get_mempolicy2, and mbind2) use this +this argument structure. =20 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to all interfaces relative to their non-extended counterparts. Each additional diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/sys= calls/syscall.tbl index 0301a8b0a262..e8239293c35a 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -498,3 +498,4 @@ 566 common futex_requeue sys_futex_requeue 567 common set_mempolicy2 sys_set_mempolicy2 568 common get_mempolicy2 sys_get_mempolicy2 +569 common mbind2 sys_mbind2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 771a33446e8e..a3f39750257a 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -472,3 +472,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/sysca= lls/syscall.tbl index 048a409e684c..9a12dface18e 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -458,3 +458,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/= kernel/syscalls/syscall.tbl index 327b01bd6793..6cb740123137 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -464,3 +464,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/s= yscalls/syscall_n32.tbl index 921d58e1da23..52cf720f8ae2 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -397,3 +397,4 @@ 456 n32 futex_requeue sys_futex_requeue 457 n32 set_mempolicy2 sys_set_mempolicy2 458 n32 get_mempolicy2 sys_get_mempolicy2 +459 n32 mbind2 sys_mbind2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/s= yscalls/syscall_o32.tbl index 9271c83c9993..fd37c5301a48 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -446,3 +446,4 @@ 456 o32 futex_requeue sys_futex_requeue 457 o32 set_mempolicy2 sys_set_mempolicy2 458 o32 get_mempolicy2 sys_get_mempolicy2 +459 o32 mbind2 sys_mbind2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/s= yscalls/syscall.tbl index 0654f3f89fc7..fcd67bc405b1 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -457,3 +457,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel= /syscalls/syscall.tbl index ac11d2064e7a..89715417014c 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -545,3 +545,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl 
b/arch/s390/kernel/sysca= lls/syscall.tbl index 1cdcafe1ccca..c8304e0d0aa7 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -461,3 +461,4 @@ 456 common futex_requeue sys_futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 sys_mbind2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/= syscall.tbl index f71742024c29..e5c51b6c367f 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -461,3 +461,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/sys= calls/syscall.tbl index 2fbf5dbe0620..74527f585500 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -504,3 +504,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscal= ls/syscall_32.tbl index 0af813b9a118..be2e2aa17dd8 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -463,3 +463,4 @@ 456 i386 futex_requeue sys_futex_requeue 457 i386 set_mempolicy2 sys_set_mempolicy2 458 i386 get_mempolicy2 sys_get_mempolicy2 +459 i386 mbind2 sys_mbind2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscal= ls/syscall_64.tbl index 0b777876fc15..6e2347eb8773 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -380,6 +380,7 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 =20 # # Due to a historical design error, certain syscalls are numbered differen= tly diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/s= yscalls/syscall.tbl index 4536c9a4227d..f00a21317dc0 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -429,3 +429,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 774512b7934e..487dd9155b25 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -816,6 +816,9 @@ asmlinkage long sys_mbind(unsigned long start, unsigned= long len, const unsigned long __user *nmask, unsigned long maxnode, unsigned flags); +asmlinkage long sys_mbind2(const struct iovec __user *vec, size_t vlen, + const struct mpol_args __user *uargs, size_t usize, + unsigned long flags); asmlinkage long sys_get_mempolicy(int __user *policy, unsigned long __user *nmask, unsigned long maxnode, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/u= nistd.h index 719accc731db..cd31599bb9cc 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -832,9 +832,11 @@ __SYSCALL(__NR_futex_requeue, sys_futex_requeue) __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2) #define __NR_get_mempolicy2 458 __SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2) +#define __NR_mbind2 459 
+__SYSCALL(__NR_mbind2, sys_mbind2) =20 #undef __NR_syscalls -#define __NR_syscalls 459 +#define __NR_syscalls 460 =20 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 00a673e30047..506ea0f8f34e 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -56,13 +56,14 @@ struct mpol_args { #define MPOL_F_ADDR (1<<1) /* look up vma using address */ #define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */ =20 -/* Flags for mbind */ +/* Flags for mbind/mbind2 */ #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to policy */ #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */ #define MPOL_MF_LAZY (1<<3) /* UNSUPPORTED FLAG: Lazy migrate on fault */ -#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */ +#define MPOL_MF_HOME_NODE (1<<4) /* mbind2: set home node */ +#define MPOL_MF_INTERNAL (1<<5) /* Internal flags start here */ =20 #define MPOL_MF_VALID (MPOL_MF_STRICT | \ MPOL_MF_MOVE | \ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index cfe22156ef13..8f609204fbe7 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1600,6 +1600,74 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigne= d long, len, return kernel_mbind(start, len, mode, nmask, maxnode, flags); } =20 +SYSCALL_DEFINE5(mbind2, const struct iovec __user *, vec, size_t, vlen, + const struct mpol_args __user *, uargs, size_t, usize, + unsigned long, flags) +{ + struct mpol_args kargs; + struct mempolicy_args margs; + nodemask_t policy_nodes; + unsigned long __user *nodes_ptr; + struct iovec iovstack[UIO_FASTIOV]; + struct iovec *iov =3D iovstack; + struct iov_iter iter; + int err; + + if (!vec || !vlen) + return -EINVAL; + + err =3D copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize); + if (err) + return -EINVAL; + + err =3D validate_mpol_flags(kargs.mode, &kargs.mode_flags); + if (err) + return err; + + margs.mode =3D kargs.mode; + margs.mode_flags =3D kargs.mode_flags; + margs.addr =3D kargs.addr; + + /* if home node given, validate it is online */ + if (flags & MPOL_MF_HOME_NODE) { + if ((kargs.home_node >=3D MAX_NUMNODES) || + !node_online(kargs.home_node)) + return -EINVAL; + margs.home_node =3D kargs.home_node; + } else + margs.home_node =3D NUMA_NO_NODE; + flags &=3D ~MPOL_MF_HOME_NODE; + + if (kargs.pol_nodes) { + nodes_ptr =3D u64_to_user_ptr(kargs.pol_nodes); + err =3D get_nodes(&policy_nodes, nodes_ptr, + kargs.pol_maxnodes); + if (err) + return err; + margs.policy_nodes =3D &policy_nodes; + } else + margs.policy_nodes =3D NULL; + + /* For each address range in vector, do_mbind */ + err =3D import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov, + &iter); + if (err) + return err; + while (iov_iter_count(&iter)) { + unsigned long start, len; + + start =3D untagged_addr((unsigned long)iter_iov_addr(&iter)); + len =3D iter_iov_len(&iter); + err =3D do_mbind(start, len, &margs, flags); + if (err) + break; + iov_iter_advance(&iter, iter_iov_len(&iter)); + } + + kfree(iov); + return err; +} + /* Set the process memory policy */ static long kernel_set_mempolicy(int mode, const unsigned long __user *nma= sk, unsigned long maxnode) --=20 2.39.1 From nobody Fri Dec 19 11:34:02 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by 
smtp.lore.kernel.org (Postfix) with ESMTP id C745CC4167B for ; Sat, 9 Dec 2023 07:02:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234389AbjLIHBx (ORCPT ); Sat, 9 Dec 2023 02:01:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43808 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236000AbjLIHBY (ORCPT ); Sat, 9 Dec 2023 02:01:24 -0500 Received: from mail-qk1-x744.google.com (mail-qk1-x744.google.com [IPv6:2607:f8b0:4864:20::744]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D91C91BE7; Fri, 8 Dec 2023 23:00:57 -0800 (PST) Received: by mail-qk1-x744.google.com with SMTP id af79cd13be357-77f335002cfso150125585a.3; Fri, 08 Dec 2023 23:00:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1702105257; x=1702710057; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=f0/QsDUqVOH9BytpPr3Z53rXKvjz2DVO81ae22Cz4+M=; b=PSFHaeIsQxT3fPp6v+QLIMdBumke8orGESvleO4YlR2NesdtdLebrfIWTh3GeLJEoX GHnBrM44EsYOsOxWS2DPAQUD3S4sS98BtE1i6x/K7PoJ3OsePTMm5wObnmk5PqhdN7Ry VO0pteh6G1vz22Lr6bzCSXxyCy1fvwU7C2XM0GzcaAiQV0st8SAMFCRxEDvCx+MVsXNH cI5JSLqV1/V3u/nP180g4D21p0KRm05YqDidTURza/YQ5SRliY5ntxZRelzATan6aicq /VrjhodbVxHopin9rrR6Kjz7K9YteFqKZ1WDyjMrDLl0iR7PR63454iQptro2KgsaARP Gbvg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702105257; x=1702710057; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=f0/QsDUqVOH9BytpPr3Z53rXKvjz2DVO81ae22Cz4+M=; b=Bmzo/qnJSUGfxLkNJUz3akZt2g4PbZTTh/XfhqxJGeyKyJE4SyVsy2MB1xgZnpgkWn eMPvbaswRpcBfTRfb7JTLj6ayOQzFdcCxs6nh8BlGCxkU/AuYAMX27dwKMwpKjk6JuLi OvS9i3NS9upLLnQbi7wIawoA9xKwIb1ZHjzwAAa1LBRSs6u3NDXTBuYf9LWl4Czm9Vf2 B3FW6WSAYGQH7kIW8jFF7+e9nnw+yKcjKjWj6FfsfUl3u0oiYwmLirWhVf0BnqbbNrdd Mq34FVHK4HLJxdkC5B5irUgw8GYWND5kwWMg7MKnndjDw6kuoYTG84fYdGuZWzMAuw0o c9aw== X-Gm-Message-State: AOJu0YzldZiQ2qug1vLpg+iY/qz1fcPluH5iZVefyj+QEeq+DQsNKFfz FEkwoteq2psWl6zFfSpcSw== X-Google-Smtp-Source: AGHT+IHzzAJgvXXv0VkKyzYw1q1C6HSp7DMbEqYwJaHNuPCI+sjP7dceW1TF3ra6bNoategdmsIdnQ== X-Received: by 2002:a05:620a:370e:b0:77e:fba3:a23a with SMTP id de14-20020a05620a370e00b0077efba3a23amr1474903qkb.148.1702105256885; Fri, 08 Dec 2023 23:00:56 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. 
[173.79.56.208]) by smtp.gmail.com with ESMTPSA id x8-20020a81b048000000b005df5d592244sm326530ywk.78.2023.12.08.23.00.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 08 Dec 2023 23:00:56 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, Gregory Price Subject: [PATCH v2 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Date: Sat, 9 Dec 2023 01:59:33 -0500 Message-Id: <20231209065931.3458-12-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com> References: <20231209065931.3458-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" From: Gregory Price Extend set_mempolicy2 and mbind2 to support weighted interleave, and demonstrate the extensibility of the mpol_args structure. To support weighted interleave we add interleave weight fields to the following structures: Kernel Internal: (include/linux/mempolicy.h) struct mempolicy { /* task-local weights to apply to weighted interleave */ unsigned char weights[MAX_NUMNODES]; } struct mempolicy_args { /* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */ unsigned char *il_weights; /* of size MAX_NUMNODES */ } UAPI: (/include/uapi/linux/mempolicy.h) struct mpol_args { /* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */ unsigned char *il_weights; /* of size pol_max_nodes */ } The task-local weights are a single, one-dimensional array of weights that apply to all possible nodes on the system. If a node is set in the mempolicy nodemask, the weight in `il_weights` must be >=3D 1, otherwise set_mempolicy2() will return -EINVAL. If a node is not set in pol_nodemask, the weight will default to `1` in the task policy. The default value of `1` is required to handle the situation where a task migrates to a set of nodes for which weights were not set (up to and including the local numa node). For example, a migrated task whose nodemask changes entirely will have all its weights defaulted back to `1`, or if the nodemask changes to include a mix of nodes that were not previously accounted for - the weighted interleave may be suboptimal. If migrations are expected, a task should prefer not to use task-local interleave weights, and instead utilize the global settings for natural re-weighting on migration. To support global vs local weighting, we add the kernel-internal flag: MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */ This flag is set when il_weights is omitted by set_mempolicy2(), or when MPOL_WEIGHTED_INTERLEAVE is set by set_mempolicy(). 
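[Editor's illustration, not part of the patch.] To make the global vs. task-local distinction concrete, a minimal userspace sketch follows. It assumes uapi headers from a tree with this series applied and uses syscall number 457 for set_mempolicy2 from the syscall tables earlier in the series; the weight and nodemask values are illustrative assumptions.

  /* Sketch: weighted interleave across nodes 0 and 1 with task-local
   * weights 5 and 2. Leaving il_weights at zero would instead cause the
   * kernel to set MPOL_F_GWEIGHT and use the global sysfs weights. */
  #define _GNU_SOURCE
  #include <linux/mempolicy.h>  /* struct mpol_args, MPOL_WEIGHTED_INTERLEAVE */
  #include <stdint.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef __NR_set_mempolicy2
  #define __NR_set_mempolicy2 457  /* from the syscall tables in this series */
  #endif

  int main(void)
  {
      unsigned long nodemask = 0x3;          /* nodes 0 and 1 */
      unsigned char weights[2] = { 5, 2 };   /* node0: 5, node1: 2 */
      struct mpol_args args;

      memset(&args, 0, sizeof(args));
      args.mode = MPOL_WEIGHTED_INTERLEAVE;
      args.pol_nodes = (uint64_t)(uintptr_t)&nodemask;
      args.pol_maxnodes = 2;
      /* the kernel copies pol_maxnodes bytes of weights; every node set
       * in the nodemask must have a weight >= 1 */
      args.il_weights = (uint64_t)(uintptr_t)weights;

      return syscall(__NR_set_mempolicy2, &args, sizeof(args), 0) ? 1 : 0;
  }

A task that expects to be migrated would typically leave il_weights unset and rely on the global weights, per the note above.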
This internal mode_flag dictates whether global weights or task-local weights are utilized by the various weighted interleave functions: * weighted_interleave_nodes * weighted_interleave_nid * alloc_pages_bulk_array_weighted_interleave if (pol->flags & MPOL_F_GWEIGHT) pol_weights =3D iw_table; else pol_weights =3D pol->wil.weights; To simplify creation and duplication of mempolicies, the weights are added as a structure directly within mempolicy. This allows the existing logic in __mpol_dup to copy the weights without additional allocations: if (old =3D=3D current->mempolicy) { task_lock(current); *new =3D *old; task_unlock(current); } else *new =3D *old Suggested-by: Rakie Kim Suggested-by: Hyeongtak Ji Suggested-by: Honggyu Kim Suggested-by: Vinicius Tavares Petrucci Signed-off-by: Gregory Price Co-developed-by: Rakie Kim Signed-off-by: Rakie Kim Co-developed-by: Hyeongtak Ji Signed-off-by: Hyeongtak Ji Co-developed-by: Honggyu Kim Signed-off-by: Honggyu Kim Co-developed-by: Vinicius Tavares Petrucci Signed-off-by: Vinicius Tavares Petrucci Suggested-by: Dan Williams Suggested-by: Frank van der Linden Suggested-by: Gregory Price Suggested-by: Hao Wang Suggested-by: Hasan Al Maruf Suggested-by: Johannes Weiner Suggested-by: John Groves Suggested-by: Jonathan Cameron Suggested-by: Michal Hocko Suggested-by: Ravi Jonnalagadda Suggested-by: Srinivasulu Thanneeru Suggested-by: Ying Huang Suggested-by: Zhongkun He Suggested-by: tj --- .../admin-guide/mm/numa_memory_policy.rst | 10 ++ include/linux/mempolicy.h | 2 + include/uapi/linux/mempolicy.h | 2 + mm/mempolicy.c | 105 +++++++++++++++++- 4 files changed, 115 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Document= ation/admin-guide/mm/numa_memory_policy.rst index f1ba33de3a6e..84c076af74c3 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -254,6 +254,8 @@ MPOL_WEIGHTED_INTERLEAVE This mode operates the same as MPOL_INTERLEAVE, except that interleaving behavior is executed based on weights set in /sys/kernel/mm/mempolicy/weighted_interleave/ + when configured to utilize global weights, or based on task-local + weights configured with set_mempolicy2(2) or mbind2(2). =20 Weighted interleave allocations pages on nodes according to their weight. For example if nodes [0,1] are weighted [5,2] @@ -261,6 +263,13 @@ MPOL_WEIGHTED_INTERLEAVE 2 pages allocated on node1. This can better distribute data according to bandwidth on heterogeneous memory systems. =20 + When utilizing task-local weights, weights are not rebalanced + in the event of a task migration. If a weight has not been + explicitly set for a node set in the new nodemask, the + value of that weight defaults to "1". For this reason, if + migrations are expected or possible, users should consider + utilizing global interleave weights.
+ NUMA memory policy supports the following optional mode flags: =20 MPOL_F_STATIC_NODES @@ -516,6 +525,7 @@ Extended Mempolicy Arguments:: __u64 addr; /* get_mempolicy2: policy address */ __s32 policy_node; /* get_mempolicy2: policy node information */ __s32 addr_node; /* get_mempolicy2: memory range policy */ + __aligned_u64 il_weights; /* u8 buf of size pol_maxnodes */ }; =20 The extended mempolicy argument structure is defined to allow the mempolicy diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 117c5395c6eb..c78874bd84dd 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -58,6 +58,7 @@ struct mempolicy { /* Weighted interleave settings */ struct { unsigned char cur_weight; + unsigned char weights[MAX_NUMNODES]; } wil; }; =20 @@ -73,6 +74,7 @@ struct mempolicy_args { unsigned long addr; /* get: vma address */ int addr_node; /* get: node the address belongs to */ int home_node; /* mbind: use MPOL_MF_HOME_NODE */ + unsigned char *il_weights; /* for mode MPOL_WEIGHTED_INTERLEAVE */ }; =20 /* diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 506ea0f8f34e..687c72fbe6a1 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -37,6 +37,7 @@ struct mpol_args { __u64 addr; __s32 policy_node; /* get_mempolicy: policy node info */ __s32 addr_node; /* get_mempolicy: memory range policy */ + __aligned_u64 il_weights; /* size: pol_maxnodes * sizeof(char) */ }; =20 /* Flags for set_mempolicy */ @@ -77,6 +78,7 @@ struct mpol_args { #define MPOL_F_SHARED (1 << 0) /* identify shared policies */ #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ #define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */ +#define MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */ =20 /* * These bit locations are exposed in the vm.zone_reclaim_mode sysctl diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 8f609204fbe7..e5f86e430207 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -271,6 +271,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args= *args) unsigned short mode =3D args->mode; unsigned short flags =3D args->mode_flags; nodemask_t *nodes =3D args->policy_nodes; + int node; =20 if (mode =3D=3D MPOL_DEFAULT) { if (nodes && !nodes_empty(*nodes)) @@ -297,6 +298,19 @@ static struct mempolicy *mpol_new(struct mempolicy_arg= s *args) (flags & MPOL_F_STATIC_NODES) || (flags & MPOL_F_RELATIVE_NODES)) return ERR_PTR(-EINVAL); + } else if (mode =3D=3D MPOL_WEIGHTED_INTERLEAVE) { + /* weighted interleave requires a nodemask and weights > 0 */ + if (nodes_empty(*nodes)) + return ERR_PTR(-EINVAL); + if (args->il_weights) { + node =3D first_node(*nodes); + while (node !=3D MAX_NUMNODES) { + if (!args->il_weights[node]) + return ERR_PTR(-EINVAL); + node =3D next_node(node, *nodes); + } + } else if (!(args->mode_flags & MPOL_F_GWEIGHT)) + return ERR_PTR(-EINVAL); } else if (nodes_empty(*nodes)) return ERR_PTR(-EINVAL); =20 @@ -309,6 +323,16 @@ static struct mempolicy *mpol_new(struct mempolicy_arg= s *args) policy->home_node =3D NUMA_NO_NODE; policy->wil.cur_weight =3D 0; policy->home_node =3D args->home_node; + if (policy->mode =3D=3D MPOL_WEIGHTED_INTERLEAVE && args->il_weights) { + policy->wil.cur_weight =3D 0; + /* Minimum weight value is always 1 */ + memset(policy->wil.weights, 1, MAX_NUMNODES); + node =3D first_node(*nodes); + while (node !=3D MAX_NUMNODES) { + policy->wil.weights[node] =3D args->il_weights[node]; + node =3D next_node(node, *nodes); + } + } =20 return 
policy; } @@ -1518,6 +1542,9 @@ static long kernel_mbind(unsigned long start, unsigne= d long len, if (err) return err; =20 + if (mode & MPOL_WEIGHTED_INTERLEAVE) + mode_flags |=3D MPOL_F_GWEIGHT; + memset(&margs, 0, sizeof(margs)); margs.mode =3D lmode; margs.mode_flags =3D mode_flags; @@ -1611,6 +1638,8 @@ SYSCALL_DEFINE5(mbind2, const struct iovec __user *, = vec, size_t, vlen, struct iovec iovstack[UIO_FASTIOV]; struct iovec *iov =3D iovstack; struct iov_iter iter; + unsigned char weights[MAX_NUMNODES]; + unsigned char *weights_ptr; int err; =20 if (!vec || !vlen) @@ -1648,6 +1677,20 @@ SYSCALL_DEFINE5(mbind2, const struct iovec __user *,= vec, size_t, vlen, } else margs.policy_nodes =3D NULL; =20 + if (kargs.mode =3D=3D MPOL_WEIGHTED_INTERLEAVE) { + weights_ptr =3D u64_to_user_ptr(kargs.il_weights); + err =3D copy_struct_from_user(&weights, + sizeof(weights), + weights_ptr, + kargs.pol_maxnodes); + if (err) + return err; + margs.il_weights =3D weights; + } else { + margs.il_weights =3D NULL; + flags |=3D MPOL_F_GWEIGHT; + } + /* For each address range in vector, do_mbind */ err =3D import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter); @@ -1686,6 +1729,9 @@ static long kernel_set_mempolicy(int mode, const unsi= gned long __user *nmask, if (err) return err; =20 + if (mode & MPOL_WEIGHTED_INTERLEAVE) + mode_flags |=3D MPOL_F_GWEIGHT; + memset(&args, 0, sizeof(args)); args.mode =3D lmode; args.mode_flags =3D mode_flags; @@ -1709,6 +1755,8 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __us= er *, uargs, size_t, usize, int err; nodemask_t policy_nodemask; unsigned long __user *nodes_ptr; + unsigned char weights[MAX_NUMNODES]; + unsigned char __user *weights_ptr; =20 if (flags) return -EINVAL; @@ -1734,6 +1782,20 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __u= ser *, uargs, size_t, usize, } else margs.policy_nodes =3D NULL; =20 + if (kargs.mode =3D=3D MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) { + weights_ptr =3D u64_to_user_ptr(kargs.il_weights); + err =3D copy_struct_from_user(weights, + sizeof(weights), + weights_ptr, + kargs.pol_maxnodes); + if (err) + return err; + margs.il_weights =3D weights; + } else { + margs.il_weights =3D NULL; + flags |=3D MPOL_F_GWEIGHT; + } + return do_set_mempolicy(&margs); } =20 @@ -1935,6 +1997,8 @@ SYSCALL_DEFINE3(get_mempolicy2, struct mpol_args __us= er *, uargs, size_t, usize, int err; nodemask_t policy_nodemask; unsigned long __user *nodes_ptr; + unsigned char *weights_ptr; + unsigned char weights[MAX_NUMNODES]; =20 err =3D copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize); if (err) @@ -1951,6 +2015,9 @@ SYSCALL_DEFINE3(get_mempolicy2, struct mpol_args __us= er *, uargs, size_t, usize, &policy_nodemask); } =20 + if (kargs.il_weights) + margs.il_weights =3D weights; + margs.policy_nodes =3D kargs.pol_nodes ? &policy_nodemask : NULL; if (flags & MPOL_F_ADDR) { margs.addr =3D kargs.addr; @@ -1971,6 +2038,13 @@ SYSCALL_DEFINE3(get_mempolicy2, struct mpol_args __u= ser *, uargs, size_t, usize, margs.policy_nodes); } =20 + if (kargs.il_weights) { + weights_ptr =3D u64_to_user_ptr(kargs.il_weights); + err =3D copy_to_user(weights_ptr, weights, kargs.pol_maxnodes); + if (err) + return err; + } + return copy_to_user(uargs, &kargs, usize) ? 
-EFAULT : 0; } =20 @@ -2087,13 +2161,18 @@ static unsigned int weighted_interleave_nodes(struc= t mempolicy *policy) { unsigned int next; struct task_struct *me =3D current; + unsigned char next_weight; =20 next =3D next_node_in(me->il_prev, policy->nodes); if (next =3D=3D MAX_NUMNODES) return next; =20 - if (!policy->wil.cur_weight) - policy->wil.cur_weight =3D iw_table[next]; + if (!policy->wil.cur_weight) { + next_weight =3D (policy->flags & MPOL_F_GWEIGHT) ? + iw_table[next] : + policy->wil.weights[next]; + policy->wil.cur_weight =3D next_weight ? next_weight : 1; + } =20 policy->wil.cur_weight--; if (!policy->wil.cur_weight) @@ -2167,6 +2246,7 @@ static unsigned int weighted_interleave_nid(struct me= mpolicy *pol, pgoff_t ilx) nodemask_t nodemask =3D pol->nodes; unsigned int target, weight_total =3D 0; int nid; + unsigned char *pol_weights; unsigned char weights[MAX_NUMNODES]; unsigned char weight; =20 @@ -2178,8 +2258,13 @@ static unsigned int weighted_interleave_nid(struct m= empolicy *pol, pgoff_t ilx) return nid; =20 /* Then collect weights on stack and calculate totals */ + if (pol->flags & MPOL_F_GWEIGHT) + pol_weights =3D iw_table; + else + pol_weights =3D pol->wil.weights; + for_each_node_mask(nid, nodemask) { - weight =3D iw_table[nid]; + weight =3D pol_weights[nid]; weight_total +=3D weight; weights[nid] =3D weight; } @@ -2577,6 +2662,7 @@ static unsigned long alloc_pages_bulk_array_weighted_= interleave(gfp_t gfp, unsigned long nr_allocated; unsigned long rounds; unsigned long node_pages, delta; + unsigned char *pol_weights; unsigned char weight; unsigned char weights[MAX_NUMNODES]; unsigned int weight_total; @@ -2590,9 +2676,14 @@ static unsigned long alloc_pages_bulk_array_weighted= _interleave(gfp_t gfp, =20 nnodes =3D nodes_weight(nodes); =20 + if (pol->flags & MPOL_F_GWEIGHT) + pol_weights =3D iw_table; + else + pol_weights =3D pol->wil.weights; + /* Collect weights and save them on stack so they don't change */ for_each_node_mask(node, nodes) { - weight =3D iw_table[node]; + weight =3D pol_weights[node]; weight_total +=3D weight; weights[node] =3D weight; } @@ -3117,6 +3208,7 @@ void mpol_shared_policy_init(struct shared_policy *sp= , struct mempolicy *mpol) { int ret; struct mempolicy_args margs; + unsigned char weights[MAX_NUMNODES]; =20 sp->root =3D RB_ROOT; /* empty tree =3D=3D default mempolicy */ rwlock_init(&sp->lock); @@ -3134,6 +3226,11 @@ void mpol_shared_policy_init(struct shared_policy *s= p, struct mempolicy *mpol) margs.mode_flags =3D mpol->flags; margs.policy_nodes =3D &mpol->w.user_nodemask; margs.home_node =3D NUMA_NO_NODE; + if (margs.mode =3D=3D MPOL_WEIGHTED_INTERLEAVE && + !(margs.mode_flags & MPOL_F_GWEIGHT)) { + memcpy(weights, mpol->wil.weights, sizeof(weights)); + margs.il_weights =3D weights; + } =20 /* contextualize the tmpfs mount point mempolicy to this file */ npol =3D mpol_new(&margs); --=20 2.39.1