From nobody Sat Apr 11 19:32:44 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3DD26C00140 for ; Mon, 8 Aug 2022 06:26:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234810AbiHHG05 (ORCPT ); Mon, 8 Aug 2022 02:26:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58738 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235075AbiHHG0t (ORCPT ); Mon, 8 Aug 2022 02:26:49 -0400 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C48C363B1 for ; Sun, 7 Aug 2022 23:26:44 -0700 (PDT) Received: from pps.filterd (m0098419.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2785mA3g019530; Mon, 8 Aug 2022 06:26:23 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=k42aPm0jJa6aK7tVGRyiVD/jZ+RnS4rG4X5oES7T23Y=; b=N9wD/wRVMBkzOyUlZRZyqpV2tXqYzO0uOx96xwL+duR6aFL91iPSgM9QszBeanynYXeA 5YWJk6Wd5yreSWolr9VUYEFejK/UobwH3ajwK2oT4HjHGwKa0HPJVE+TVRt6YncoS5CP N6iupC6su/kALrMJwofpyeKVBGmX7lnRUMEW7V3TqMhPVywHzDIkUrXJMxBtOrPZWL8m CQuDoZNhNlY8f/ezAatPi+6EPfHwglVci3ENO++4dkoBWqmebY36NxsT2b7txXZ9gDXh eLxe7wLre/f6XaQhad+gFV8bz2Wl8F5Kviiv69WxRLHn+CLfuHFjeDZKYOudgw7HKQLB HQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3htvnj8v15-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:22 +0000 Received: from m0098419.ppops.net (m0098419.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2786QMWH027527; Mon, 8 Aug 2022 06:26:22 
GMT Received: from ppma01wdc.us.ibm.com (fd.55.37a9.ip4.static.sl-reverse.com [169.55.85.253]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3htvnj8v0w-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:22 +0000 Received: from pps.filterd (ppma01wdc.us.ibm.com [127.0.0.1]) by ppma01wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 2786LJCq005549; Mon, 8 Aug 2022 06:26:21 GMT Received: from b03cxnp07027.gho.boulder.ibm.com (b03cxnp07027.gho.boulder.ibm.com [9.17.130.14]) by ppma01wdc.us.ibm.com with ESMTP id 3hsfx92ud4-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:21 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp07027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786QKeM31064514 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:26:20 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A21C478064; Mon, 8 Aug 2022 06:26:20 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id B79337805E; Mon, 8 Aug 2022 06:26:14 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 2022 06:26:14 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v13 1/9] mm/demotion: Add support for explicit memory tiers Date: Mon, 8 Aug 2022 11:55:53 +0530 Message-Id: <20220808062601.836025-2-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: 
<20220808062601.836025-1-aneesh.kumar@linux.ibm.com> References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-GUID: WqncfFwssIDCdVqaZcdNSLzaCzhajT1n X-Proofpoint-ORIG-GUID: vKSGt2pVpfnMD2333Es8mImpDQ8lNvD5 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 impostorscore=0 bulkscore=0 adultscore=0 lowpriorityscore=0 spamscore=0 clxscore=1015 suspectscore=0 mlxlogscore=999 mlxscore=0 priorityscore=1501 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPUs into the highest tier,
and builds the tier hierarchy by establishing the per-node demotion targets
based on the distances between nodes.

This current memory tier kernel implementation needs to be improved for
several important use cases:

The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a
high-performance memory device (e.g. a DRAM-backed memory-only node on a
virtual machine) that should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But on
a system with HBM or GPU devices, the memory-only NUMA nodes mapping these
devices should be in the top tier, and DRAM nodes with CPUs are better
placed into the next lower tier.

With the current kernel, a higher tier node can only be demoted to the
nodes with the shortest distance on the next lower tier as defined by the
demotion path, not to any other node from any lower tier. This strict
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of space). This
demotion order is also inconsistent with the page allocation fallback
order when all the nodes in a higher tier are out of space: the page
allocation can fall back to any node from any lower tier, whereas the
demotion order doesn't allow that.

This patch series addresses the above by defining memory tiers explicitly.

The Linux kernel presents memory devices as NUMA nodes and each memory
device is of a specific type. The memory type of a device is represented
by its abstract distance. A memory tier corresponds to a range of abstract
distances. This allows for classifying memory devices with a specific
performance range into a memory tier.

This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default
DRAM tier, with abstract distance ranges 0 - 127, 128 - 255, 256 - 383 and
384 - 511. Faster memory devices can be placed in these faster (higher)
memory tiers. Slower memory devices like persistent memory will have an
abstract distance higher than the default DRAM level.
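
For illustration only, the range arithmetic described above can be sketched
in plain userspace C. The constants are the ones this patch defines; the
helper names memtier_chunk_start(), memtier_chunk_end() and
memtier_chunk_selfcheck() are hypothetical and do not exist in the kernel.
The mask used below is assumed to match the kernel's round_down() only
because the chunk size is a power of two and the abstract distance is
non-negative.

```c
#include <assert.h>

/* Constants copied from this patch's include/linux/memory-tiers.h. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	(4 * MEMTIER_CHUNK_SIZE)

/*
 * Hypothetical helper, not part of the patch: first abstract distance of
 * the chunk that contains adistance. Assumes adistance >= 0; for a
 * power-of-two chunk size the mask rounds down to a chunk boundary.
 */
static inline int memtier_chunk_start(int adistance)
{
	return adistance & ~(MEMTIER_CHUNK_SIZE - 1);
}

/* Hypothetical helper: last abstract distance covered by the same chunk. */
static inline int memtier_chunk_end(int adistance)
{
	return memtier_chunk_start(adistance) + MEMTIER_CHUNK_SIZE - 1;
}

/*
 * Hypothetical self-check: the four ranges below the default DRAM
 * distance, plus the chunk that the DRAM distance itself starts.
 */
static inline void memtier_chunk_selfcheck(void)
{
	assert(memtier_chunk_start(0) == 0 && memtier_chunk_end(0) == 127);
	assert(memtier_chunk_start(128) == 128 && memtier_chunk_end(128) == 255);
	assert(memtier_chunk_start(256) == 256 && memtier_chunk_end(256) == 383);
	assert(memtier_chunk_start(511) == 384 && memtier_chunk_end(511) == 511);
	assert(memtier_chunk_start(MEMTIER_ADISTANCE_DRAM) == 512);
}
```

With these definitions, an abstract distance of 300 falls into the range
256 - 383, and the default DRAM distance of 512 starts the range 512 - 639.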

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  15 +++++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 107 +++++++++++++++++++++++++++++++++++
 3 files changed, 123 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..bc7c1b799bef
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+/*
+ * Each tier cover a abstrace distance chunk size of 128
+ */
+#define MEMTIER_CHUNK_BITS	7
+#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
+/*
+ * Smaller abstract distance value imply faster(higher) memory tiers.
+ */
+#define MEMTIER_ADISTANCE_DRAM	(4 * MEMTIER_CHUNK_SIZE)
+
+#endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST) += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..78b311d9bde9
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+
+struct memory_tier {
+	/* hierarchy of memory tiers */
+	struct list_head list;
+	/* list of all memory types part of this tier */
+	struct list_head memory_types;
+	/*
+	 * start value of abstract distance. memory tier maps
+	 * an abstract distance range,
+	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
+	 */
+	int adistance_start;
+};
+
+struct memory_dev_type {
+	/* list of memory types that are part of same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct memory_tier *memtier;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
+/*
+ * For now let's have 4 memory tier below default DRAM tier.
+ */
+static struct memory_dev_type default_dram_type = {
+	.adistance = MEMTIER_ADISTANCE_DRAM,
+	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
+};
+
+static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
+{
+	bool found_slot = false;
+	struct memory_tier *memtier, *new_memtier;
+	int adistance = memtype->adistance;
+	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	/*
+	 * If the memtype is already part of a memory tier,
+	 * just return that.
+	 */
+	if (memtype->memtier)
+		return memtype->memtier;
+
+	adistance = round_down(adistance, memtier_adistance_chunk_size);
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (adistance == memtier->adistance_start) {
+			memtype->memtier = memtier;
+			list_add(&memtype->tier_sibiling, &memtier->memory_types);
+			return memtier;
+		} else if (adistance < memtier->adistance_start) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kmalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->adistance_start = adistance;
+	INIT_LIST_HEAD(&new_memtier->list);
+	INIT_LIST_HEAD(&new_memtier->memory_types);
+	if (found_slot)
+		list_add_tail(&new_memtier->list, &memtier->list);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	memtype->memtier = new_memtier;
+	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
+	return new_memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	int node;
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+	/* CPU only nodes are not part of memory tiers. */
+	default_dram_type.nodes = node_states[N_MEMORY];
+
+	memtier = find_create_memory_tier(&default_dram_type);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	for_each_node_state(node, N_MEMORY)
+		node_memory_types[node] = &default_dram_type;
+
+	mutex_unlock(&memory_tier_lock);
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);
-- 
2.37.1

From nobody Sat Apr 11 19:32:44 2026
Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A9238C00140 for ; Mon, 8 Aug 2022 06:27:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236142AbiHHG1B (ORCPT ); Mon, 8 Aug 2022 02:27:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58752 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235526AbiHHG0w (ORCPT ); Mon, 8 Aug 2022 02:26:52 -0400 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D679E634C for ; Sun, 7 Aug 2022 23:26:50 -0700 (PDT) Received: from pps.filterd (m0187473.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2785BpjZ022737; Mon, 8 Aug 2022 06:26:30 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=g6Y0vyvWQrekpKLp+OYnHRje5GiC/d6GC+GD9/zYEnk=; b=guB6193ouzJ84ZtDo1iFm31KlskvjNe2dimK3ur58GD8osqcKYQwr4oWk4eK4miIxTWB dJGjteJnF5xiGJ8FNNRsbJ5uebftafGZQs8maGhpSpowEtvp1tjX3GYCoFf4UdimInA3 nSegv87Gv7uFy/IEKE0gD5xIaJ63KQA2nVg4XWozd8t6zVBBbTRoEulZ8drS1rTaR7Y+ nLwHAbUDKY+Ssj6qOQu6pPRxNqHzydz1QZVvj+ZQ8DyMAG/vXaQpm8tn3LidwgST7C/q
CyrvvlOUO99B8OSdBaTGXlo6drMv/DsxJXmci+BE/XeJ5zrwAweQDmtepcyBpds1oLDY 4g== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htv4g9tkd-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:30 +0000 Received: from m0187473.ppops.net (m0187473.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2786AOMv017762; Mon, 8 Aug 2022 06:26:29 GMT Received: from ppma04wdc.us.ibm.com (1a.90.2fa9.ip4.static.sl-reverse.com [169.47.144.26]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htv4g9tju-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:29 +0000 Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1]) by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 2786L0xL002664; Mon, 8 Aug 2022 06:26:28 GMT Received: from b03cxnp07027.gho.boulder.ibm.com (b03cxnp07027.gho.boulder.ibm.com [9.17.130.14]) by ppma04wdc.us.ibm.com with ESMTP id 3hsfx92ts9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:27 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp07027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786QRVF25362732 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:26:27 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 2C20C7805E; Mon, 8 Aug 2022 06:26:27 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 52BD47805F; Mon, 8 Aug 2022 06:26:21 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 2022 06:26:21 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , 
Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v13 2/9] mm/demotion: Move memory demotion related code Date: Mon, 8 Aug 2022 11:55:54 +0530 Message-Id: <20220808062601.836025-3-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-GUID: 5OzJnOzmKiDTPiP8CyRgWDe38GuyHWJW X-Proofpoint-ORIG-GUID: OJBA2phkeL5B5S-wl-VBzIDfrqqY-wjd X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 adultscore=0 mlxlogscore=999 bulkscore=0 malwarescore=0 spamscore=0 impostorscore=0 suspectscore=0 lowpriorityscore=0 priorityscore=1501 phishscore=0 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

This moves memory demotion related code to mm/memory-tiers.c. No
functional change in this patch.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  8 +++++
 include/linux/migrate.h      |  2 --
 mm/memory-tiers.c            | 64 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 60 +--------------------------------
 mm/vmscan.c                  |  1 +
 5 files changed, 74 insertions(+), 61 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index bc7c1b799bef..9fdd9572fdf9 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -12,4 +12,12 @@
  */
 #define MEMTIER_ADISTANCE_DRAM	(4 * MEMTIER_CHUNK_SIZE)
 
+#ifdef CONFIG_NUMA
+#include
+extern bool numa_demotion_enabled;
+
+#else
+
+#define numa_demotion_enabled	false
+#endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ae5bb67a9ba1..4b07c5ef3903 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -103,7 +103,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
 extern void set_migration_target_nodes(void);
 extern void migrate_on_reclaim_init(void);
-extern bool numa_demotion_enabled;
 extern int next_demotion_node(int node);
 #else
 static inline void set_migration_target_nodes(void) {}
@@ -112,7 +111,6 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
-#define numa_demotion_enabled	false
 #endif
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 78b311d9bde9..391b36ee7afe 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -3,6 +3,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 
 struct memory_tier {
@@ -105,3 +107,65 @@ static int __init memory_tier_init(void)
 	return 0;
 }
 subsys_initcall(memory_tier_init);
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_demotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif /* CONFIG_SYSFS */
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 1b4b977809a1..774225c45354 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2551,64 +2551,6 @@ void __init migrate_on_reclaim_init(void)
 	set_migration_target_nodes();
 	cpus_read_unlock();
 }
+#endif /* CONFIG_NUMA */
 
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled ? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	ssize_t ret;
-
-	ret = kstrtobool(buf, &numa_demotion_enabled);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
 
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif /* CONFIG_SYSFS */
-#endif /* CONFIG_NUMA */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04f8671caad9..5043b10ff71e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
-- 
2.37.1

From nobody Sat Apr 11 19:32:44 2026
Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1D21AC00140 for ; Mon, 8 Aug 2022 06:27:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236220AbiHHG1L (ORCPT ); Mon, 8 Aug 2022 02:27:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58892 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236081AbiHHG07 (ORCPT ); Mon, 8 Aug 2022
02:26:59 -0400 Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DB122DF81 for ; Sun, 7 Aug 2022 23:26:54 -0700 (PDT) Received: from pps.filterd (m0127361.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2785Gb7x030079; Mon, 8 Aug 2022 06:26:36 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=wnDz/K3rZP5TH42YWo1wQx/NBysVAZqvcr6ouezjaMg=; b=XReMCxOaOa4/214odQoFanNkJgB9+OxgxMmr1Rn1a6L6Izg8kNu10tEen1tbq5ubHHAz vxyNDup8pDQFS64i+/idioUhft1UmGAyc3SmaMYg1Thh5cXWmurnWCYyY/2ZIMcD9vtS SYeg10zCDE+ssNuO577XaKs40TSfo8GhfcN+ViHCKr5IXBBpKFQsbV+NWNTtI21AkHDA ybXRmPVTmFiBWEnJTgpnVvfDPwjieX7d9tbu94ngjN49BIZqPbrjTbcq90GMTsTBRkU/ EO/X9IgEOR4u2PyLXnCEqnMFkFxEndQXv1T5RkNpMONa6aJCLKyLOQ5L5MKaztF0V6vr sg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htv6j1mwt-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:36 +0000 Received: from m0127361.ppops.net (m0127361.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2786DciT021071; Mon, 8 Aug 2022 06:26:35 GMT Received: from ppma03wdc.us.ibm.com (ba.79.3fa9.ip4.static.sl-reverse.com [169.63.121.186]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htv6j1mwe-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:35 +0000 Received: from pps.filterd (ppma03wdc.us.ibm.com [127.0.0.1]) by ppma03wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 2786KQi8021601; Mon, 8 Aug 2022 06:26:34 GMT Received: from b03cxnp08025.gho.boulder.ibm.com (b03cxnp08025.gho.boulder.ibm.com [9.17.130.17]) by ppma03wdc.us.ibm.com with ESMTP id 3hsfx9avj3-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 
verify=NOT); Mon, 08 Aug 2022 06:26:34 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp08025.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786QX9P34472354 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:26:33 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 76BA978064; Mon, 8 Aug 2022 06:26:33 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id B84BF7805E; Mon, 8 Aug 2022 06:26:27 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 2022 06:26:27 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v13 3/9] mm/demotion: Add hotplug callbacks to handle new numa node onlined Date: Mon, 8 Aug 2022 11:55:55 +0530 Message-Id: <20220808062601.836025-4-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: 5h1UQYS_cN7i9sJ7P0jUNYHCefm8RQFn X-Proofpoint-GUID: oLOnM5Q-ApxmmDy-sqjwoX4n3KyWposI X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 mlxlogscore=999 suspectscore=0 lowpriorityscore=0 phishscore=0 
malwarescore=0 bulkscore=0 impostorscore=0 priorityscore=1501 mlxscore=0 clxscore=1015 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

If the newly onlined NUMA node doesn't have an abstract distance assigned,
the kernel adds the NUMA node to the default memory tier.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  1 +
 mm/memory-tiers.c            | 87 ++++++++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 9fdd9572fdf9..cc89876899a6 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -11,6 +11,7 @@
  * Smaller abstract distance value imply faster(higher) memory tiers.
  */
 #define MEMTIER_ADISTANCE_DRAM	(4 * MEMTIER_CHUNK_SIZE)
+#define MEMTIER_HOTPLUG_PRIO	100
 
 #ifdef CONFIG_NUMA
 #include
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 391b36ee7afe..2caa5ab446b8 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 #include
 
 struct memory_tier {
@@ -85,6 +86,91 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
 	return new_memtier;
 }
 
+static struct memory_tier *__node_get_memory_tier(int node)
+{
+	struct memory_dev_type *memtype;
+
+	memtype = node_memory_types[node];
+	if (memtype && node_isset(node, memtype->nodes))
+		return memtype->memtier;
+	return NULL;
+}
+
+static struct memory_tier *set_node_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+	struct memory_dev_type *memtype;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	if (!node_state(node, N_MEMORY))
+		return ERR_PTR(-EINVAL);
+
+	if (!node_memory_types[node])
+		node_memory_types[node] = &default_dram_type;
+
+	memtype = node_memory_types[node];
+	node_set(node, memtype->nodes);
+	memtier = find_create_memory_tier(memtype);
+	return memtier;
+}
+
+static void destroy_memory_tier(struct memory_tier *memtier)
+{
+	list_del(&memtier->list);
+	kfree(memtier);
+}
+
+static bool clear_node_memory_tier(int node)
+{
+	bool cleared = false;
+	struct memory_tier *memtier;
+
+	memtier = __node_get_memory_tier(node);
+	if (memtier) {
+		struct memory_dev_type *memtype;
+
+		memtype = node_memory_types[node];
+		node_clear(node, memtype->nodes);
+		if (nodes_empty(memtype->nodes)) {
+			list_del(&memtype->tier_sibiling);
+			memtype->memtier = NULL;
+			if (list_empty(&memtier->memory_types))
+				destroy_memory_tier(memtier);
+		}
+		cleared = true;
+	}
+	return cleared;
+}
+
+static int __meminit memtier_hotplug_callback(struct notifier_block *self,
+					      unsigned long action, void *_arg)
+{
+	struct memory_notify *arg = _arg;
+
+	/*
+	 * Only update the node migration order when a node is
+	 * changing status, like online->offline.
+	 */
+	if (arg->status_change_nid < 0)
+		return notifier_from_errno(0);
+
+	switch (action) {
+	case MEM_OFFLINE:
+		mutex_lock(&memory_tier_lock);
+		clear_node_memory_tier(arg->status_change_nid);
+		mutex_unlock(&memory_tier_lock);
+		break;
+	case MEM_ONLINE:
+		mutex_lock(&memory_tier_lock);
+		set_node_memory_tier(arg->status_change_nid);
+		mutex_unlock(&memory_tier_lock);
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
 static int __init memory_tier_init(void)
 {
 	int node;
@@ -104,6 +190,7 @@ static int __init memory_tier_init(void)
 
 	mutex_unlock(&memory_tier_lock);
 
+	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO);
 	return 0;
 }
 subsys_initcall(memory_tier_init);
-- 
2.37.1

From nobody Sat Apr 11 19:32:44 2026
Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 322CDC00140 for ; Mon, 8 Aug 2022 06:27:19 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230239AbiHHG1R (ORCPT ); Mon, 8 Aug 2022 02:27:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58910 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236124AbiHHG07 (ORCPT ); Mon, 8 Aug 2022 02:26:59 -0400 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4DB83E031 for ; Sun, 7 Aug 2022 23:26:55 -0700 (PDT) Received: from pps.filterd (m0098420.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2784pfBS020969; Mon, 8 Aug 2022 06:26:42 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=4OqbboYNuzFkDfdQOpnppRNomrd/ePIQBxSb2+g6JBw=; b=dxMjj0bPCwqjLRHCLLMOSU/VoPNFO1lY8hRBhNPeXsAdnK83Vs5pGmerLP1p+ydx28+H 9a3cBP6WFmLeYo+t7mdJfFrW72SKLc4lGd4d9TmMFoHIqY1ivW8aoe0YeByEE4+JpR1l eOL9KMRBJJpzPRmcoyP9avnia4H3O18BZ/FEp8v94p/Yn6ggbTBXthLwYDY5RuTALPQy XEXvDzYqNE/p4dNsnK/O3GL4hfCJCv4zv3OowyXKiF/HXeiMzIhTmj6TgNg0B+yTGaxs Vv8haPxhQ1Se9Hs+wqElmPRIFIRbRjI472cS6M/m2LWIAI4bl+MHht0cqyRj6FDAVCZ5 EQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3htutut4bw-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:42 +0000 Received: from m0098420.ppops.net (m0098420.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2785GUsG032417; Mon, 8 Aug 2022 06:26:41 GMT Received: from ppma02wdc.us.ibm.com (aa.5b.37a9.ip4.static.sl-reverse.com [169.55.91.170]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3htutut4bh-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:41 +0000 Received: from pps.filterd (ppma02wdc.us.ibm.com 
[127.0.0.1]) by ppma02wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 2786KWRf028615; Mon, 8 Aug 2022 06:26:40 GMT Received: from b03cxnp08025.gho.boulder.ibm.com (b03cxnp08025.gho.boulder.ibm.com [9.17.130.17]) by ppma02wdc.us.ibm.com with ESMTP id 3hsfx92v76-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:40 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp08025.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786QeeT40108350 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:26:40 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 0AAA77805F; Mon, 8 Aug 2022 06:26:40 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 21B717805E; Mon, 8 Aug 2022 06:26:34 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 2022 06:26:33 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v13 4/9] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE Date: Mon, 8 Aug 2022 11:55:56 +0530 Message-Id: <20220808062601.836025-5-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: nNXIQDMKDPxeqLO_l5s4uOBmiFiL9fRr X-Proofpoint-GUID: 
2FNQdhzcTDT1VLCGPLSJDZHo--Z_b43p
X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 adultscore=0 priorityscore=1501 phishscore=0 spamscore=0 suspectscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 bulkscore=0 lowpriorityscore=0 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

By default, all nodes are assigned to the default memory tier, which is
the memory tier designated for nodes with DRAM.

Set the dax kmem device node's tier to a slower memory tier by assigning
its abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE. Low-level drivers
like papr_scm or ACPI NFIT can initialize the memory device type to a more
accurate value based on device tree details or HMAT. If the kernel doesn't
find the memory type initialized, a default slower memory type is assigned
by the kmem driver.

Signed-off-by: Aneesh Kumar K.V
---
 drivers/dax/kmem.c           | 40 ++++++++++++++++++-
 include/linux/memory-tiers.h | 26 ++++++++++++-
 mm/memory-tiers.c            | 74 +++++++++++++++++++++++++-----------
 3 files changed, 115 insertions(+), 25 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..b5cb03307af8 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,9 +11,17 @@
 #include
 #include
 #include
+#include
 #include "dax-private.h"
 #include "bus.h"
 
+/*
+ * Default abstract distance assigned to the NUMA node onlined
+ * by DAX/kmem if the low level platform driver didn't initialize
+ * one for this NUMA node.
+ */
+#define MEMTIER_DEFAULT_DAX_ADISTANCE (MEMTIER_ADISTANCE_DRAM * 2)
+
 /* Memory resource name used for add_memory_driver_managed().
*/
 static const char *kmem_name;
 /* Set if any memory will remain added when the driver will be unloaded. */
@@ -41,6 +49,7 @@ struct dax_kmem_data {
 	struct resource *res[];
 };
 
+static struct memory_dev_type *dax_slowmem_type;
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
@@ -62,6 +71,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		return -EINVAL;
 	}
 
+	init_node_memory_type(numa_node, dax_slowmem_type);
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;
 
@@ -162,6 +173,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
 	int i, success = 0;
+	int node = dev_dax->target_node;
 	struct device *dev = &dev_dax->dev;
 	struct dax_kmem_data *data = dev_get_drvdata(dev);
 
@@ -198,6 +210,14 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 		kfree(data->res_name);
 		kfree(data);
 		dev_set_drvdata(dev, NULL);
+		/*
+		 * Clear the memtype association on successful unplug.
+		 * If not, we have memory blocks left which can be
+		 * offlined/onlined later. We need to keep memory_dev_type
+		 * for that. This implies this reference will be around
+		 * till next reboot.
+		 */
+		clear_node_memory_type(node, dax_slowmem_type);
 	}
 }
 #else
@@ -228,9 +248,27 @@ static int __init dax_kmem_init(void)
 	if (!kmem_name)
 		return -ENOMEM;
 
+	dax_slowmem_type = kmalloc(sizeof(*dax_slowmem_type), GFP_KERNEL);
+	if (!dax_slowmem_type) {
+		rc = -ENOMEM;
+		goto kmem_name_free;
+	}
+	dax_slowmem_type->adistance = MEMTIER_DEFAULT_DAX_ADISTANCE;
+	INIT_LIST_HEAD(&dax_slowmem_type->tier_sibiling);
+	dax_slowmem_type->nodes = NODE_MASK_NONE;
+	dax_slowmem_type->memtier = NULL;
+	kref_init(&dax_slowmem_type->kref);
+
 	rc = dax_driver_register(&device_dax_kmem_driver);
 	if (rc)
-		kfree_const(kmem_name);
+		goto error_out;
+
+	return rc;
+
+error_out:
+	kfree(dax_slowmem_type);
+kmem_name_free:
+	kfree_const(kmem_name);
 	return rc;
 }
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index cc89876899a6..7bf6f47d581a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_MEMORY_TIERS_H
 #define _LINUX_MEMORY_TIERS_H
 
+#include
+#include
 /*
  * Each tier cover a abstrace distance chunk size of 128
  */
@@ -13,12 +15,34 @@
 #define MEMTIER_ADISTANCE_DRAM (4 * MEMTIER_CHUNK_SIZE)
 #define MEMTIER_HOTPLUG_PRIO 100
 
+struct memory_tier;
+struct memory_dev_type {
+	/* list of memory types that are part of same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct kref kref;
+	struct memory_tier *memtier;
+};
+
 #ifdef CONFIG_NUMA
-#include
 extern bool numa_demotion_enabled;
+void init_node_memory_type(int node, struct memory_dev_type *default_type);
+void clear_node_memory_type(int node, struct memory_dev_type *memtype);
 
 #else
 
 #define numa_demotion_enabled false
+static inline void init_node_memory_type(int node, struct memory_dev_type *default_type)
+{
+
+}
+
+static inline void clear_node_memory_type(int node, struct
memory_dev_type *memtype)
+{
+
+}
 #endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 2caa5ab446b8..e07dffb67567 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -1,6 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0
-#include
-#include
 #include
 #include
 #include
@@ -21,26 +19,10 @@ struct memory_tier {
 	int adistance_start;
 };
 
-struct memory_dev_type {
-	/* list of memory types that are part of same tier as this type */
-	struct list_head tier_sibiling;
-	/* abstract distance for this specific memory type */
-	int adistance;
-	/* Nodes of same abstract distance */
-	nodemask_t nodes;
-	struct memory_tier *memtier;
-};
-
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
 static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
-/*
- * For now let's have 4 memory tier below default DRAM tier.
- */
-static struct memory_dev_type default_dram_type = {
-	.adistance = MEMTIER_ADISTANCE_DRAM,
-	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
-};
+static struct memory_dev_type *default_dram_type;
 
 static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
 {
@@ -96,6 +78,14 @@ static struct memory_tier *__node_get_memory_tier(int node)
 	return NULL;
 }
 
+static inline void __init_node_memory_type(int node, struct memory_dev_type *default_type)
+{
+	if (!node_memory_types[node]) {
+		node_memory_types[node] = default_type;
+		kref_get(&default_type->kref);
+	}
+}
+
 static struct memory_tier *set_node_memory_tier(int node)
 {
 	struct memory_tier *memtier;
@@ -107,7 +97,7 @@ static struct memory_tier *set_node_memory_tier(int node)
 		return ERR_PTR(-EINVAL);
 
 	if (!node_memory_types[node])
-		node_memory_types[node] = &default_dram_type;
+		__init_node_memory_type(node, default_dram_type);
 
 	memtype = node_memory_types[node];
 	node_set(node, memtype->nodes);
@@ -143,6 +133,34 @@ static bool clear_node_memory_tier(int
node)
 	return cleared;
 }
 
+void init_node_memory_type(int node, struct memory_dev_type *default_type)
+{
+
+	mutex_lock(&memory_tier_lock);
+	__init_node_memory_type(node, default_type);
+	mutex_unlock(&memory_tier_lock);
+}
+EXPORT_SYMBOL_GPL(init_node_memory_type);
+
+static void release_memtype(struct kref *kref)
+{
+	struct memory_dev_type *memtype;
+
+	memtype = container_of(kref, struct memory_dev_type, kref);
+	kfree(memtype);
+}
+
+void clear_node_memory_type(int node, struct memory_dev_type *memtype)
+{
+	mutex_lock(&memory_tier_lock);
+	if (node_memory_types[node] == memtype) {
+		node_memory_types[node] = NULL;
+		kref_put(&memtype->kref, release_memtype);
+	}
+	mutex_unlock(&memory_tier_lock);
+}
+EXPORT_SYMBOL_GPL(clear_node_memory_type);
+
 static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 					      unsigned long action, void *_arg)
 {
@@ -176,17 +194,27 @@ static int __init memory_tier_init(void)
 	int node;
 	struct memory_tier *memtier;
 
+	default_dram_type = kmalloc(sizeof(*default_dram_type), GFP_KERNEL);
+	if (!default_dram_type)
+		panic("%s() failed to allocate default DRAM tier\n", __func__);
+
 	mutex_lock(&memory_tier_lock);
+
+	/* For now let's have 4 memory tier below default DRAM tier. */
+	default_dram_type->adistance = MEMTIER_ADISTANCE_DRAM;
+	INIT_LIST_HEAD(&default_dram_type->tier_sibiling);
+	default_dram_type->memtier = NULL;
+	kref_init(&default_dram_type->kref);
 	/* CPU only nodes are not part of memory tiers.
*/
-	default_dram_type.nodes = node_states[N_MEMORY];
+	default_dram_type->nodes = node_states[N_MEMORY];
 
-	memtier = find_create_memory_tier(&default_dram_type);
+	memtier = find_create_memory_tier(default_dram_type);
 	if (IS_ERR(memtier))
 		panic("%s() failed to register memory tier: %ld\n",
 		      __func__, PTR_ERR(memtier));
 
 	for_each_node_state(node, N_MEMORY)
-		node_memory_types[node] = &default_dram_type;
+		__init_node_memory_type(node, default_dram_type);
 
 	mutex_unlock(&memory_tier_lock);
 
-- 
2.37.1

From nobody Sat Apr 11 19:32:44 2026
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6FD7AC00140 for ; Mon, 8 Aug 2022 06:27:42 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237010AbiHHG1k (ORCPT ); Mon, 8 Aug 2022 02:27:40 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59136 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236243AbiHHG1M (ORCPT ); Mon, 8 Aug 2022 02:27:12 -0400
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 21AE11115B for ; Sun, 7 Aug 2022 23:27:09 -0700 (PDT)
Received: from pps.filterd (m0098399.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2784feZa003581; Mon, 8 Aug 2022 06:26:50 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=QUv/VloxAQWU5dCwm7SJvMvPC6UlGem3pe5pmQSWv+s=; b=aOcOt+Yep3Y90kB7EuTOGKvsZYbXCrncSeGsBrGgzyTHfq5kqsWKQaTDCzX/MINdkh7g 9Tc/3WeJQTMnlWhJaEOy73uamlmoC4h5dprrqJAH40mJkdd7CWacyQG58JdTwbYJSwTW
5r1E0Sn0aYdESVXmRfGIsaaFICa2EdMVd2jGW+rMcAZ1f1Gei7TNSVGfc0vlaAaiZvgb r0iGxN0LefzF8mkyYkHIiIHqUKGiiDF5xbgo+QDv3PdSzbxb4WrJxQ5eJh4FiMg7qZSy mlPd4BXc0PWrgDuzZmL1zKFHIuPiuLhoFUqj+fhJw6pzyxCfprhFSqA0mqxqeVAm28yK JA== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htupc2kkk-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:50 +0000 Received: from m0098399.ppops.net (m0098399.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2786HufI009996; Mon, 8 Aug 2022 06:26:49 GMT Received: from ppma04wdc.us.ibm.com (1a.90.2fa9.ip4.static.sl-reverse.com [169.47.144.26]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htupc2kk5-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:49 +0000 Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1]) by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 2786KucM002626; Mon, 8 Aug 2022 06:26:47 GMT Received: from b03cxnp07029.gho.boulder.ibm.com (b03cxnp07029.gho.boulder.ibm.com [9.17.130.16]) by ppma04wdc.us.ibm.com with ESMTP id 3hsfx92tt8-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:47 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp07029.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786QkHg41877894 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:26:46 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id CC47778066; Mon, 8 Aug 2022 06:26:46 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A8F3A7805E; Mon, 8 Aug 2022 06:26:40 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 
2022 06:26:40 +0000 (GMT)
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V"
Subject: [PATCH v13 5/9] mm/demotion: Build demotion targets based on explicit memory tiers
Date: Mon, 8 Aug 2022 11:55:57 +0530
Message-Id: <20220808062601.836025-6-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com>
References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-TM-AS-GCONF: 00
X-Proofpoint-ORIG-GUID: JizEg6uVGdAOLnnSdrsjh0ERX3xjMybh
X-Proofpoint-GUID: 601MKBtI32WDxEwEC-YZ1KUuALsNOGK_
X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 malwarescore=0 bulkscore=0 mlxlogscore=999 impostorscore=0 priorityscore=1501 lowpriorityscore=0 adultscore=0 clxscore=1015 suspectscore=0 spamscore=0 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

This patch switches the demotion target building logic to use memory
tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in
the default memory tier, and additional memory tiers will be added by
drivers like dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest
node in the immediately following memory tier is used as a demotion
target.
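
As a rough illustration of the selection rule described above, here is a
minimal userspace sketch. It is not the kernel implementation: the fixed
distance table (copied from "Example 1" in the node_demotion[] comment of
this patch), the node_tier[] array, the plain bitmask standing in for a
nodemask_t, and the helper name preferred_targets() are all assumptions
made for the example. It also assumes tiers are numbered contiguously,
whereas the patch walks a list to the next existing tier.

```c
#define NR_NODES 4

/* Assumed distance table: Example 1 (nodes 0-1 DRAM, nodes 2-3 PMEM). */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 40, 30 },
	{ 30, 40, 10, 40 },
	{ 40, 30, 40, 10 },
};

/* Assumed tier assignment: tier 0 holds nodes 0-1, tier 1 holds nodes 2-3. */
static const int node_tier[NR_NODES] = { 0, 0, 1, 1 };

/*
 * Hypothetical helper: return a bitmask of the preferred demotion targets
 * for @node, i.e. every node in the immediately lower tier that shares
 * the smallest distance from @node. An empty mask means @node is terminal.
 */
static unsigned int preferred_targets(int node)
{
	unsigned int mask = 0;
	int best = -1;

	for (int t = 0; t < NR_NODES; t++) {
		/* Only nodes in the next lower tier are candidates. */
		if (node_tier[t] != node_tier[node] + 1)
			continue;
		if (best == -1 || distance[node][t] < best) {
			/* Strictly closer node: restart the preferred set. */
			best = distance[node][t];
			mask = 1u << t;
		} else if (distance[node][t] == best) {
			/* Equally close node: add it to the preferred set. */
			mask |= 1u << t;
		}
	}
	return mask;
}
```

With these assumed inputs, node 0 prefers node 2 and node 1 prefers node 3,
while the PMEM nodes have no target, which agrees with Example 1.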

Since we are now only building demotion targets for N_MEMORY NUMA nodes,
the CPU hotplug calls are removed in this patch.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  13 ++
 include/linux/migrate.h      |  13 --
 mm/memory-tiers.c            | 219 ++++++++++++++++++-
 mm/migrate.c                 | 394 -----------------------------------
 mm/vmstat.c                  |   4 -
 5 files changed, 229 insertions(+), 414 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 7bf6f47d581a..c8cd593fa2df 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -31,6 +31,14 @@ struct memory_dev_type {
 extern bool numa_demotion_enabled;
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
+#ifdef CONFIG_MIGRATION
+int next_demotion_node(int node);
+#else
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
+#endif
 
 #else
 
@@ -44,5 +52,10 @@ static inline void clear_node_memory_type(int node, struct memory_dev_type *memt
 {
 
 }
+
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
 #endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4b07c5ef3903..c09880a6e1d2 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -100,19 +100,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #endif /* CONFIG_MIGRATION */
 
-#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
-extern void set_migration_target_nodes(void);
-extern void migrate_on_reclaim_init(void);
-extern int next_demotion_node(int node);
-#else
-static inline void set_migration_target_nodes(void) {}
-static inline void migrate_on_reclaim_init(void) {}
-static inline int next_demotion_node(int node)
-{
-	return NUMA_NO_NODE;
-}
-#endif
-
 #ifdef CONFIG_COMPACTION
 bool PageMovable(struct page *page);
 void
__SetPageMovable(struct page *page, const struct movable_operations *ops);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index e07dffb67567..02e514e87d5c 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -4,8 +4,11 @@
 #include
 #include
 #include
+#include
 #include
 
+#include "internal.h"
+
 struct memory_tier {
 	/* hierarchy of memory tiers */
 	struct list_head list;
@@ -19,10 +22,74 @@ struct memory_tier {
 	int adistance_start;
 };
 
+struct demotion_nodes {
+	nodemask_t preferred;
+};
+
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
 static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
 static struct memory_dev_type *default_dram_type;
+#ifdef CONFIG_MIGRATION
+/*
+ * node_demotion[] examples:
+ *
+ * Example 1:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
+ *
+ * node distances:
+ * node   0    1    2    3
+ *    0  10   20   30   40
+ *    1  20   10   40   30
+ *    2  30   40   10   40
+ *    3  40   30   40   10
+ *
+ * memory_tiers0 = 0-1
+ * memory_tiers1 = 2-3
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 3
+ * node_demotion[2].preferred =
+ * node_demotion[3].preferred =
+ *
+ * Example 2:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   30
+ *    2  30   30   10
+ *
+ * memory_tiers0 = 0-2
+ *
+ * node_demotion[0].preferred =
+ * node_demotion[1].preferred =
+ * node_demotion[2].preferred =
+ *
+ * Example 3:
+ *
+ * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   40
+ *    2  30   40   10
+ *
+ * memory_tiers0 = 1
+ * memory_tiers1 = 0
+ * memory_tiers2 = 2
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 0
+ * node_demotion[2].preferred =
+ *
+ */
+static struct demotion_nodes *node_demotion __read_mostly;
+#endif /* CONFIG_MIGRATION */
 
 static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
 {
@@ -78,6 +145,144 @@ static struct memory_tier *__node_get_memory_tier(int node)
 	return NULL;
 }
 
+#ifdef CONFIG_MIGRATION
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * Return: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal. This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+	struct demotion_nodes *nd;
+	int target;
+
+	if (!node_demotion)
+		return NUMA_NO_NODE;
+
+	nd = &node_demotion[node];
+
+	/*
+	 * node_demotion[] is updated without excluding this
+	 * function from running.
+	 *
+	 * Make sure to use RCU over entire code blocks if
+	 * node_demotion[] reads need to be consistent.
+	 */
+	rcu_read_lock();
+	/*
+	 * If there are multiple target nodes, just select one
+	 * target node randomly.
+	 *
+	 * In addition, we can also use round-robin to select
+	 * target node, but we should introduce another variable
+	 * for node_demotion[] to record last selected target node,
+	 * that may cause cache ping-pong due to the changing of
+	 * last target node. Or introducing per-cpu data to avoid
+	 * caching issue, which seems more complicated. So selecting
+	 * target node randomly seems better until now.
+	 */
+	target = node_random(&nd->preferred);
+	rcu_read_unlock();
+
+	return target;
+}
+
+static void disable_all_demotion_targets(void)
+{
+	int node;
+
+	for_each_node_state(node, N_MEMORY)
+		node_demotion[node].preferred = NODE_MASK_NONE;
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after. They will never see before and
+	 * after state together.
+	 */
+	synchronize_rcu();
+}
+
+static __always_inline nodemask_t get_memtier_nodemask(struct memory_tier *memtier)
+{
+	nodemask_t nodes = NODE_MASK_NONE;
+	struct memory_dev_type *memtype;
+
+	list_for_each_entry(memtype, &memtier->memory_types, tier_sibiling)
+		nodes_or(nodes, nodes, memtype->nodes);
+
+	return nodes;
+}
+
+/*
+ * Find an automatic demotion target for all memory
+ * nodes. Failing here is OK. It might just indicate
+ * being at the end of a chain.
+ */
+static void establish_demotion_targets(void)
+{
+	struct memory_tier *memtier;
+	struct demotion_nodes *nd;
+	int target = NUMA_NO_NODE, node;
+	int distance, best_distance;
+	nodemask_t tier_nodes;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
+		return;
+
+	disable_all_demotion_targets();
+
+	for_each_node_state(node, N_MEMORY) {
+		best_distance = -1;
+		nd = &node_demotion[node];
+
+		memtier = __node_get_memory_tier(node);
+		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
+			continue;
+		/*
+		 * Get the lower memtier to find the demotion node list.
+		 */
+		memtier = list_next_entry(memtier, list);
+		tier_nodes = get_memtier_nodemask(memtier);
+		/*
+		 * find_next_best_node, use 'used' nodemask as a skip list.
+		 * Add all memory nodes except the selected memory tier
+		 * nodelist to skip list so that we find the best node from the
+		 * memtier nodelist.
+		 */
+		nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes);
+
+		/*
+		 * Find all the nodes in the memory tier node list of same best distance.
+		 * add them to the preferred mask. We randomly select between nodes
+		 * in the preferred mask when allocating pages during demotion.
+		 */
+		do {
+			target = find_next_best_node(node, &tier_nodes);
+			if (target == NUMA_NO_NODE)
+				break;
+
+			distance = node_distance(node, target);
+			if (distance == best_distance || best_distance == -1) {
+				best_distance = distance;
+				node_set(target, nd->preferred);
+			} else {
+				break;
+			}
+		} while (1);
+	}
+}
+
+#else
+static inline void disable_all_demotion_targets(void) {}
+static inline void establish_demotion_targets(void) {}
+#endif /* CONFIG_MIGRATION */
+
 static inline void __init_node_memory_type(int node, struct memory_dev_type *default_type)
 {
 	if (!node_memory_types[node]) {
@@ -164,6 +369,7 @@ EXPORT_SYMBOL_GPL(clear_node_memory_type);
 static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 					      unsigned long action, void *_arg)
 {
+	struct memory_tier *memtier;
 	struct memory_notify *arg = _arg;
 
 	/*
@@ -176,12 +382,15 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 	switch (action) {
 	case MEM_OFFLINE:
 		mutex_lock(&memory_tier_lock);
-		clear_node_memory_tier(arg->status_change_nid);
+		if (clear_node_memory_tier(arg->status_change_nid))
+			establish_demotion_targets();
 		mutex_unlock(&memory_tier_lock);
 		break;
 	case MEM_ONLINE:
 		mutex_lock(&memory_tier_lock);
-		set_node_memory_tier(arg->status_change_nid);
+		memtier = set_node_memory_tier(arg->status_change_nid);
+		if (!IS_ERR(memtier))
+			establish_demotion_targets();
 		mutex_unlock(&memory_tier_lock);
 		break;
 	}
@@ -217,7 +426,11 @@ static int __init memory_tier_init(void)
 		__init_node_memory_type(node, default_dram_type);
 
 	mutex_unlock(&memory_tier_lock);
-
+#ifdef CONFIG_MIGRATION
+	node_demotion = kcalloc(nr_node_ids, sizeof(struct demotion_nodes),
+
GFP_KERNEL);
+	WARN_ON(!node_demotion);
+#endif
 	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO);
 	return 0;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 774225c45354..45290ddd3806 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2159,398 +2159,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	return 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
-
-/*
- * node_demotion[] example:
- *
- * Consider a system with two sockets. Each socket has
- * three classes of memory attached: fast, medium and slow.
- * Each memory class is placed in its own NUMA node. The
- * CPUs are placed in the node with the "fast" memory. The
- * 6 NUMA nodes (0-5) might be split among the sockets like
- * this:
- *
- *	Socket A: 0, 1, 2
- *	Socket B: 3, 4, 5
- *
- * When Node 0 fills up, its memory should be migrated to
- * Node 1. When Node 1 fills up, it should be migrated to
- * Node 2. The migration path start on the nodes with the
- * processors (since allocations default to this node) and
- * fast memory, progress through medium and end with the
- * slow memory:
- *
- *	0 -> 1 -> 2 -> stop
- *	3 -> 4 -> 5 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *
- *	{ nr=1, nodes[0]=1 }, // Node 0 migrates to 1
- *	{ nr=1, nodes[0]=2 }, // Node 1 migrates to 2
- *	{ nr=0, nodes[0]=-1 }, // Node 2 does not migrate
- *	{ nr=1, nodes[0]=4 }, // Node 3 migrates to 4
- *	{ nr=1, nodes[0]=5 }, // Node 4 migrates to 5
- *	{ nr=0, nodes[0]=-1 }, // Node 5 does not migrate
- *
- * Moreover some systems may have multiple slow memory nodes.
- * Suppose a system has one socket with 3 memory nodes, node 0
- * is fast memory type, and node 1/2 both are slow memory
- * type, and the distance between fast memory node and slow
- * memory node is same.
So the migration path should be:
- *
- *	0 -> 1/2 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
- *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
- *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
- */
-
-/*
- * Writes to this array occur without locking. Cycles are
- * not allowed: Node X demotes to Y which demotes to X...
- *
- * If multiple reads are performed, a single rcu_read_lock()
- * must be held over all reads to ensure that no cycles are
- * observed.
- */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
-	unsigned short nr;
-	short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
-
-/**
- * next_demotion_node() - Get the next node in the demotion path
- * @node: The starting node to lookup the next node
- *
- * Return: node id for next memory node in the demotion path hierarchy
- * from @node; NUMA_NO_NODE if @node is terminal. This does not keep
- * @node online or guarantee that it *continues* to be the next demotion
- * target.
- */
-int next_demotion_node(int node)
-{
-	struct demotion_nodes *nd;
-	unsigned short target_nr, index;
-	int target;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	/*
-	 * node_demotion[] is updated without excluding this
-	 * function from running. RCU doesn't provide any
-	 * compiler barriers, so the READ_ONCE() is required
-	 * to avoid compiler reordering or read merging.
-	 *
-	 * Make sure to use RCU over entire code blocks if
-	 * node_demotion[] reads need to be consistent.
-	 */
-	rcu_read_lock();
-	target_nr = READ_ONCE(nd->nr);
-
-	switch (target_nr) {
-	case 0:
-		target = NUMA_NO_NODE;
-		goto out;
-	case 1:
-		index = 0;
-		break;
-	default:
-		/*
-		 * If there are multiple target nodes, just select one
-		 * target node randomly.
-		 *
-		 * In addition, we can also use round-robin to select
-		 * target node, but we should introduce another variable
-		 * for node_demotion[] to record last selected target node,
-		 * that may cause cache ping-pong due to the changing of
-		 * last target node. Or introducing per-cpu data to avoid
-		 * caching issue, which seems more complicated. So selecting
-		 * target node randomly seems better until now.
-		 */
-		index = get_random_int() % target_nr;
-		break;
-	}
-
-	target = READ_ONCE(nd->nodes[index]);
-
-out:
-	rcu_read_unlock();
-	return target;
-}
-
-/* Disable reclaim-based migration. */
-static void __disable_all_migrate_targets(void)
-{
-	int node, i;
-
-	if (!node_demotion)
-		return;
-
-	for_each_online_node(node) {
-		node_demotion[node].nr = 0;
-		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
-			node_demotion[node].nodes[i] = NUMA_NO_NODE;
-	}
-}
-
-static void disable_all_migrate_targets(void)
-{
-	__disable_all_migrate_targets();
-
-	/*
-	 * Ensure that the "disable" is visible across the system.
-	 * Readers will see either a combination of before+disable
-	 * state or disable+after. They will never see before and
-	 * after state together.
-	 *
-	 * The before+after state together might have cycles and
-	 * could cause readers to do things like loop until this
-	 * function finishes. This ensures they can only see a
-	 * single "bad" read and would, for instance, only loop
-	 * once.
-	 */
-	synchronize_rcu();
-}
-
-/*
- * Find an automatic demotion target for 'node'.
- * Failing here is OK. It might just indicate
- * being at the end of a chain.
- */
-static int establish_migrate_target(int node, nodemask_t *used,
-				    int best_distance)
-{
-	int migration_target, index, val;
-	struct demotion_nodes *nd;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	migration_target = find_next_best_node(node, used);
-	if (migration_target == NUMA_NO_NODE)
-		return NUMA_NO_NODE;
-
-	/*
-	 * If the node has been set a migration target node before,
-	 * which means it's the best distance between them. Still
-	 * check if this node can be demoted to other target nodes
-	 * if they have a same best distance.
-	 */
-	if (best_distance != -1) {
-		val = node_distance(node, migration_target);
-		if (val > best_distance)
-			goto out_clear;
-	}
-
-	index = nd->nr;
-	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
-		      "Exceeds maximum demotion target nodes\n"))
-		goto out_clear;
-
-	nd->nodes[index] = migration_target;
-	nd->nr++;
-
-	return migration_target;
-out_clear:
-	node_clear(migration_target, *used);
-	return NUMA_NO_NODE;
-}
-
-/*
- * When memory fills up on a node, memory contents can be
- * automatically migrated to another node instead of
- * discarded at reclaim.
- *
- * Establish a "migration path" which will start at nodes
- * with CPUs and will follow the priorities used to build the
- * page allocator zonelists.
- *
- * The difference here is that cycles must be avoided. If
- * node0 migrates to node1, then neither node1, nor anything
- * node1 migrates to can migrate to node0. Also one node can
- * be migrated to multiple nodes if the target nodes all have
- * a same best-distance against the source node.
- *
- * This function can run simultaneously with readers of
- * node_demotion[]. However, it can not run simultaneously
- * with itself. Exclusion is provided by memory hotplug events
- * being single-threaded.
- */ -static void __set_migration_target_nodes(void) -{ - nodemask_t next_pass; - nodemask_t this_pass; - nodemask_t used_targets =3D NODE_MASK_NONE; - int node, best_distance; - - /* - * Avoid any oddities like cycles that could occur - * from changes in the topology. This will leave - * a momentary gap when migration is disabled. - */ - disable_all_migrate_targets(); - - /* - * Allocations go close to CPUs, first. Assume that - * the migration path starts at the nodes with CPUs. - */ - next_pass =3D node_states[N_CPU]; -again: - this_pass =3D next_pass; - next_pass =3D NODE_MASK_NONE; - /* - * To avoid cycles in the migration "graph", ensure - * that migration sources are not future targets by - * setting them in 'used_targets'. Do this only - * once per pass so that multiple source nodes can - * share a target node. - * - * 'used_targets' will become unavailable in future - * passes. This limits some opportunities for - * multiple source nodes to share a destination. - */ - nodes_or(used_targets, used_targets, this_pass); - - for_each_node_mask(node, this_pass) { - best_distance =3D -1; - - /* - * Try to set up the migration path for the node, and the target - * migration nodes can be multiple, so doing a loop to find all - * the target nodes if they all have a best node distance. - */ - do { - int target_node =3D - establish_migrate_target(node, &used_targets, - best_distance); - - if (target_node =3D=3D NUMA_NO_NODE) - break; - - if (best_distance =3D=3D -1) - best_distance =3D node_distance(node, target_node); - - /* - * Visit targets from this pass in the next pass. - * Eventually, every node will have been part of - * a pass, and will become set in 'used_targets'. - */ - node_set(target_node, next_pass); - } while (1); - } - /* - * 'next_pass' contains nodes which became migration - * targets in this pass. Make additional passes until - * no more migrations targets are available. 
- */ - if (!nodes_empty(next_pass)) - goto again; -} - -/* - * For callers that do not hold get_online_mems() already. - */ -void set_migration_target_nodes(void) -{ - get_online_mems(); - __set_migration_target_nodes(); - put_online_mems(); -} - -/* - * This leaves migrate-on-reclaim transiently disabled between - * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs - * whether reclaim-based migration is enabled or not, which - * ensures that the user can turn reclaim-based migration at - * any time without needing to recalculate migration targets. - * - * These callbacks already hold get_online_mems(). That is why - * __set_migration_target_nodes() can be used as opposed to - * set_migration_target_nodes(). - */ -#ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit migrate_on_reclaim_callback(struct notifier_block *se= lf, - unsigned long action, void *_arg) -{ - struct memory_notify *arg =3D _arg; - - /* - * Only update the node migration order when a node is - * changing status, like online->offline. This avoids - * the overhead of synchronize_rcu() in most cases. - */ - if (arg->status_change_nid < 0) - return notifier_from_errno(0); - - switch (action) { - case MEM_GOING_OFFLINE: - /* - * Make sure there are not transient states where - * an offline node is a migration target. This - * will leave migration disabled until the offline - * completes and the MEM_OFFLINE case below runs. - */ - disable_all_migrate_targets(); - break; - case MEM_OFFLINE: - case MEM_ONLINE: - /* - * Recalculate the target nodes once the node - * reaches its final state (online or offline). - */ - __set_migration_target_nodes(); - break; - case MEM_CANCEL_OFFLINE: - /* - * MEM_GOING_OFFLINE disabled all the migration - * targets. Reenable them. 
- */ - __set_migration_target_nodes(); - break; - case MEM_GOING_ONLINE: - case MEM_CANCEL_ONLINE: - break; - } - - return notifier_from_errno(0); -} -#endif - -void __init migrate_on_reclaim_init(void) -{ - node_demotion =3D kcalloc(nr_node_ids, - sizeof(struct demotion_nodes), - GFP_KERNEL); - WARN_ON(!node_demotion); -#ifdef CONFIG_MEMORY_HOTPLUG - hotplug_memory_notifier(migrate_on_reclaim_callback, 100); -#endif - /* - * At this point, all numa nodes with memory/CPus have their state - * properly set, so we can build the demotion order now. - * Let us hold the cpu_hotplug lock just, as we could possibily have - * CPU hotplug events during boot. - */ - cpus_read_lock(); - set_migration_target_nodes(); - cpus_read_unlock(); -} #endif /* CONFIG_NUMA */ - - diff --git a/mm/vmstat.c b/mm/vmstat.c index 373d2730fcf2..35c6ff97cf29 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -28,7 +28,6 @@ #include #include #include -#include =20 #include "internal.h" =20 @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu) =20 if (!node_state(cpu_to_node(cpu), N_CPU)) { node_set_state(cpu_to_node(cpu), N_CPU); - set_migration_target_nodes(); } =20 return 0; @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu) return 0; =20 node_clear_state(node, N_CPU); - set_migration_target_nodes(); =20 return 0; } @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void) =20 start_shepherd_timer(); #endif - migrate_on_reclaim_init(); #ifdef CONFIG_PROC_FS proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op); proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op); --=20 2.37.1 From nobody Sat Apr 11 19:32:44 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 79513C00140 for ; Mon, 8 Aug 2022 06:27:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via 
listexpand id S236837AbiHHG1g (ORCPT ); Mon, 8 Aug 2022 02:27:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59134 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236193AbiHHG1L (ORCPT ); Mon, 8 Aug 2022 02:27:11 -0400 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3FDAF1181A for ; Sun, 7 Aug 2022 23:27:09 -0700 (PDT) Received: from pps.filterd (m0187473.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2785C1sd023380; Mon, 8 Aug 2022 06:26:56 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=Vm99fpXrEL5sGzVMPQGBhdc/VnWO9MKydLltPdUtcYg=; b=XDILkOvxonbyXAiLlwU4vqfSbdahZfaXgHm4HqnGcYLJAnpE9eSXy9SgnS+TdeiXcgOU wVTb8twNJ329TWqOwXUhNgRnYlYspoxrFG6tHe4ghfT593b/RsXnoJY791FpW/5kE7wM LZLOoWEUyI2SYC17D6aRuOIpTnsIn+3Ovywq/KVBr5JblbAn51IZWQgD8M8XRUMFiHGA 2pZNmJoFLUQIIptpmtE/LwZ1MM/nisyI+Y/N6JW0lKPhsIVn5qxa36VZ3//UtCuCU7Dc +81ULr13vygNDjF5jgmqH8qYP9zty39abdlETg927FldFrxlOOU6g6qfK4vu6uBEg7T/ 1w== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htv4g9tu0-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:56 +0000 Received: from m0187473.ppops.net (m0187473.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2786MmEW007292; Mon, 8 Aug 2022 06:26:55 GMT Received: from ppma02wdc.us.ibm.com (aa.5b.37a9.ip4.static.sl-reverse.com [169.55.91.170]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htv4g9tth-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:55 +0000 Received: from pps.filterd (ppma02wdc.us.ibm.com [127.0.0.1]) by ppma02wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP 
id 2786KbG6028642; Mon, 8 Aug 2022 06:26:54 GMT Received: from b03cxnp07029.gho.boulder.ibm.com (b03cxnp07029.gho.boulder.ibm.com [9.17.130.16]) by ppma02wdc.us.ibm.com with ESMTP id 3hsfx92v7r-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:54 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp07029.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786Qrj030933258 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:26:53 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 2DC387805E; Mon, 8 Aug 2022 06:26:53 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 6DDD478060; Mon, 8 Aug 2022 06:26:47 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 2022 06:26:47 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v13 6/9] mm/demotion: Add pg_data_t member to track node memory tier details Date: Mon, 8 Aug 2022 11:55:58 +0530 Message-Id: <20220808062601.836025-7-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-GUID: LcTcIobk-iDHfxlzSLb2Nlsrb872aeul X-Proofpoint-ORIG-GUID: 4j-rHofYnSnj17CbZr69n1LfiUvTCka3 X-Proofpoint-Virus-Version: vendor=baseguard 
engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 adultscore=0 mlxlogscore=999 bulkscore=0 malwarescore=0 spamscore=0 impostorscore=0 suspectscore=0 lowpriorityscore=0 priorityscore=1501 phishscore=0 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Also update the various helpers to use NODE_DATA()->memtier. Since the node-specific memtier can change when a NUMA node is reassigned to a different memory tier, accessing NODE_DATA()->memtier needs to happen under an RCU read lock or memory_tier_lock. Signed-off-by: Aneesh Kumar K.V --- include/linux/mmzone.h | 3 +++ mm/memory-tiers.c | 38 ++++++++++++++++++++++++++++++++------ 2 files changed, 35 insertions(+), 6 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index aab70355d64f..353812495a70 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -928,6 +928,9 @@ typedef struct pglist_data { /* Per-node vmstats */ struct per_cpu_nodestat __percpu *per_cpu_nodestats; atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS]; +#ifdef CONFIG_NUMA + struct memory_tier __rcu *memtier; +#endif } pg_data_t; =20 #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 02e514e87d5c..3778ac6a44a1 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -5,6 +5,7 @@ #include #include #include +#include #include =20 #include "internal.h" @@ -137,12 +138,18 @@ static struct memory_tier *find_create_memory_tier(st= ruct memory_dev_type *memty =20 static struct memory_tier *__node_get_memory_tier(int node) { - struct memory_dev_type *memtype; + pg_data_t *pgdat; =20 - memtype =3D node_memory_types[node]; - 
if (memtype && node_isset(node, memtype->nodes)) - return memtype->memtier; - return NULL; + pgdat =3D NODE_DATA(node); + if (!pgdat) + return NULL; + /* + * Since we hold memory_tier_lock, we can avoid + * RCU read locks when accessing the details. No + * parallel updates are possible here. + */ + return rcu_dereference_check(pgdat->memtier, + lockdep_is_held(&memory_tier_lock)); } =20 #ifdef CONFIG_MIGRATION @@ -295,6 +302,8 @@ static struct memory_tier *set_node_memory_tier(int nod= e) { struct memory_tier *memtier; struct memory_dev_type *memtype; + pg_data_t *pgdat =3D NODE_DATA(node); + =20 lockdep_assert_held_once(&memory_tier_lock); =20 @@ -307,6 +316,8 @@ static struct memory_tier *set_node_memory_tier(int nod= e) memtype =3D node_memory_types[node]; node_set(node, memtype->nodes); memtier =3D find_create_memory_tier(memtype); + if (!IS_ERR(memtier)) + rcu_assign_pointer(pgdat->memtier, memtier); return memtier; } =20 @@ -319,12 +330,25 @@ static void destroy_memory_tier(struct memory_tier *m= emtier) static bool clear_node_memory_tier(int node) { bool cleared =3D false; + pg_data_t *pgdat; struct memory_tier *memtier; =20 + pgdat =3D NODE_DATA(node); + if (!pgdat) + return false; + + /* + * Make sure that anybody looking at NODE_DATA who finds + * a valid memtier finds memory_dev_types with nodes still + * linked to the memtier. We achieve this by waiting for + * rcu read section to finish using synchronize_rcu. 
+ */ memtier =3D __node_get_memory_tier(node); if (memtier) { struct memory_dev_type *memtype; =20 + rcu_assign_pointer(pgdat->memtier, NULL); + synchronize_rcu(); memtype =3D node_memory_types[node]; node_clear(node, memtype->nodes); if (nodes_empty(memtype->nodes)) { @@ -422,8 +446,10 @@ static int __init memory_tier_init(void) panic("%s() failed to register memory tier: %ld\n", __func__, PTR_ERR(memtier)); =20 - for_each_node_state(node, N_MEMORY) + for_each_node_state(node, N_MEMORY) { __init_node_memory_type(node, default_dram_type); + rcu_assign_pointer(NODE_DATA(node)->memtier, memtier); + } =20 mutex_unlock(&memory_tier_lock); #ifdef CONFIG_MIGRATION --=20 2.37.1 From nobody Sat Apr 11 19:32:44 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 63D71C00140 for ; Mon, 8 Aug 2022 06:28:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237454AbiHHG2h (ORCPT ); Mon, 8 Aug 2022 02:28:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60316 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236243AbiHHG1l (ORCPT ); Mon, 8 Aug 2022 02:27:41 -0400 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1EF0113CC7 for ; Sun, 7 Aug 2022 23:27:26 -0700 (PDT) Received: from pps.filterd (m0098419.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2785m2e0014907; Mon, 8 Aug 2022 06:27:02 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=5cbhY/9EVQFT+FZ5lduHh+TIxdE4tvXhkR9/Vzfcm+I=; 
b=GosEhVDdne/4fSuJeGOH+HX635P5DP5eZW3boe9QbcdSfjVE3O6kLDioRYh6Jq8qGAx+ CZw/HZ/FhCuN/2EgAbspPcQ9lOOq3sQyLqXoln+J0sslXx5NNM/5ZbQXD3HY/T3lHvpL QJlov3Hf4k0MXtTKTcEfRB2jkcy6LSkj0ybwSthKVrw6ui2+TpM7buW438loyxezNWIc HFr41YMg+oaJSI3+d9beqkIzMXHY6QIAf8hxWzuSaZwfmTPkhtFG9tlfYyaE3pF0iVUu 5pObsT8kNrp4D0j4GLhxHg4XktM2Gedjn4a2sU2MYxtdS4NxKXpKT6SvbgQQzPLv2ak4 mg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3htvnj8vc6-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:27:02 +0000 Received: from m0098419.ppops.net (m0098419.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2786R1pB029194; Mon, 8 Aug 2022 06:27:01 GMT Received: from ppma01dal.us.ibm.com (83.d6.3fa9.ip4.static.sl-reverse.com [169.63.214.131]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3htvnj8vbr-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:27:01 +0000 Received: from pps.filterd (ppma01dal.us.ibm.com [127.0.0.1]) by ppma01dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 2786Lb4W017082; Mon, 8 Aug 2022 06:27:00 GMT Received: from b03cxnp07028.gho.boulder.ibm.com (b03cxnp07028.gho.boulder.ibm.com [9.17.130.15]) by ppma01dal.us.ibm.com with ESMTP id 3hsfx9dht9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:27:00 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp07028.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786QxdQ34079178 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:26:59 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 6A3697805E; Mon, 8 Aug 2022 06:26:59 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C4EC578060; Mon, 8 Aug 2022 06:26:53 
+0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 2022 06:26:53 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K . V" Subject: [PATCH v13 7/9] mm/demotion: Demote pages according to allocation fallback order Date: Mon, 8 Aug 2022 11:55:59 +0530 Message-Id: <20220808062601.836025-8-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-GUID: vor45uBoi5FBGc5QtrQkJaiOIpgrFeAO X-Proofpoint-ORIG-GUID: x6z_kVw9zfDBGetinuziU1S0SRELzVeP X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 impostorscore=0 bulkscore=0 adultscore=0 lowpriorityscore=0 spamscore=0 clxscore=1015 suspectscore=0 mlxlogscore=999 mlxscore=0 priorityscore=1501 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" From: Jagdish Gediya Currently, a higher tier node can only be demoted to selected nodes on the = next lower tier as defined by the demotion path. This strict demotion order does= not work in all use cases (e.g. 
some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space). This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node= from any lower tier, whereas the demotion order doesn't allow that currently. This patch adds support to get all the allowed demotion targets for a memory tier. demote_page_list() function is now modified to utilize this allowed n= ode mask as the fallback allocation mask. Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 12 ++++++++ mm/memory-tiers.c | 51 +++++++++++++++++++++++++++++-- mm/vmscan.c | 58 ++++++++++++++++++++++++++---------- 3 files changed, 103 insertions(+), 18 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index c8cd593fa2df..341ba8082e05 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -4,6 +4,7 @@ =20 #include #include +#include /* * Each tier cover a abstrace distance chunk size of 128 */ @@ -33,11 +34,17 @@ void init_node_memory_type(int node, struct memory_dev_= type *default_type); void clear_node_memory_type(int node, struct memory_dev_type *memtype); #ifdef CONFIG_MIGRATION int next_demotion_node(int node); +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); #else static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } + +static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *= targets) +{ + *targets =3D NODE_MASK_NONE; +} #endif =20 #else @@ -57,5 +64,10 @@ static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } + +static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *= targets) +{ + *targets =3D NODE_MASK_NONE; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git 
a/mm/memory-tiers.c b/mm/memory-tiers.c index 3778ac6a44a1..925d7168e825 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -5,7 +5,6 @@ #include #include #include -#include #include =20 #include "internal.h" @@ -21,6 +20,8 @@ struct memory_tier { * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE */ int adistance_start; + /* All the nodes that are part of all the lower memory tiers. */ + nodemask_t lower_tier_mask; }; =20 struct demotion_nodes { @@ -153,6 +154,24 @@ static struct memory_tier *__node_get_memory_tier(int = node) } =20 #ifdef CONFIG_MIGRATION +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +{ + struct memory_tier *memtier; + + /* + * pg_data_t.memtier updates include a synchronize_rcu() + * which ensures that we either find NULL or a valid memtier + * in NODE_DATA. Protect the access via rcu_read_lock(). + */ + rcu_read_lock(); + memtier =3D rcu_dereference(pgdat->memtier); + if (memtier) + *targets =3D memtier->lower_tier_mask; + else + *targets =3D NODE_MASK_NONE; + rcu_read_unlock(); +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -200,10 +219,19 @@ int next_demotion_node(int node) =20 static void disable_all_demotion_targets(void) { + struct memory_tier *memtier; int node; =20 - for_each_node_state(node, N_MEMORY) + for_each_node_state(node, N_MEMORY) { node_demotion[node].preferred =3D NODE_MASK_NONE; + /* + * We are holding memory_tier_lock, it is safe + * to access pgdat->memtier. + */ + memtier =3D __node_get_memory_tier(node); + if (memtier) + memtier->lower_tier_mask =3D NODE_MASK_NONE; + } /* * Ensure that the "disable" is visible across the system. 
* Readers will see either a combination of before+disable @@ -235,7 +263,7 @@ static void establish_demotion_targets(void) struct demotion_nodes *nd; int target =3D NUMA_NO_NODE, node; int distance, best_distance; - nodemask_t tier_nodes; + nodemask_t tier_nodes, lower_tier; =20 lockdep_assert_held_once(&memory_tier_lock); =20 @@ -283,6 +311,23 @@ static void establish_demotion_targets(void) } } while (1); } + /* + * Now build the lower_tier mask for each node collecting node mask from + * all memory tier below it. This allows us to fallback demotion page + * allocation to a set of nodes that is closer to the above selected + * preferred node. + */ + lower_tier =3D node_states[N_MEMORY]; + list_for_each_entry(memtier, &memory_tiers, list) { + /* + * Keep removing current tier from lower_tier nodes, + * This will remove all nodes in current and above + * memory tier from the lower_tier mask. + */ + tier_nodes =3D get_memtier_nodemask(memtier); + nodes_andnot(lower_tier, lower_tier, tier_nodes); + memtier->lower_tier_mask =3D lower_tier; + } } =20 #else diff --git a/mm/vmscan.c b/mm/vmscan.c index 5043b10ff71e..74b4ee8eca2b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1460,21 +1460,34 @@ static void folio_check_dirty_writeback(struct foli= o *folio, mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); } =20 -static struct page *alloc_demote_page(struct page *page, unsigned long nod= e) +static struct page *alloc_demote_page(struct page *page, unsigned long pri= vate) { - struct migration_target_control mtc =3D { - /* - * Allocate from 'node', or fail quickly and quietly. - * When this happens, 'page' will likely just be discarded - * instead of migrated. 
- */ - .gfp_mask =3D (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | - __GFP_THISNODE | __GFP_NOWARN | - __GFP_NOMEMALLOC | GFP_NOWAIT, - .nid =3D node - }; + struct page *target_page; + nodemask_t *allowed_mask; + struct migration_target_control *mtc; + + mtc =3D (struct migration_target_control *)private; + + allowed_mask =3D mtc->nmask; + /* + * make sure we allocate from the target node first also trying to + * demote or reclaim pages from the target node via kswapd if we are + * low on free memory on target node. If we don't do this and if + * we have free memory on the slower(lower) memtier, we would start + * allocating pages from slower(lower) memory tiers without even forcing + * a demotion of cold pages from the target memtier. This can result + * in the kernel placing hot pages in slower(lower) memory tiers. + */ + mtc->nmask =3D NULL; + mtc->gfp_mask |=3D __GFP_THISNODE; + target_page =3D alloc_migration_target(page, (unsigned long)mtc); + if (target_page) + return target_page; =20 - return alloc_migration_target(page, (unsigned long)&mtc); + mtc->gfp_mask &=3D ~__GFP_THISNODE; + mtc->nmask =3D allowed_mask; + + return alloc_migration_target(page, (unsigned long)mtc); } =20 /* @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_hea= d *demote_pages, { int target_nid =3D next_demotion_node(pgdat->node_id); unsigned int nr_succeeded; + nodemask_t allowed_mask; + + struct migration_target_control mtc =3D { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated. 
+ */ + .gfp_mask =3D (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | + __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid =3D target_nid, + .nmask =3D &allowed_mask + }; =20 if (list_empty(demote_pages)) return 0; @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_he= ad *demote_pages, if (target_nid =3D=3D NUMA_NO_NODE) return 0; =20 + node_get_allowed_targets(pgdat, &allowed_mask); + /* Demotion ignores all cpuset and mempolicy settings */ migrate_pages(demote_pages, alloc_demote_page, NULL, - target_nid, MIGRATE_ASYNC, MR_DEMOTION, - &nr_succeeded); + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, + &nr_succeeded); =20 if (current_is_kswapd()) __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded); --=20 2.37.1 From nobody Sat Apr 11 19:32:44 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 67C5CC00140 for ; Mon, 8 Aug 2022 06:28:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235460AbiHHG21 (ORCPT ); Mon, 8 Aug 2022 02:28:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60258 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237003AbiHHG1k (ORCPT ); Mon, 8 Aug 2022 02:27:40 -0400 Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3C2CE12AFE for ; Sun, 7 Aug 2022 23:27:23 -0700 (PDT) Received: from pps.filterd (m0098421.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2786Ibtv024873; Mon, 8 Aug 2022 06:27:08 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; 
bh=5Ivlwtm84y0fjlliikwroD0Xe5LIvxUwtUwd3yCzT3A=; b=X2NDG5J5pKqmOxfNnmajFO6M3KKuleapa+dwdd0C/Bmbg3TyRMeShvaGdgdZYHli+qjr Zh6PFvrm1YYfPHi8XZpfTHTQGQUg29vdxoi4d1Rx+dh8kKhSYFB7ZVOK+LWNtv4i+9RG 6ofS8NkuHAjT0QGYmqOjG1d6qD34wtC2VBpg/m49/uL3YdQ3h60AthCgx0eAGTWNhrF0 BcffDyY/kx7zfmMDQ9HocyQO8fX49OyGK4KyhZUQtSY+z6dYpiRAVBky22/VH9loSDIt +TrA0+MiPTjbQjD5p0XxXmQGfqwxsgynmGV6uI1C6ASb+StpgAME3NJTn1Xb680Orvac CA== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htw3x84rt-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:27:08 +0000 Received: from m0098421.ppops.net (m0098421.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2786Iikl025166; Mon, 8 Aug 2022 06:27:07 GMT Received: from ppma04wdc.us.ibm.com (1a.90.2fa9.ip4.static.sl-reverse.com [169.47.144.26]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htw3x84r7-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:27:07 +0000 Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1]) by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 2786KuWa002623; Mon, 8 Aug 2022 06:27:06 GMT Received: from b03cxnp08028.gho.boulder.ibm.com (b03cxnp08028.gho.boulder.ibm.com [9.17.130.20]) by ppma04wdc.us.ibm.com with ESMTP id 3hsfx92tu1-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:27:06 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp08028.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786R5AE21037364 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:27:05 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id BF06478060; Mon, 8 Aug 2022 06:27:05 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) 
with ESMTP id 03B727805F; Mon, 8 Aug 2022 06:27:00 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 2022 06:26:59 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v13 8/9] mm/demotion: Update node_is_toptier to work with memory tiers Date: Mon, 8 Aug 2022 11:56:00 +0530 Message-Id: <20220808062601.836025-9-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-GUID: 2Lucb8XSH4RWQRL_Py-rf04tptpZ3AZq X-Proofpoint-ORIG-GUID: JkGhR_AmTs5nXfXOMFgbsQz37Z91AcUh X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 suspectscore=0 adultscore=0 spamscore=0 phishscore=0 impostorscore=0 mlxlogscore=999 priorityscore=1501 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" With memory tiers support we can have memory only NUMA nodes in the top tier from which we want to avoid promotion tracking NUMA faults. Update node_is_toptier to work with memory tiers. All NUMA nodes are by default top tier nodes. 
With lower memory tiers added, we consider all memory tiers above a memory tier having CPU NUMA nodes as a top memory tier. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 11 +++++++++ include/linux/node.h | 5 ---- mm/huge_memory.c | 1 + mm/memory-tiers.c | 46 ++++++++++++++++++++++++++++++++++++ mm/migrate.c | 1 + mm/mprotect.c | 1 + 6 files changed, 60 insertions(+), 5 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 341ba8082e05..0bdd5955a5e2 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -35,6 +35,7 @@ void clear_node_memory_type(int node, struct memory_dev_t= ype *memtype); #ifdef CONFIG_MIGRATION int next_demotion_node(int node); void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); +bool node_is_toptier(int node); #else static inline int next_demotion_node(int node) { @@ -45,6 +46,11 @@ static inline void node_get_allowed_targets(pg_data_t *p= gdat, nodemask_t *target { *targets =3D NODE_MASK_NONE; } + +static inline bool node_is_toptier(int node) +{ + return true; +} #endif =20 #else @@ -69,5 +75,10 @@ static inline void node_get_allowed_targets(pg_data_t *p= gdat, nodemask_t *target { *targets =3D NODE_MASK_NONE; } + +static inline bool node_is_toptier(int node) +{ + return true; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/include/linux/node.h b/include/linux/node.h index 40d641a8bfb0..9ec680dd607f 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_re= gistration_func_t reg, =20 #define to_node(device) container_of(device, struct node, dev) =20 -static inline bool node_is_toptier(int node) -{ - return node_state(node, N_CPU); -} - #endif /* _LINUX_NODE_H_ */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 15965084816d..524498061e7c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -36,6 +36,7 @@ #include #include #include 
+#include <linux/memory-tiers.h>
 
 #include
 #include
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 925d7168e825..ea5c04f62170 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -33,6 +33,7 @@ static LIST_HEAD(memory_tiers);
 static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
 static struct memory_dev_type *default_dram_type;
 #ifdef CONFIG_MIGRATION
+static int top_tier_adistance;
 /*
  * node_demotion[] examples:
  *
@@ -154,6 +155,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }
 
 #ifdef CONFIG_MIGRATION
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->adistance_start < top_tier_adistance)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
 	struct memory_tier *memtier;
@@ -311,6 +337,26 @@ static void establish_demotion_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Promotion is allowed from a memory tier to higher
+	 * memory tier only if the memory tier doesn't include
+	 * compute. We want to skip promotion from a memory tier,
+	 * if any node that is part of the memory tier has CPUs.
+	 * Once we detect such a memory tier, we consider that tier
+	 * as top tier from which promotion is not allowed.
+	 */
+	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
+		tier_nodes = get_memtier_nodemask(memtier);
+		nodes_and(tier_nodes, node_states[N_CPU], tier_nodes);
+		if (!nodes_empty(tier_nodes)) {
+			/*
+			 * abstract distance below the max value of this memtier
+			 * is considered toptier.
+			 */
+			top_tier_adistance = memtier->adistance_start + MEMTIER_CHUNK_SIZE;
+			break;
+		}
+	}
 	/*
 	 * Now build the lower_tier mask for each node collecting node mask from
 	 * all memory tier below it. This allows us to fallback demotion page
diff --git a/mm/migrate.c b/mm/migrate.c
index 45290ddd3806..e7f3f52596c1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include <linux/memory-tiers.h>
 
 #include
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include <linux/memory-tiers.h>
 #include
 #include
 #include
-- 
2.37.1

From nobody Sat Apr 11 19:32:44 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 809CFC00140 for ; Mon, 8 Aug 2022 06:28:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237304AbiHHG2c (ORCPT ); Mon, 8 Aug 2022 02:28:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60304 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237212AbiHHG1l (ORCPT ); Mon, 8 Aug 2022 02:27:41 -0400 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 05E2D13CC6 for ; Sun, 7 Aug 2022 23:27:25 -0700 (PDT) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2786C77u002741; Mon, 8 Aug 2022 06:27:15 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=Vq7NtUTGJv/Btzc/7iihSXrYrrCuEc0Ypk+wO/4fJgU=; b=soa81K6XdewrUquilSEzClnmi/ZW8Kdv6mDGIpoNcg+UtQsxSSaOP7omH61HiP4L1bGw +yKLWlrRBdK9GaGWhv3AXCivi1Y1Jz7u5QTO5DKVEXtrJNUnAO99Qxwfv3D6xWTnNfcG Wrpl6pT9HjOf5i24mYT8fqstIgbN4o6Jk+cCsBO35r6RisBnJTN+9P8j7QpZTzMs83VC k0VkYdyM8KSgAwMiZ71LA+VUn0+ChjYcaKtp8uYh/OYBonvC83Y0tQM9itBqAtufi37Z
zctPMurlfQCT+cen8QCVgHgjs0SwsU48VVys0mjIU5J9gw398NGz5ApNFPimIlYt1HgK 9g== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htw0mrbeh-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:27:15 +0000 Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2786Dqh5012289; Mon, 8 Aug 2022 06:27:14 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3htw0mrbdv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:27:14 +0000 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 2786KJNs015604; Mon, 8 Aug 2022 06:27:13 GMT Received: from b03cxnp08028.gho.boulder.ibm.com (b03cxnp08028.gho.boulder.ibm.com [9.17.130.20]) by ppma03dal.us.ibm.com with ESMTP id 3hsfx9wh2x-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:27:13 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp08028.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786RCoB36372986 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:27:12 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 3802C7805E; Mon, 8 Aug 2022 06:27:12 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 6050C7805F; Mon, 8 Aug 2022 06:27:06 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 2022 06:27:06 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , 
Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v13 9/9] lib/nodemask: Optimize node_random for nodemask with single NUMA node Date: Mon, 8 Aug 2022 11:56:01 +0530 Message-Id: <20220808062601.836025-10-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-GUID: t1KZaO_Yul7vT8aqZ_tnU9bCL67z8lYx X-Proofpoint-ORIG-GUID: qk0acrlX4inmpnQSAoP_SET7g4Q8bi1Y X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 malwarescore=0 adultscore=0 priorityscore=1501 phishscore=0 suspectscore=0 bulkscore=0 mlxlogscore=999 lowpriorityscore=0 impostorscore=0 clxscore=1015 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

The most common case for certain node_random usage (demotion nodemask) is with nodemask weight 1. We can avoid calling get_random_int() in that case and always return the only node set in the nodemask.
Signed-off-by: Aneesh Kumar K.V
---
 lib/nodemask.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/lib/nodemask.c b/lib/nodemask.c
index e22647f5181b..c91a6b0404a5 100644
--- a/lib/nodemask.c
+++ b/lib/nodemask.c
@@ -20,12 +20,21 @@ EXPORT_SYMBOL(__next_node_in);
  */
 int node_random(const nodemask_t *maskp)
 {
-	int w, bit = NUMA_NO_NODE;
+	int w, bit;
 
 	w = nodes_weight(*maskp);
-	if (w)
+	switch (w) {
+	case 0:
+		bit = NUMA_NO_NODE;
+		break;
+	case 1:
+		bit = __first_node(maskp);
+		break;
+	default:
 		bit = bitmap_ord_to_pos(maskp->bits,
-			get_random_int() % w, MAX_NUMNODES);
+				get_random_int() % w, MAX_NUMNODES);
+		break;
+	}
 	return bit;
 }
 #endif
-- 
2.37.1
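
For readers who want to see the weight-based dispatch of the node_random change in isolation, here is a minimal userspace sketch of the same idea. It is not the kernel code: a uint64_t stands in for nodemask_t (so at most 64 nodes), rand() stands in for get_random_int(), and NO_NODE, mask_weight, mask_first, mask_ord_to_pos and node_random_sketch are hypothetical names made up for this illustration.

```c
#include <stdint.h>
#include <stdlib.h>

#define NO_NODE (-1)	/* assumed stand-in for NUMA_NO_NODE */

/* Number of set bits in the mask (stand-in for nodes_weight()). */
static int mask_weight(uint64_t mask)
{
	int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Index of the lowest set bit, or NO_NODE for an empty mask. */
static int mask_first(uint64_t mask)
{
	int bit = 0;

	if (!mask)
		return NO_NODE;
	while (!(mask & 1)) {
		mask >>= 1;
		bit++;
	}
	return bit;
}

/* Index of the ord-th set bit (0-based), or NO_NODE if there is none. */
static int mask_ord_to_pos(uint64_t mask, int ord)
{
	int bit;

	for (bit = 0; bit < 64; bit++) {
		if ((mask & ((uint64_t)1 << bit)) && ord-- == 0)
			return bit;
	}
	return NO_NODE;
}

/* Pick a node from the mask; the random source is only used when w > 1. */
static int node_random_sketch(uint64_t mask)
{
	int w = mask_weight(mask);

	switch (w) {
	case 0:
		return NO_NODE;
	case 1:
		return mask_first(mask);
	default:
		return mask_ord_to_pos(mask, rand() % w);
	}
}
```

With a single-bit mask such as (uint64_t)1 << 5 the sketch returns 5 without touching rand(), which mirrors the case the patch describes as the most common one for demotion nodemasks.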