From nobody Sat Apr 11 20:56:14 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3DD26C00140 for ; Mon, 8 Aug 2022 06:26:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234810AbiHHG05 (ORCPT ); Mon, 8 Aug 2022 02:26:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58738 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235075AbiHHG0t (ORCPT ); Mon, 8 Aug 2022 02:26:49 -0400 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C48C363B1 for ; Sun, 7 Aug 2022 23:26:44 -0700 (PDT) Received: from pps.filterd (m0098419.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 2785mA3g019530; Mon, 8 Aug 2022 06:26:23 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=k42aPm0jJa6aK7tVGRyiVD/jZ+RnS4rG4X5oES7T23Y=; b=N9wD/wRVMBkzOyUlZRZyqpV2tXqYzO0uOx96xwL+duR6aFL91iPSgM9QszBeanynYXeA 5YWJk6Wd5yreSWolr9VUYEFejK/UobwH3ajwK2oT4HjHGwKa0HPJVE+TVRt6YncoS5CP N6iupC6su/kALrMJwofpyeKVBGmX7lnRUMEW7V3TqMhPVywHzDIkUrXJMxBtOrPZWL8m CQuDoZNhNlY8f/ezAatPi+6EPfHwglVci3ENO++4dkoBWqmebY36NxsT2b7txXZ9gDXh eLxe7wLre/f6XaQhad+gFV8bz2Wl8F5Kviiv69WxRLHn+CLfuHFjeDZKYOudgw7HKQLB HQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3htvnj8v15-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:22 +0000 Received: from m0098419.ppops.net (m0098419.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2786QMWH027527; Mon, 8 Aug 2022 06:26:22 GMT Received: from ppma01wdc.us.ibm.com (fd.55.37a9.ip4.static.sl-reverse.com [169.55.85.253]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3htvnj8v0w-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:22 +0000 Received: from pps.filterd (ppma01wdc.us.ibm.com [127.0.0.1]) by ppma01wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 2786LJCq005549; Mon, 8 Aug 2022 06:26:21 GMT Received: from b03cxnp07027.gho.boulder.ibm.com (b03cxnp07027.gho.boulder.ibm.com [9.17.130.14]) by ppma01wdc.us.ibm.com with ESMTP id 3hsfx92ud4-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Aug 2022 06:26:21 +0000 Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp07027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 2786QKeM31064514 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Aug 2022 06:26:20 GMT Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A21C478064; Mon, 8 Aug 2022 06:26:20 +0000 (GMT) Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id B79337805E; Mon, 8 Aug 2022 06:26:14 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.19.76]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 8 Aug 2022 06:26:14 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v13 1/9] mm/demotion: Add support for explicit memory tiers Date: Mon, 8 Aug 2022 11:55:53 +0530 Message-Id: <20220808062601.836025-2-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-GUID: WqncfFwssIDCdVqaZcdNSLzaCzhajT1n X-Proofpoint-ORIG-GUID: vKSGt2pVpfnMD2333Es8mImpDQ8lNvD5 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-08_03,2022-08-05_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 impostorscore=0 bulkscore=0 adultscore=0 lowpriorityscore=0 spamscore=0 clxscore=1015 suspectscore=0 mlxlogscore=999 mlxscore=0 priorityscore=1501 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2208080031 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" In the current kernel, memory tiers are defined implicitly via a demotion p= ath relationship between NUMA nodes, which is created during the kernel initialization and updated when a NUMA node is hot-added or hot-removed. The current implementation puts all nodes with CPU into the highest tier, and b= uilds the tier hierarchy by establishing the per-node demotion targets based on t= he distances between nodes. This current memory tier kernel implementation needs to be improved for sev= eral important use cases, The current tier initialization code always initializes each memory-only NU= MA node into a lower tier. But a memory-only NUMA node may have a high perform= ance memory device (e.g. a DRAM-backed memory-only node on a virtual machine) th= at should be put into a higher tier. The current tier hierarchy always puts CPU nodes into the top tier. But on a system with HBM or GPU devices, the memory-only NUMA nodes mapping these de= vices should be in the top tier, and DRAM nodes with CPUs are better to be placed= into the next lower tier. With current kernel higher tier node can only be demoted to nodes with shor= test distance on the next lower tier as defined by the demotion path, not any ot= her node from any lower tier. This strict, demotion order does not work in all = use cases (e.g. some use cases may want to allow cross-socket demotion to anoth= er node in the same demotion tier as a fallback when the preferred demotion no= de is out of space), This demotion order is also inconsistent with the page alloc= ation fallback order when all the nodes in a higher tier are out of space: The pa= ge allocation can fall back to any node from any lower tier, whereas the demot= ion order doesn't allow that. This patch series address the above by defining memory tiers explicitly. Linux kernel presents memory devices as NUMA nodes and each memory device i= s of a specific type. The memory type of a device is represented by its abstract distance. A memory tier corresponds to a range of abstract distance. This a= llows for classifying memory devices with a specific performance range into a mem= ory tier. This patch configures the range/chunk size to be 128. The default DRAM abst= ract distance is 512. We can have 4 memory tiers below the default DRAM with abs= tract distance range 0 - 127, 127 - 255, 256- 383, 384 - 511. Faster memory devic= es can be placed in these faster(higher) memory tiers. Slower memory devices l= ike persistent memory will have abstract distance higher than the default DRAM level. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 15 +++++ mm/Makefile | 1 + mm/memory-tiers.c | 107 +++++++++++++++++++++++++++++++++++ 3 files changed, 123 insertions(+) create mode 100644 include/linux/memory-tiers.h create mode 100644 mm/memory-tiers.c diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h new file mode 100644 index 000000000000..bc7c1b799bef --- /dev/null +++ b/include/linux/memory-tiers.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_MEMORY_TIERS_H +#define _LINUX_MEMORY_TIERS_H + +/* + * Each tier cover a abstrace distance chunk size of 128 + */ +#define MEMTIER_CHUNK_BITS 7 +#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS) +/* + * Smaller abstract distance value imply faster(higher) memory tiers. + */ +#define MEMTIER_ADISTANCE_DRAM (4 * MEMTIER_CHUNK_SIZE) + +#endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/mm/Makefile b/mm/Makefile index 6f9ffa968a1a..d30acebc2164 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) +=3D kfence/ obj-$(CONFIG_FAILSLAB) +=3D failslab.o obj-$(CONFIG_MEMTEST) +=3D memtest.o obj-$(CONFIG_MIGRATION) +=3D migrate.o +obj-$(CONFIG_NUMA) +=3D memory-tiers.o obj-$(CONFIG_DEVICE_MIGRATION) +=3D migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) +=3D huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) +=3D page_counter.o diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c new file mode 100644 index 000000000000..78b311d9bde9 --- /dev/null +++ b/mm/memory-tiers.c @@ -0,0 +1,107 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include + +struct memory_tier { + /* hierarchy of memory tiers */ + struct list_head list; + /* list of all memory types part of this tier */ + struct list_head memory_types; + /* + * start value of abstract distance. memory tier maps + * an abstract distance range, + * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE + */ + int adistance_start; +}; + +struct memory_dev_type { + /* list of memory types that are part of same tier as this type */ + struct list_head tier_sibiling; + /* abstract distance for this specific memory type */ + int adistance; + /* Nodes of same abstract distance */ + nodemask_t nodes; + struct memory_tier *memtier; +}; + +static DEFINE_MUTEX(memory_tier_lock); +static LIST_HEAD(memory_tiers); +static struct memory_dev_type *node_memory_types[MAX_NUMNODES]; +/* + * For now let's have 4 memory tier below default DRAM tier. + */ +static struct memory_dev_type default_dram_type =3D { + .adistance =3D MEMTIER_ADISTANCE_DRAM, + .tier_sibiling =3D LIST_HEAD_INIT(default_dram_type.tier_sibiling), +}; + +static struct memory_tier *find_create_memory_tier(struct memory_dev_type = *memtype) +{ + bool found_slot =3D false; + struct memory_tier *memtier, *new_memtier; + int adistance =3D memtype->adistance; + unsigned int memtier_adistance_chunk_size =3D MEMTIER_CHUNK_SIZE; + + lockdep_assert_held_once(&memory_tier_lock); + + /* + * If the memtype is already part of a memory tier, + * just return that. + */ + if (memtype->memtier) + return memtype->memtier; + + adistance =3D round_down(adistance, memtier_adistance_chunk_size); + list_for_each_entry(memtier, &memory_tiers, list) { + if (adistance =3D=3D memtier->adistance_start) { + memtype->memtier =3D memtier; + list_add(&memtype->tier_sibiling, &memtier->memory_types); + return memtier; + } else if (adistance < memtier->adistance_start) { + found_slot =3D true; + break; + } + } + + new_memtier =3D kmalloc(sizeof(struct memory_tier), GFP_KERNEL); + if (!new_memtier) + return ERR_PTR(-ENOMEM); + + new_memtier->adistance_start =3D adistance; + INIT_LIST_HEAD(&new_memtier->list); + INIT_LIST_HEAD(&new_memtier->memory_types); + if (found_slot) + list_add_tail(&new_memtier->list, &memtier->list); + else + list_add_tail(&new_memtier->list, &memory_tiers); + memtype->memtier =3D new_memtier; + list_add(&memtype->tier_sibiling, &new_memtier->memory_types); + return new_memtier; +} + +static int __init memory_tier_init(void) +{ + int node; + struct memory_tier *memtier; + + mutex_lock(&memory_tier_lock); + /* CPU only nodes are not part of memory tiers. */ + default_dram_type.nodes =3D node_states[N_MEMORY]; + + memtier =3D find_create_memory_tier(&default_dram_type); + if (IS_ERR(memtier)) + panic("%s() failed to register memory tier: %ld\n", + __func__, PTR_ERR(memtier)); + + for_each_node_state(node, N_MEMORY) + node_memory_types[node] =3D &default_dram_type; + + mutex_unlock(&memory_tier_lock); + + return 0; +} +subsys_initcall(memory_tier_init); --=20 2.37.1