From nobody Sun Feb  8 01:50:38 2026
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com, Bharata B Rao, "Aneesh Kumar K.V"
Subject: [PATCH v15 01/10] mm/demotion: Add support for explicit memory tiers
Date: Thu, 18 Aug 2022 18:40:33 +0530
Message-Id: <20220818131042.113280-2-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.2
In-Reply-To:
<20220818131042.113280-1-aneesh.kumar@linux.ibm.com>
References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

In the current kernel, memory tiers are defined implicitly via a demotion path
relationship between NUMA nodes, which is created during kernel initialization
and updated when a NUMA node is hot-added or hot-removed. The current
implementation puts all nodes with a CPU into the highest tier, and builds the
tier hierarchy by establishing the per-node demotion targets based on the
distances between nodes.

This current memory tier kernel implementation needs to be improved for
several important use cases:

The current tier initialization code always initializes each memory-only NUMA
node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g. a DRAM-backed memory-only node on a virtual
machine) that should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But on a
system with HBM or GPU devices, the memory-only NUMA nodes mapping these
devices should be in the top tier, and DRAM nodes with CPUs are better placed
into the next lower tier.
With the current kernel, a higher tier node can only be demoted to nodes with
the shortest distance on the next lower tier, as defined by the demotion path,
not to any other node from any lower tier. This strict demotion order does not
work in all use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback when the
preferred demotion node is out of space). This demotion order is also
inconsistent with the page allocation fallback order when all the nodes in a
higher tier are out of space: the page allocation can fall back to any node
from any lower tier, whereas the demotion order doesn't allow that.

This patch series addresses the above by defining memory tiers explicitly.

The Linux kernel presents memory devices as NUMA nodes, and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract distance.
This allows for classifying memory devices within a specific performance range
into a memory tier.

This patch configures the range/chunk size to be 128. The default DRAM tier
starts at abstract distance 512. We can have 4 memory tiers below the default
DRAM tier, with abstract distance ranges 0 - 127, 128 - 255, 256 - 383 and
384 - 511. Faster memory devices can be placed in these faster (higher) memory
tiers. Slower memory devices like persistent memory will have an abstract
distance higher than the default DRAM level.
Reviewed-by: "Huang, Ying"
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  18 +++++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 129 +++++++++++++++++++++++++++++++++++
 3 files changed, 148 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..5206568f34ed
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+/*
+ * Each tier covers an abstract distance chunk size of 128.
+ */
+#define MEMTIER_CHUNK_BITS	7
+#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
+/*
+ * Smaller abstract distance values imply faster (higher) memory tiers. Offset
+ * the DRAM adistance so that we can accommodate devices with a slightly lower
+ * adistance value (slightly slower) than default DRAM adistance to be part of
+ * the same memory tier.
+ */
+#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
+
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..488f604e77e0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST) += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..1f494e69776a
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	/* hierarchy of memory tiers */
+	struct list_head list;
+	/* list of all memory types part of this tier */
+	struct list_head memory_types;
+	/*
+	 * start value of abstract distance. memory tier maps
+	 * an abstract distance range,
+	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
+	 */
+	int adistance_start;
+};
+
+struct memory_dev_type {
+	/* list of memory types that are part of same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct memory_tier *memtier;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
+/*
+ * For now we can have 4 faster memory tiers with smaller adistance
+ * than default DRAM tier.
+ */
+static struct memory_dev_type default_dram_type = {
+	.adistance = MEMTIER_ADISTANCE_DRAM,
+	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
+};
+
+static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
+{
+	bool found_slot = false;
+	struct memory_tier *memtier, *new_memtier;
+	int adistance = memtype->adistance;
+	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	/*
+	 * If the memtype is already part of a memory tier,
+	 * just return that.
+	 */
+	if (memtype->memtier)
+		return memtype->memtier;
+
+	adistance = round_down(adistance, memtier_adistance_chunk_size);
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (adistance == memtier->adistance_start) {
+			memtype->memtier = memtier;
+			list_add(&memtype->tier_sibiling, &memtier->memory_types);
+			return memtier;
+		} else if (adistance < memtier->adistance_start) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kmalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->adistance_start = adistance;
+	INIT_LIST_HEAD(&new_memtier->list);
+	INIT_LIST_HEAD(&new_memtier->memory_types);
+	if (found_slot)
+		list_add_tail(&new_memtier->list, &memtier->list);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	memtype->memtier = new_memtier;
+	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
+	return new_memtier;
+}
+
+static struct memory_tier *set_node_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+	struct memory_dev_type *memtype;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	if (!node_state(node, N_MEMORY))
+		return ERR_PTR(-EINVAL);
+
+	if (!node_memory_types[node])
+		node_memory_types[node] = &default_dram_type;
+
+	memtype = node_memory_types[node];
+	node_set(node, memtype->nodes);
+	memtier = find_create_memory_tier(memtype);
+	return memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	int node;
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+	/*
+	 * Look at all the existing N_MEMORY nodes and add them to
+	 * default memory tier or to a tier if we already have memory
+	 * types assigned.
+	 */
+	for_each_node_state(node, N_MEMORY) {
+		memtier = set_node_memory_tier(node);
+		if (IS_ERR(memtier))
+			/*
+			 * Continue with memtiers we are able to setup
+			 */
+			break;
+	}
+	mutex_unlock(&memory_tier_lock);
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);
-- 
2.37.2