From nobody Sat Apr 11 21:01:40 2026
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr
 Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com, "Aneesh Kumar K . V"
Subject: [PATCH v13 7/9] mm/demotion: Demote pages according to allocation fallback order
Date: Mon, 8 Aug 2022 11:55:59 +0530
Message-Id: <20220808062601.836025-8-aneesh.kumar@linux.ibm.com>
In-Reply-To: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com>
References: <20220808062601.836025-1-aneesh.kumar@linux.ibm.com>

From: Jagdish Gediya

Currently, a higher tier node can only be demoted to selected nodes on the
next lower tier, as defined by the demotion path. This strict demotion order
does not work in all use cases (e.g. some use cases may want to allow
cross-socket demotion to another node in the same demotion tier as a
fallback when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order when all
the nodes in a higher tier are out of space: the page allocation can fall
back to any node from any lower tier, whereas the demotion order doesn't
currently allow that.

This patch adds support for fetching all the allowed demotion targets of a
memory tier. demote_page_list() is modified to use this allowed node mask
as the fallback allocation mask.
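As an illustration (not part of the patch; the tier layout and node
numbers below are hypothetical), here is a minimal userspace sketch of
how each tier's lower_tier_mask falls out of the subtraction loop this
patch adds to establish_demotion_targets():

  /*
   * Userspace model of the lower_tier_mask computation -- illustration
   * only, not kernel code. Each tier's mask of allowed demotion targets
   * is obtained by repeatedly subtracting the current tier's nodes from
   * the set of all memory nodes, mirroring the nodes_andnot() loop.
   */
  #include <stdio.h>

  int main(void)
  {
  	/* One bit per NUMA node; assume nodes 0-5 have memory. */
  	unsigned int tier_nodes[3] = {
  		0x03,	/* tier 0 (fastest): nodes 0-1 */
  		0x0c,	/* tier 1:           nodes 2-3 */
  		0x30,	/* tier 2 (slowest): nodes 4-5 */
  	};
  	unsigned int lower_tier = 0x3f;	/* all N_MEMORY nodes */
  	int i;

  	/* Walk tiers from fastest to slowest, as the memory_tiers list is ordered. */
  	for (i = 0; i < 3; i++) {
  		/* Drop this tier's nodes; what remains are strictly lower tiers. */
  		lower_tier &= ~tier_nodes[i];
  		printf("tier %d: lower_tier_mask = 0x%02x\n", i, lower_tier);
  	}
  	/*
  	 * Prints:
  	 *   tier 0: lower_tier_mask = 0x3c  (may demote to nodes 2-5)
  	 *   tier 1: lower_tier_mask = 0x30  (may demote to nodes 4-5)
  	 *   tier 2: lower_tier_mask = 0x00  (lowest tier: no demotion)
  	 */
  	return 0;
  }

With these masks, alloc_demote_page() can first try the preferred
target node with __GFP_THISNODE and, only if that fails, retry with the
tier's lower_tier_mask as the allocation nodemask.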
Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h | 12 ++++++++
 mm/memory-tiers.c            | 51 +++++++++++++++++++++++++++++--
 mm/vmscan.c                  | 58 ++++++++++++++++++++++++++----------
 3 files changed, 103 insertions(+), 18 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index c8cd593fa2df..341ba8082e05 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/nodemask.h>
+#include <linux/mmzone.h>
 /*
  * Each tier cover a abstrace distance chunk size of 128
  */
@@ -33,11 +34,17 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 #else
 static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif
 
 #else
@@ -57,5 +64,10 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 3778ac6a44a1..925d7168e825 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,7 +5,6 @@ #include
 #include
 #include
-#include <linux/mmzone.h>
 #include <linux/memory-tiers.h>
 
 #include "internal.h"
 
@@ -21,6 +20,8 @@ struct memory_tier {
 	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
 	 */
 	int adistance_start;
+	/* All the nodes that are part of all the lower memory tiers. */
+	nodemask_t lower_tier_mask;
 };
 
 struct demotion_nodes {
@@ -153,6 +154,24 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }
 
 #ifdef CONFIG_MIGRATION
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * pg_data_t.memtier updates include a synchronize_rcu(),
+	 * which ensures that we either find NULL or a valid memtier
+	 * in NODE_DATA. Protect the access via rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (memtier)
+		*targets = memtier->lower_tier_mask;
+	else
+		*targets = NODE_MASK_NONE;
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -200,10 +219,19 @@ int next_demotion_node(int node)
 
 static void disable_all_demotion_targets(void)
 {
+	struct memory_tier *memtier;
 	int node;
 
-	for_each_node_state(node, N_MEMORY)
+	for_each_node_state(node, N_MEMORY) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		/*
+		 * We are holding memory_tier_lock; it is safe
+		 * to access pgdat->memtier.
+		 */
+		memtier = __node_get_memory_tier(node);
+		if (memtier)
+			memtier->lower_tier_mask = NODE_MASK_NONE;
+	}
 	/*
 	 * Ensure that the "disable" is visible across the system.
 	 * Readers will see either a combination of before+disable
@@ -235,7 +263,7 @@ static void establish_demotion_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t tier_nodes;
+	nodemask_t tier_nodes, lower_tier;
 
 	lockdep_assert_held_once(&memory_tier_lock);
 
@@ -283,6 +311,23 @@ static void establish_demotion_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Now build the lower_tier mask for each node, collecting the node
+	 * mask from all memory tiers below it. This allows us to fall back
+	 * demotion page allocation to a set of nodes that is closer to the
+	 * above selected preferred node.
+	 */
+	lower_tier = node_states[N_MEMORY];
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing the current tier from the lower_tier nodes;
+		 * this removes all nodes in the current and above memory
+		 * tiers from the lower_tier mask.
+		 */
+		tier_nodes = get_memtier_nodemask(memtier);
+		nodes_andnot(lower_tier, lower_tier, tier_nodes);
+		memtier->lower_tier_mask = lower_tier;
+	}
 }
 
 #else
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5043b10ff71e..74b4ee8eca2b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,21 +1460,34 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
+	struct page *target_page;
+	nodemask_t *allowed_mask;
+	struct migration_target_control *mtc;
+
+	mtc = (struct migration_target_control *)private;
+
+	allowed_mask = mtc->nmask;
+	/*
+	 * Make sure we allocate from the target node first, also trying to
+	 * demote or reclaim pages from the target node via kswapd if we are
+	 * low on free memory on the target node. If we don't do this and if
+	 * we have free memory on the slower (lower) memtier, we would start
+	 * allocating pages from the slower (lower) memory tiers without even
+	 * forcing a demotion of cold pages from the target memtier. This can
+	 * result in the kernel placing hot pages in slower (lower) memory tiers.
+	 */
+	mtc->nmask = NULL;
+	mtc->gfp_mask |= __GFP_THISNODE;
+	target_page = alloc_migration_target(page, (unsigned long)mtc);
+	if (target_page)
+		return target_page;
 
-	return alloc_migration_target(page, (unsigned long)&mtc);
+	mtc->gfp_mask &= ~__GFP_THISNODE;
+	mtc->nmask = allowed_mask;
+
+	return alloc_migration_target(page, (unsigned long)mtc);
 }
 
 /*
@@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
+	node_get_allowed_targets(pgdat, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_pages, alloc_demote_page, NULL,
-		      target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-		      &nr_succeeded);
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
-- 
2.37.1