From nobody Sun Feb 8 12:14:37 2026 Received: from lgeamrelo03.lge.com (lgeamrelo03.lge.com [156.147.51.102]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A2EC824A06B for ; Sat, 31 Jan 2026 12:55:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=156.147.51.102 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769864161; cv=none; b=peRNNEQK2hPz7f85rvmEHp+QB005zkPIws54a+49z4R5BRx7CsTexZCIrX1INQvG3e7lLQ30ULqOn5ZZqJR7tQzkkjmATicx1quuYK11fP2SROQdvYuFdiGF+nV/1QnK7fp3o/Q8QG8gmLIWHUIzemPTyARbbnMhs4lsra/xgwA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769864161; c=relaxed/simple; bh=NB6Z6V/aDVN7if2XriRUUion0H+4rTJc7i1V5nG5NrM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=S8Czw3OY5qRLb+jYkX35PzeHPXSdyINN8/ajGXbmjVbKPvOAb0CbglJY8r/6jJeP8tyn4mDH2AydYmtrp4RlB5TfuC913VNDXWOLzLMwD5UCG8XeXEVaXArtoRxoVexTQ0Cj7E4ae2A/6vbyEvfnCKsnihddykuwVhGSPwlkBQU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=lge.com; spf=pass smtp.mailfrom=lge.com; arc=none smtp.client-ip=156.147.51.102 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=lge.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=lge.com Received: from unknown (HELO yjaykim-PowerEdge-T330.lge.net) (10.177.112.156) by 156.147.51.102 with ESMTP; 31 Jan 2026 21:55:53 +0900 X-Original-SENDERIP: 10.177.112.156 X-Original-MAILFROM: youngjun.park@lge.com From: Youngjun Park To: akpm@linux-foundation.org Cc: chrisl@kernel.org, kasong@tencent.com, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, gunho.lee@lge.com, youngjun.park@lge.com, taejoon.song@lge.com Subject: [RFC PATCH v3 1/5] mm: swap: introduce swap tier infrastructure Date: Sat, 31 Jan 2026 21:54:50 +0900 Message-Id: <20260131125454.3187546-2-youngjun.park@lge.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260131125454.3187546-1-youngjun.park@lge.com> References: <20260131125454.3187546-1-youngjun.park@lge.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This patch introduces the "Swap tier" concept, which serves as an abstraction layer for managing swap devices based on their performance characteristics (e.g., NVMe, HDD, Network swap). Swap tiers are user-named groups representing priority ranges. These tiers collectively cover the entire priority space from -1 (`DEF_SWAP_PRIO`) to `SHRT_MAX`. To configure tiers, a new sysfs interface is exposed at `/sys/kernel/mm/swap/tiers`. The input parser evaluates commands from left to right and supports batch input, allowing users to add, remove or modify multiple tiers in a single write operation. Tier management enforces continuous priority ranges anchored by start priorities. Operations trigger range splitting or merging, but overwriting start priorities is forbidden. Merging expands lower tiers upwards to preserve configured start priorities, except when removing `DEF_SWAP_PRIO`, which merges downwards. 
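For illustration only, a possible configuration session using the syntax
described above (tier names and priorities are arbitrary examples; the
semantics follow the documentation added later in this series):

  # create two tiers covering the whole range from -1 to SHRT_MAX
  echo "+HDD:50, +NET:-1" > /sys/kernel/mm/swap/tiers

  # move HDD's start priority and split a new SSD tier off its top,
  # evaluated left to right in a single write
  echo "HDD:80, +SSD:100" > /sys/kernel/mm/swap/tiers

  # remove SSD again; the tier below it (HDD) expands upwards
  echo "-SSD" > /sys/kernel/mm/swap/tiers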
Suggested-by: Chris Li Signed-off-by: Youngjun Park --- MAINTAINERS | 2 + mm/Makefile | 2 +- mm/swap.h | 4 + mm/swap_state.c | 70 +++++++++++ mm/swap_tier.c | 304 ++++++++++++++++++++++++++++++++++++++++++++++++ mm/swap_tier.h | 38 ++++++ mm/swapfile.c | 7 +- 7 files changed, 423 insertions(+), 4 deletions(-) create mode 100644 mm/swap_tier.c create mode 100644 mm/swap_tier.h diff --git a/MAINTAINERS b/MAINTAINERS index 18d1ebf053db..501bf46adfb4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16743,6 +16743,8 @@ F: mm/swap.c F: mm/swap.h F: mm/swap_table.h F: mm/swap_state.c +F: mm/swap_tier.c +F: mm/swap_tier.h F: mm/swapfile.c =20 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE) diff --git a/mm/Makefile b/mm/Makefile index 53ca5d4b1929..3b3de2de7285 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -75,7 +75,7 @@ ifdef CONFIG_MMU obj-$(CONFIG_ADVISE_SYSCALLS) +=3D madvise.o endif =20 -obj-$(CONFIG_SWAP) +=3D page_io.o swap_state.o swapfile.o +obj-$(CONFIG_SWAP) +=3D page_io.o swap_state.o swapfile.o swap_tier.o obj-$(CONFIG_ZSWAP) +=3D zswap.o obj-$(CONFIG_HAS_DMA) +=3D dmapool.o obj-$(CONFIG_HUGETLBFS) +=3D hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o diff --git a/mm/swap.h b/mm/swap.h index bfafa637c458..55f230cbe4e7 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -16,6 +16,10 @@ extern int page_cluster; #define swap_entry_order(order) 0 #endif =20 +#define DEF_SWAP_PRIO -1 + +extern spinlock_t swap_lock; +extern struct plist_head swap_active_head; extern struct swap_info_struct *swap_info[]; =20 /* diff --git a/mm/swap_state.c b/mm/swap_state.c index 6d0eef7470be..f1a7d9cdc648 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -25,6 +25,7 @@ #include "internal.h" #include "swap_table.h" #include "swap.h" +#include "swap_tier.h" =20 /* * swapper_space is a fiction, retained to simplify the path through @@ -947,8 +948,77 @@ static ssize_t vma_ra_enabled_store(struct kobject *ko= bj, } static struct kobj_attribute vma_ra_enabled_attr =3D __ATTR_RW(vma_ra_enab= led); =20 +static ssize_t tiers_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return swap_tiers_sysfs_show(buf); +} + +static ssize_t tiers_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + char *p, *token, *name, *tmp; + int ret =3D 0; + short prio; + DEFINE_SWAP_TIER_SAVE_CTX(ctx); + + tmp =3D kstrdup(buf, GFP_KERNEL); + if (!tmp) + return -ENOMEM; + + spin_lock(&swap_lock); + spin_lock(&swap_tier_lock); + + p =3D tmp; + swap_tiers_save(ctx); + + while (!ret && (token =3D strsep(&p, ", \t\n")) !=3D NULL) { + if (!*token) + continue; + + if (token[0] =3D=3D '-') { + ret =3D swap_tiers_remove(token + 1); + } else { + + name =3D strsep(&token, ":"); + if (!token || kstrtos16(token, 10, &prio)) { + ret =3D -EINVAL; + goto out; + } + + if (name[0] =3D=3D '+') + ret =3D swap_tiers_add(name + 1, prio); + else + ret =3D swap_tiers_modify(name, prio); + } + + if (ret) + goto restore; + } + + if (!swap_tiers_validate()) { + ret =3D -EINVAL; + goto restore; + } + +out: + spin_unlock(&swap_tier_lock); + spin_unlock(&swap_lock); + + kfree(tmp); + return ret ? 
ret : count; + +restore: + swap_tiers_restore(ctx); + goto out; +} + +static struct kobj_attribute tier_attr =3D __ATTR_RW(tiers); + static struct attribute *swap_attrs[] =3D { &vma_ra_enabled_attr.attr, + &tier_attr.attr, NULL, }; =20 diff --git a/mm/swap_tier.c b/mm/swap_tier.c new file mode 100644 index 000000000000..3bd011abee7c --- /dev/null +++ b/mm/swap_tier.c @@ -0,0 +1,304 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include "memcontrol-v1.h" +#include +#include + +#include "swap.h" +#include "swap_tier.h" + +/* + * struct swap_tier - structure representing a swap tier. + * + * @name: name of the swap_tier. + * @prio: starting value of priority. + * @list: linked list of tiers. +*/ +static struct swap_tier { + char name[MAX_TIERNAME]; + short prio; + struct list_head list; +} swap_tiers[MAX_SWAPTIER]; + +DEFINE_SPINLOCK(swap_tier_lock); +/* active swap priority list, sorted in descending order */ +static LIST_HEAD(swap_tier_active_list); +/* unused swap_tier object */ +static LIST_HEAD(swap_tier_inactive_list); + +#define TIER_IDX(tier) ((tier) - swap_tiers) +#define TIER_MASK(tier) (1 << TIER_IDX(tier)) +#define TIER_INVALID_PRIO (DEF_SWAP_PRIO - 1) +#define TIER_END_PRIO(tier) \ + (!list_is_first(&(tier)->list, &swap_tier_active_list) ? \ + list_prev_entry((tier), list)->prio - 1 : SHRT_MAX) + +#define for_each_tier(tier, idx) \ + for (idx =3D 0, tier =3D &swap_tiers[0]; idx < MAX_SWAPTIER; \ + idx++, tier =3D &swap_tiers[idx]) + +#define for_each_active_tier(tier) \ + list_for_each_entry(tier, &swap_tier_active_list, list) + +#define for_each_inactive_tier(tier) \ + list_for_each_entry(tier, &swap_tier_inactive_list, list) + +/* + * Naming Convention: + * swap_tiers_*() - Public/exported functions + * swap_tier_*() - Private/internal functions + */ + +static bool swap_tier_is_active(void) +{ + return !list_empty(&swap_tier_active_list) ? 
true : false; +} + +static struct swap_tier *swap_tier_lookup(const char *name) +{ + struct swap_tier *tier; + + for_each_active_tier(tier) { + if (!strcmp(tier->name, name)) + return tier; + } + + return NULL; +} + +void swap_tiers_init(void) +{ + struct swap_tier *tier; + int idx; + + BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER); + + for_each_tier(tier, idx) { + INIT_LIST_HEAD(&tier->list); + list_add_tail(&tier->list, &swap_tier_inactive_list); + } +} + +ssize_t swap_tiers_sysfs_show(char *buf) +{ + struct swap_tier *tier; + ssize_t len =3D 0; + + len +=3D sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n", + "Name", "Idx", "PrioStart", "PrioEnd"); + + spin_lock(&swap_tier_lock); + for_each_active_tier(tier) { + len +=3D sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n", + tier->name, + TIER_IDX(tier), + tier->prio, + TIER_END_PRIO(tier)); + if (len >=3D PAGE_SIZE) + break; + } + spin_unlock(&swap_tier_lock); + + return len; +} + +static void swap_tier_insert_by_prio(struct swap_tier *new) +{ + struct swap_tier *tier; + + for_each_active_tier(tier) { + if (tier->prio > new->prio) + continue; + + list_add_tail(&new->list, &tier->list); + return; + } + /* First addition, or becomes the first tier */ + list_add_tail(&new->list, &swap_tier_active_list); +} + +static void __swap_tier_prepare(struct swap_tier *tier, const char *name, + short prio) +{ + list_del_init(&tier->list); + strscpy(tier->name, name, MAX_TIERNAME); + tier->prio =3D prio; +} + +static struct swap_tier *swap_tier_prepare(const char *name, short prio) +{ + struct swap_tier *tier; + + lockdep_assert_held(&swap_tier_lock); + + if (prio < DEF_SWAP_PRIO) + return ERR_PTR(-EINVAL); + + if (list_empty(&swap_tier_inactive_list)) + return ERR_PTR(-EPERM); + + tier =3D list_first_entry(&swap_tier_inactive_list, + struct swap_tier, list); + + __swap_tier_prepare(tier, name, prio); + return tier; +} + +static int swap_tier_check_range(short prio) +{ + struct swap_tier *tier; + + lockdep_assert_held(&swap_lock); + lockdep_assert_held(&swap_tier_lock); + + for_each_active_tier(tier) { + /* No overwrite */ + if (tier->prio =3D=3D prio) + return -EINVAL; + } + + return 0; +} + +int swap_tiers_add(const char *name, int prio) +{ + int ret; + struct swap_tier *tier; + + lockdep_assert_held(&swap_lock); + lockdep_assert_held(&swap_tier_lock); + + /* Duplicate check */ + if (swap_tier_lookup(name)) + return -EPERM; + + ret =3D swap_tier_check_range(prio); + if (ret) + return ret; + + tier =3D swap_tier_prepare(name, prio); + if (IS_ERR(tier)) { + ret =3D PTR_ERR(tier); + return ret; + } + + + swap_tier_insert_by_prio(tier); + return ret; +} + +int swap_tiers_remove(const char *name) +{ + int ret =3D 0; + struct swap_tier *tier; + + lockdep_assert_held(&swap_lock); + lockdep_assert_held(&swap_tier_lock); + + tier =3D swap_tier_lookup(name); + if (!tier) + return -EINVAL; + + /* Removing DEF_SWAP_PRIO merges into the higher tier. 
*/ + if (!list_is_singular(&swap_tier_active_list) + && tier->prio =3D=3D DEF_SWAP_PRIO) + list_prev_entry(tier, list)->prio =3D DEF_SWAP_PRIO; + + list_move(&tier->list, &swap_tier_inactive_list); + return ret; +} + +int swap_tiers_modify(const char *name, int prio) +{ + int ret; + struct swap_tier *tier; + + lockdep_assert_held(&swap_lock); + lockdep_assert_held(&swap_tier_lock); + + tier =3D swap_tier_lookup(name); + if (!tier) + return -EINVAL; + + /* No need to modify */ + if (tier->prio =3D=3D prio) + return 0; + + ret =3D swap_tier_check_range(prio); + if (ret) + return ret; + + list_del_init(&tier->list); + tier->prio =3D prio; + swap_tier_insert_by_prio(tier); + + return ret; +} + +/* + * XXX: Reverting individual operations becomes complex as the number of + * operations grows. Instead, we save the original state beforehand and + * fully restore it if any operation fails. + */ +void swap_tiers_save(struct swap_tier_save_ctx ctx[]) +{ + struct swap_tier *tier; + int idx; + + lockdep_assert_held(&swap_lock); + lockdep_assert_held(&swap_tier_lock); + + for_each_active_tier(tier) { + idx =3D TIER_IDX(tier); + strcpy(ctx[idx].name, tier->name); + ctx[idx].prio =3D tier->prio; + } + + for_each_inactive_tier(tier) { + idx =3D TIER_IDX(tier); + /* Indicator of inactive */ + ctx[idx].prio =3D TIER_INVALID_PRIO; + } +} + +void swap_tiers_restore(struct swap_tier_save_ctx ctx[]) +{ + struct swap_tier *tier; + int idx; + + lockdep_assert_held(&swap_lock); + lockdep_assert_held(&swap_tier_lock); + + /* Invalidate active list */ + list_splice_tail_init(&swap_tier_active_list, + &swap_tier_inactive_list); + + for_each_tier(tier, idx) { + if (ctx[idx].prio !=3D TIER_INVALID_PRIO) { + /* Preserve idx(mask) */ + __swap_tier_prepare(tier, ctx[idx].name, ctx[idx].prio); + swap_tier_insert_by_prio(tier); + } + } +} + +bool swap_tiers_validate(void) +{ + struct swap_tier *tier; + + /* + * Initial setting might not cover DEF_SWAP_PRIO. + * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX). + * Also, modify operation can change only one remaining priority. 
+ */ + if (swap_tier_is_active()) { + tier =3D list_last_entry(&swap_tier_active_list, + struct swap_tier, list); + + if (tier->prio !=3D DEF_SWAP_PRIO) + return false; + } + + return true; +} diff --git a/mm/swap_tier.h b/mm/swap_tier.h new file mode 100644 index 000000000000..4b1b0602d691 --- /dev/null +++ b/mm/swap_tier.h @@ -0,0 +1,38 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _SWAP_TIER_H +#define _SWAP_TIER_H + +#include +#include + +#define MAX_TIERNAME 16 + +/* Ensure MAX_SWAPTIER does not exceed MAX_SWAPFILES */ +#if 8 > MAX_SWAPFILES +#define MAX_SWAPTIER MAX_SWAPFILES +#else +#define MAX_SWAPTIER 8 +#endif + +extern spinlock_t swap_tier_lock; + +struct swap_tier_save_ctx { + char name[MAX_TIERNAME]; + short prio; +}; + +#define DEFINE_SWAP_TIER_SAVE_CTX(_name) \ + struct swap_tier_save_ctx _name[MAX_SWAPTIER] =3D {0} + +/* Initialization and application */ +void swap_tiers_init(void); +ssize_t swap_tiers_sysfs_show(char *buf); + +int swap_tiers_add(const char *name, int prio); +int swap_tiers_remove(const char *name); +int swap_tiers_modify(const char *name, int prio); + +void swap_tiers_save(struct swap_tier_save_ctx ctx[]); +void swap_tiers_restore(struct swap_tier_save_ctx ctx[]); +bool swap_tiers_validate(void); +#endif /* _SWAP_TIER_H */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 7b055f15d705..c27952b41d4f 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -50,6 +50,7 @@ #include "internal.h" #include "swap_table.h" #include "swap.h" +#include "swap_tier.h" =20 static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); @@ -65,7 +66,7 @@ static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, enum swap_cluster_flags new_flags); =20 -static DEFINE_SPINLOCK(swap_lock); +DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; atomic_long_t nr_swap_pages; /* @@ -76,7 +77,6 @@ atomic_long_t nr_swap_pages; EXPORT_SYMBOL_GPL(nr_swap_pages); /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */ long total_swap_pages; -#define DEF_SWAP_PRIO -1 unsigned long swapfile_maximum_size; #ifdef CONFIG_MIGRATION bool swap_migration_ad_supported; @@ -89,7 +89,7 @@ static const char Bad_offset[] =3D "Bad swap offset entry= "; * all active swap_info_structs * protected with swap_lock, and ordered by priority. 
*/ -static PLIST_HEAD(swap_active_head); +PLIST_HEAD(swap_active_head); =20 /* * all available (active, not full) swap_info_structs @@ -3977,6 +3977,7 @@ static int __init swapfile_init(void) swap_migration_ad_supported =3D true; #endif /* CONFIG_MIGRATION */ =20 + swap_tiers_init(); return 0; } subsys_initcall(swapfile_init); --=20 2.34.1 From nobody Sun Feb 8 12:14:37 2026 Received: from lgeamrelo03.lge.com (lgeamrelo03.lge.com [156.147.51.102]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6BE02261B8F for ; Sat, 31 Jan 2026 12:56:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=156.147.51.102 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769864170; cv=none; b=cx12LcGUZGmghdjZyhwolfNrT3IpeAC45dFRO6BFPASqJs7lIsFIlxXI+iyVA9AWtGgn8s/ysOhli2wUmSE1QrWi4pHmPxrCr6bFp557GNk0NJ24phGDenPzrAO2XHwylWx5JsvteMsq3/93xWHwmkfTP/ERDLp17frepP+F/iM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769864170; c=relaxed/simple; bh=oURYD9tlVefoRIfQOj01sG8pXeg74/HP6wBM89lR3h0=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=qxGu9jaMoXtBtXDfhqczN+Wjmp6/TsO0J+aS90FXai9rkk5zwV7iJubE06iMAtjU9T3ZVtz7v2SdD7h8bhNwAOWT817TM+Uc4btLjKxqIu77ic19RkgX/VbRWqE+/HIK+ZxDxxdAATFXoLMeQnhWWLprs1UaWm0F6u7Ion+uW2Q= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=lge.com; spf=pass smtp.mailfrom=lge.com; arc=none smtp.client-ip=156.147.51.102 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=lge.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=lge.com Received: from unknown (HELO yjaykim-PowerEdge-T330.lge.net) (10.177.112.156) by 156.147.51.102 with ESMTP; 31 Jan 2026 21:55:59 +0900 X-Original-SENDERIP: 10.177.112.156 X-Original-MAILFROM: youngjun.park@lge.com From: Youngjun Park To: akpm@linux-foundation.org Cc: chrisl@kernel.org, kasong@tencent.com, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, gunho.lee@lge.com, youngjun.park@lge.com, taejoon.song@lge.com Subject: [RFC PATCH v3 2/5] mm: swap: associate swap devices with tiers Date: Sat, 31 Jan 2026 21:54:51 +0900 Message-Id: <20260131125454.3187546-3-youngjun.park@lge.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260131125454.3187546-1-youngjun.park@lge.com> References: <20260131125454.3187546-1-youngjun.park@lge.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This patch connects swap devices to the swap tier infrastructure, ensuring that devices are correctly assigned to tiers based on their priority. A `tier_mask` is added to identify the tier membership of swap devices. Although tier-based allocation logic is not yet implemented, this mapping is necessary to track which tier a device belongs to. Upon activation, the device is assigned to a tier by matching its priority against the configured tier ranges. The infrastructure allows dynamic modification of tiers, such as splitting or merging ranges. 
These operations are permitted provided that the tier assignment of already
configured swap devices remains unchanged.

This patch also adds the documentation for the swap tier feature, covering
the core concepts, sysfs interface usage, and configuration details.

Signed-off-by: Youngjun Park
---
 Documentation/mm/swap-tier.rst | 109 +++++++++++++++++++++++++++++++++
 include/linux/swap.h           |   1 +
 mm/swap_state.c                |   2 +-
 mm/swap_tier.c                 | 100 +++++++++++++++++++++++++++---
 mm/swap_tier.h                 |  13 +++-
 mm/swapfile.c                  |   2 +
 6 files changed, 215 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst

diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
new file mode 100644
index 000000000000..3386161b9b18
--- /dev/null
+++ b/Documentation/mm/swap-tier.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li, Youngjun Park
+
+==========
+Swap Tier
+==========
+
+Swap tier is a collection of user-named groups classified by priority ranges.
+It acts as a facilitation layer, allowing users to manage swap devices based
+on their speeds.
+
+Users are encouraged to assign swap device priorities according to device
+speed to fully utilize this feature. While the current implementation is
+integrated with cgroups, the concept is designed to be extensible for other
+subsystems in the future.
+
+Use case
+--------
+
+Users can perform selective swapping by choosing a swap tier assigned according
+to speed within a cgroup.
+
+For more information on cgroup v2, please refer to
+``Documentation/admin-guide/cgroup-v2.rst``.
+
+Priority Range
+--------------
+
+The specified tiers must cover the entire priority range from -1
+(DEF_SWAP_PRIO) to SHRT_MAX.
+
+Consistency
+-----------
+
+Tier consistency is guaranteed with a focus on maximizing flexibility. When a
+swap device is activated within a tier range, a reference is held from the
+start of the tier to the priority of that swap device. This ensures that the
+tier region containing the active swap device does not disappear.
+
+If a request to add a new tier with a priority higher than the current swap
+device is received, the existing tier can be split.
+
+However, specifying a tier in a cgroup does not hold a reference to the tier.
+Consequently, the corresponding tier can disappear at any time.
+
+Configuration Interface
+-----------------------
+
+The swap tiers can be configured via the following interface:
+
+/sys/kernel/mm/swap/tiers
+
+Operations can be performed using the following syntax:
+
+* Add: ``+"<name>":"<prio>"``
+* Remove: ``-"<name>"``
+* Modify: ``"<name>":"<prio>"``
+
+Multiple operations can be provided in a single write, separated by spaces (" ")
+or commas (",").
+
+When configuring tiers, the specified value represents the **start priority**
+of that tier. The end priority is automatically determined by the start
+priority of the next higher tier. Consequently, adding or modifying a tier
+automatically adjusts (splits or merges) the ranges of adjacent tiers to
+ensure continuity.
+
+Examples
+--------
+
+**1. Initialization**
+
+A tier starting at -1 is mandatory to cover the entire priority range up to
+SHRT_MAX. In this example, 'HDD' starts at 50, and 'NET' covers the remaining
+lower range starting from -1.
+
+::
+
+   # echo "+HDD:50, +NET:-1" > /sys/kernel/mm/swap/tiers
+   # cat /sys/kernel/mm/swap/tiers
+   Name             Idx   PrioStart   PrioEnd
+   HDD              0     50          32767
+   NET              1     -1          49
+
+**2.
Modification and Splitting** + +Here, 'HDD' is moved to start at 80, and a new tier 'SSD' is added at 100. +Notice how the ranges are automatically recalculated: +* 'SSD' takes the top range. Split HDD Tier's range. (100 to SHRT_MAX). +* 'HDD' is adjusted to the range between 'NET' and 'SSD' (80 to 99). +* 'NET' automatically extends to fill the gap below 'HDD' (-1 to 79). + +:: + + # echo "HDD:80, +SSD:100" > /sys/kernel/mm/swap/tiers + # cat /sys/kernel/mm/swap/tiers + Name Idx PrioStart PrioEnd + SSD 2 100 32767 + HDD 0 80 99 + NET 1 -1 79 + +**3. Removal** + +Tiers can be removed using the '-' prefix. + +:: + + # echo "-SSD,-HDD,-NET" > /sys/kernel/mm/swap/tiers diff --git a/include/linux/swap.h b/include/linux/swap.h index 62fc7499b408..1e68c220a0e7 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -262,6 +262,7 @@ struct swap_info_struct { struct percpu_ref users; /* indicate and keep swap device valid. */ unsigned long flags; /* SWP_USED etc: see above */ signed short prio; /* swap priority of this type */ + int tier_mask; /* swap tier mask */ struct plist_node list; /* entry in swap_active_head */ signed char type; /* strange name for an index */ unsigned int max; /* extent of the swap_map */ diff --git a/mm/swap_state.c b/mm/swap_state.c index f1a7d9cdc648..d46ca61d2e42 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -997,7 +997,7 @@ static ssize_t tiers_store(struct kobject *kobj, goto restore; } =20 - if (!swap_tiers_validate()) { + if (!swap_tiers_update()) { ret =3D -EINVAL; goto restore; } diff --git a/mm/swap_tier.c b/mm/swap_tier.c index 3bd011abee7c..7741214312c7 100644 --- a/mm/swap_tier.c +++ b/mm/swap_tier.c @@ -14,7 +14,7 @@ * @name: name of the swap_tier. * @prio: starting value of priority. * @list: linked list of tiers. -*/ + */ static struct swap_tier { char name[MAX_TIERNAME]; short prio; @@ -34,6 +34,8 @@ static LIST_HEAD(swap_tier_inactive_list); (!list_is_first(&(tier)->list, &swap_tier_active_list) ? \ list_prev_entry((tier), list)->prio - 1 : SHRT_MAX) =20 +#define MASK_TO_TIER(mask) (&swap_tiers[__ffs((mask))]) + #define for_each_tier(tier, idx) \ for (idx =3D 0, tier =3D &swap_tiers[0]; idx < MAX_SWAPTIER; \ idx++, tier =3D &swap_tiers[idx]) @@ -55,6 +57,26 @@ static bool swap_tier_is_active(void) return !list_empty(&swap_tier_active_list) ? 
true : false; } =20 +static bool swap_tier_prio_in_range(struct swap_tier *tier, short prio) +{ + if (tier->prio <=3D prio && TIER_END_PRIO(tier) >=3D prio) + return true; + + return false; +} + +static bool swap_tier_prio_is_used(struct swap_tier *self, short prio) +{ + struct swap_tier *tier; + + for_each_active_tier(tier) { + if (tier !=3D self && tier->prio =3D=3D prio) + return true; + } + + return false; +} + static struct swap_tier *swap_tier_lookup(const char *name) { struct swap_tier *tier; @@ -67,12 +89,14 @@ static struct swap_tier *swap_tier_lookup(const char *n= ame) return NULL; } =20 + void swap_tiers_init(void) { struct swap_tier *tier; int idx; =20 BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER); + BUILD_BUG_ON(MAX_SWAPTIER > TIER_DEFAULT_IDX); =20 for_each_tier(tier, idx) { INIT_LIST_HEAD(&tier->list); @@ -145,17 +169,35 @@ static struct swap_tier *swap_tier_prepare(const char= *name, short prio) return tier; } =20 -static int swap_tier_check_range(short prio) +static int swap_tier_can_split_range(struct swap_tier *orig_tier, + short new_prio) { + struct swap_info_struct *p; struct swap_tier *tier; =20 lockdep_assert_held(&swap_lock); lockdep_assert_held(&swap_tier_lock); =20 - for_each_active_tier(tier) { - /* No overwrite */ - if (tier->prio =3D=3D prio) - return -EINVAL; + plist_for_each_entry(p, &swap_active_head, list) { + if (p->tier_mask =3D=3D TIER_DEFAULT_MASK) + continue; + + tier =3D MASK_TO_TIER(p->tier_mask); + if (tier->prio > new_prio) + continue; + /* + * Prohibit implicit tier reassignment. + * Case 1: Prevent orig_tier devices from dropping out + * of the new range. + */ + if (orig_tier =3D=3D tier && (p->prio < new_prio)) + return -EBUSY; + /* + * Case 2: Prevent other tier devices from entering + * the new range. + */ + else if (orig_tier !=3D tier && (p->prio >=3D new_prio)) + return -EBUSY; } =20 return 0; @@ -173,7 +215,10 @@ int swap_tiers_add(const char *name, int prio) if (swap_tier_lookup(name)) return -EPERM; =20 - ret =3D swap_tier_check_range(prio); + if (swap_tier_prio_is_used(NULL, prio)) + return -EBUSY; + + ret =3D swap_tier_can_split_range(NULL, prio); if (ret) return ret; =20 @@ -183,7 +228,6 @@ int swap_tiers_add(const char *name, int prio) return ret; } =20 - swap_tier_insert_by_prio(tier); return ret; } @@ -200,6 +244,11 @@ int swap_tiers_remove(const char *name) if (!tier) return -EINVAL; =20 + /* Simulate adding a tier to check for conflicts */ + ret =3D swap_tier_can_split_range(NULL, tier->prio); + if (ret) + return ret; + /* Removing DEF_SWAP_PRIO merges into the higher tier. */ if (!list_is_singular(&swap_tier_active_list) && tier->prio =3D=3D DEF_SWAP_PRIO) @@ -225,7 +274,10 @@ int swap_tiers_modify(const char *name, int prio) if (tier->prio =3D=3D prio) return 0; =20 - ret =3D swap_tier_check_range(prio); + if (swap_tier_prio_is_used(tier, prio)) + return -EBUSY; + + ret =3D swap_tier_can_split_range(tier, prio); if (ret) return ret; =20 @@ -283,9 +335,26 @@ void swap_tiers_restore(struct swap_tier_save_ctx ctx[= ]) } } =20 -bool swap_tiers_validate(void) +void swap_tiers_assign_dev(struct swap_info_struct *swp) +{ + struct swap_tier *tier; + + lockdep_assert_held(&swap_lock); + + for_each_active_tier(tier) { + if (swap_tier_prio_in_range(tier, swp->prio)) { + swp->tier_mask =3D TIER_MASK(tier); + return; + } + } + + swp->tier_mask =3D TIER_DEFAULT_MASK; +} + +bool swap_tiers_update(void) { struct swap_tier *tier; + struct swap_info_struct *swp; =20 /* * Initial setting might not cover DEF_SWAP_PRIO. 
@@ -300,5 +369,16 @@ bool swap_tiers_validate(void) return false; } =20 + /* + * If applied initially, the swap tier_mask may change + * from the default value. + */ + plist_for_each_entry(swp, &swap_active_head, list) { + /* Tier is already configured */ + if (swp->tier_mask !=3D TIER_DEFAULT_MASK) + break; + swap_tiers_assign_dev(swp); + } + return true; } diff --git a/mm/swap_tier.h b/mm/swap_tier.h index 4b1b0602d691..de81d540e3b5 100644 --- a/mm/swap_tier.h +++ b/mm/swap_tier.h @@ -14,6 +14,9 @@ #define MAX_SWAPTIER 8 #endif =20 +/* Forward declarations */ +struct swap_info_struct; + extern spinlock_t swap_tier_lock; =20 struct swap_tier_save_ctx { @@ -24,6 +27,10 @@ struct swap_tier_save_ctx { #define DEFINE_SWAP_TIER_SAVE_CTX(_name) \ struct swap_tier_save_ctx _name[MAX_SWAPTIER] =3D {0} =20 +#define TIER_ALL_MASK (~0) +#define TIER_DEFAULT_IDX (31) +#define TIER_DEFAULT_MASK (1 << TIER_DEFAULT_IDX) + /* Initialization and application */ void swap_tiers_init(void); ssize_t swap_tiers_sysfs_show(char *buf); @@ -34,5 +41,9 @@ int swap_tiers_modify(const char *name, int prio); =20 void swap_tiers_save(struct swap_tier_save_ctx ctx[]); void swap_tiers_restore(struct swap_tier_save_ctx ctx[]); -bool swap_tiers_validate(void); +bool swap_tiers_update(void); + +/* Tier assignment */ +void swap_tiers_assign_dev(struct swap_info_struct *swp); + #endif /* _SWAP_TIER_H */ diff --git a/mm/swapfile.c b/mm/swapfile.c index c27952b41d4f..4f8ce021c5bd 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2672,6 +2672,8 @@ static void _enable_swap_info(struct swap_info_struct= *si) =20 /* Add back to available list */ add_to_avail_list(si, true); + + swap_tiers_assign_dev(si); } =20 static void enable_swap_info(struct swap_info_struct *si, int prio, --=20 2.34.1 From nobody Sun Feb 8 12:14:37 2026 Received: from lgeamrelo03.lge.com (lgeamrelo03.lge.com [156.147.51.102]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 34CF0276051 for ; Sat, 31 Jan 2026 12:56:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=156.147.51.102 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769864180; cv=none; b=B/yEfTBECtmQk4BuTYsEJkBCTSL7qusXqfojAovb6mV0aZjGgTPAqI31PGN82bkMVFDGcgE4mzAGqi0wImG8BO6T45B9E5sfvrJrIwiS1M0jwbliQ+/0E+N+HfcjPl/pDFZkJ9mCSnsu/kJ4Xf9GbcNP1SknB4Cf4wvTNOVBP9I= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769864180; c=relaxed/simple; bh=aYFBQp/xBdjvMhL6Oe09UjWrgOAst8Z/rG7MbY5eIyo=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=pVOtZmKryrcrxcKnRTnHH7xNMyNzEomyVtv62fVIumwMBCN5t+EzCMcCx/yu/N6xOFE0UPhiDC6FL8LN/Xu4qSeLE3WHFODJLyOBDVtrbjcbZnK+cQ2bUTFtU7DTJtN78Y1uF6xkUPtik4rwS1gJ/zSAGyQxYNtPFGABsu/ZKXM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=lge.com; spf=pass smtp.mailfrom=lge.com; arc=none smtp.client-ip=156.147.51.102 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=lge.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=lge.com Received: from unknown (HELO yjaykim-PowerEdge-T330.lge.net) (10.177.112.156) by 156.147.51.102 with ESMTP; 31 Jan 2026 21:56:08 +0900 X-Original-SENDERIP: 10.177.112.156 X-Original-MAILFROM: youngjun.park@lge.com From: Youngjun Park To: akpm@linux-foundation.org Cc: chrisl@kernel.org, kasong@tencent.com, 
hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, gunho.lee@lge.com, youngjun.park@lge.com, taejoon.song@lge.com Subject: [RFC PATCH v3 3/5] mm: memcontrol: add interface for swap tier selection Date: Sat, 31 Jan 2026 21:54:52 +0900 Message-Id: <20260131125454.3187546-4-youngjun.park@lge.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260131125454.3187546-1-youngjun.park@lge.com> References: <20260131125454.3187546-1-youngjun.park@lge.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This patch integrates the swap tier infrastructure with cgroup, enabling the selection of specific swap devices per cgroup by configuring allowed swap tiers. The new `memory.swap.tiers` interface controls allowed swap tiers via a mas= k. By default, the mask is set to include all tiers, allowing specific tiers to be excluded or restored. Note that effective tiers are calculated separately using a dedicated mask to respect the cgroup hierarchy. Consequently, configured tiers may differ from effective ones, as they must be a subset of the parent's. Note that cgroups do not pin swap tiers. This is similar to the `cpuset` controller, which does not prevent CPU hotplug. This approach ensures flexibility by allowing tier configuration changes regardless of cgroup usage. Signed-off-by: Youngjun Park --- Documentation/admin-guide/cgroup-v2.rst | 27 ++++++++ include/linux/memcontrol.h | 3 +- mm/memcontrol.c | 85 +++++++++++++++++++++++ mm/swap_state.c | 6 +- mm/swap_tier.c | 89 ++++++++++++++++++++++++- mm/swap_tier.h | 39 ++++++++++- mm/swapfile.c | 4 ++ 7 files changed, 246 insertions(+), 7 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-= guide/cgroup-v2.rst index 7f5b59d95fce..776a908ce1b9 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1848,6 +1848,33 @@ The following nested keys are defined. Swap usage hard limit. If a cgroup's swap usage reaches this limit, anonymous memory of the cgroup will not be swapped out. =20 + memory.swap.tiers + A read-write nested-keyed file which exists on non-root + cgroups. The default is to enable all tiers. + + This interface allows selecting which swap tiers a cgroup can + use for swapping out memory. + + The effective tiers are inherited from the parent. Only tiers + effective in the parent can be effective in the child. However, + the child can explicitly disable tiers allowed by the parent. + + When read, the file shows two lines: + - The first line shows the operation string that was + written to this file. + - The second line shows the effective operation after + merging with parent settings. + + When writing, the format is: + (+/-)(TIER_NAME) (+/-)(TIER_NAME) ... + + Valid tier names are those configured in + /sys/kernel/mm/swap/tiers. + + Each tier can be prefixed with: + + Enable this tier + - Disable this tier + memory.swap.events A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. 
Unless specified diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b6c82c8f73e1..542bee1b5f60 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -283,7 +283,8 @@ struct mem_cgroup { /* per-memcg mm_struct list */ struct lru_gen_mm_list mm_list; #endif - + int tier_mask; + int tier_effective_mask; #ifdef CONFIG_MEMCG_V1 /* Legacy consumer-oriented counters */ struct page_counter kmem; /* v1 only */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 007413a53b45..5fcf8ebe0ca8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -68,6 +68,7 @@ #include #include "slab.h" #include "memcontrol-v1.h" +#include "swap_tier.h" =20 #include =20 @@ -3691,6 +3692,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) { lru_gen_exit_memcg(memcg); memcg_wb_domain_exit(memcg); + swap_tiers_memcg_sync_mask(memcg); __mem_cgroup_free(memcg); } =20 @@ -3792,6 +3794,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *pare= nt_css) WRITE_ONCE(memcg->zswap_writeback, true); #endif page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); + memcg->tier_mask =3D TIER_ALL_MASK; + swap_tiers_memcg_inherit_mask(memcg, parent); + if (parent) { WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent)); =20 @@ -5352,6 +5357,80 @@ static int swap_events_show(struct seq_file *m, void= *v) return 0; } =20 +static int swap_tier_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg =3D mem_cgroup_from_seq(m); + + swap_tiers_mask_show(m, memcg->tier_mask); + swap_tiers_mask_show(m, memcg->tier_effective_mask); + + return 0; +} + +static ssize_t swap_tier_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg =3D mem_cgroup_from_css(of_css(of)); + char *pos, *token; + int ret =3D 0; + int original_mask; + + pos =3D strstrip(buf); + + spin_lock(&swap_tier_lock); + if (!*pos) { + memcg->tier_mask =3D TIER_ALL_MASK; + goto sync; + } + + original_mask =3D memcg->tier_mask; + + while ((token =3D strsep(&pos, " \t\n")) !=3D NULL) { + int mask; + + if (!*token) + continue; + + if (token[0] !=3D '-' && token[0] !=3D '+') { + ret =3D -EINVAL; + goto err; + } + + mask =3D swap_tiers_mask_lookup(token+1); + if (!mask) { + ret =3D -EINVAL; + goto err; + } + + /* + * if child already set, cannot add that tiers for hierarch mismatching. + * parent compatible, child must respect parent selected swap device. + */ + switch (token[0]) { + case '-': + memcg->tier_mask &=3D ~mask; + break; + case '+': + memcg->tier_mask |=3D mask; + break; + default: + ret =3D -EINVAL; + break; + } + + if (ret) + goto err; + } + +sync: + __swap_tiers_memcg_sync_mask(memcg); +err: + if (ret) + memcg->tier_mask =3D original_mask; + spin_unlock(&swap_tier_lock); + return ret ? 
ret : nbytes; +} + static struct cftype swap_files[] =3D { { .name =3D "swap.current", @@ -5384,6 +5463,12 @@ static struct cftype swap_files[] =3D { .file_offset =3D offsetof(struct mem_cgroup, swap_events_file), .seq_show =3D swap_events_show, }, + { + .name =3D "swap.tiers", + .flags =3D CFTYPE_NOT_ON_ROOT, + .seq_show =3D swap_tier_show, + .write =3D swap_tier_write, + }, { } /* terminate */ }; =20 diff --git a/mm/swap_state.c b/mm/swap_state.c index d46ca61d2e42..c0dcab74779d 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -961,6 +961,8 @@ static ssize_t tiers_store(struct kobject *kobj, char *p, *token, *name, *tmp; int ret =3D 0; short prio; + int mask =3D 0; + DEFINE_SWAP_TIER_SAVE_CTX(ctx); =20 tmp =3D kstrdup(buf, GFP_KERNEL); @@ -978,7 +980,7 @@ static ssize_t tiers_store(struct kobject *kobj, continue; =20 if (token[0] =3D=3D '-') { - ret =3D swap_tiers_remove(token + 1); + ret =3D swap_tiers_remove(token + 1, &mask); } else { =20 name =3D strsep(&token, ":"); @@ -997,7 +999,7 @@ static ssize_t tiers_store(struct kobject *kobj, goto restore; } =20 - if (!swap_tiers_update()) { + if (!swap_tiers_update(mask)) { ret =3D -EINVAL; goto restore; } diff --git a/mm/swap_tier.c b/mm/swap_tier.c index 7741214312c7..0e067ba545cb 100644 --- a/mm/swap_tier.c +++ b/mm/swap_tier.c @@ -232,7 +232,7 @@ int swap_tiers_add(const char *name, int prio) return ret; } =20 -int swap_tiers_remove(const char *name) +int swap_tiers_remove(const char *name, int *mask) { int ret =3D 0; struct swap_tier *tier; @@ -255,6 +255,8 @@ int swap_tiers_remove(const char *name) list_prev_entry(tier, list)->prio =3D DEF_SWAP_PRIO; =20 list_move(&tier->list, &swap_tier_inactive_list); + *mask |=3D TIER_MASK(tier); + return ret; } =20 @@ -351,7 +353,17 @@ void swap_tiers_assign_dev(struct swap_info_struct *sw= p) swp->tier_mask =3D TIER_DEFAULT_MASK; } =20 -bool swap_tiers_update(void) +static void swap_tier_memcg_propagate(int mask) +{ + struct mem_cgroup *child; + + for_each_mem_cgroup_tree(child, root_mem_cgroup) { + child->tier_mask |=3D mask; + child->tier_effective_mask |=3D mask; + } +} + +bool swap_tiers_update(int mask) { struct swap_tier *tier; struct swap_info_struct *swp; @@ -379,6 +391,79 @@ bool swap_tiers_update(void) break; swap_tiers_assign_dev(swp); } + /* + * XXX: Unused tiers default to ON, disabled after next tier added. + * Use removed tier mask to clear settings for removed/re-added tiers. + * (Could hold tier refs, but better to keep cgroup config independent) + */ + if (mask) + swap_tier_memcg_propagate(mask); =20 return true; } + +void swap_tiers_mask_show(struct seq_file *m, int mask) +{ + struct swap_tier *tier; + + spin_lock(&swap_tier_lock); + for_each_active_tier(tier) { + if (mask & TIER_MASK(tier)) + seq_printf(m, "%s ", tier->name); + } + spin_unlock(&swap_tier_lock); + seq_puts(m, "\n"); +} + +int swap_tiers_mask_lookup(const char *name) +{ + struct swap_tier *tier; + + lockdep_assert_held(&swap_tier_lock); + + for_each_active_tier(tier) { + if (!strcmp(name, tier->name)) + return TIER_MASK(tier); + } + + return 0; +} + +static void __swap_tier_memcg_inherit_mask(struct mem_cgroup *memcg, + struct mem_cgroup *parent) +{ + int effective_mask + =3D parent ? 
parent->tier_effective_mask : TIER_ALL_MASK; + + memcg->tier_effective_mask + =3D effective_mask & memcg->tier_mask; +} + +void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg, + struct mem_cgroup *parent) +{ + spin_lock(&swap_tier_lock); + __swap_tier_memcg_inherit_mask(memcg, parent); + spin_unlock(&swap_tier_lock); +} + +void __swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) +{ + struct mem_cgroup *child; + + lockdep_assert_held(&swap_tier_lock); + + if (memcg =3D=3D root_mem_cgroup) + return; + + for_each_mem_cgroup_tree(child, memcg) + __swap_tier_memcg_inherit_mask(child, parent_mem_cgroup(child)); +} + +void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) +{ + spin_lock(&swap_tier_lock); + memcg->tier_mask =3D TIER_ALL_MASK; + __swap_tiers_memcg_sync_mask(memcg); + spin_unlock(&swap_tier_lock); +} diff --git a/mm/swap_tier.h b/mm/swap_tier.h index de81d540e3b5..9024c82c807a 100644 --- a/mm/swap_tier.h +++ b/mm/swap_tier.h @@ -31,19 +31,54 @@ struct swap_tier_save_ctx { #define TIER_DEFAULT_IDX (31) #define TIER_DEFAULT_MASK (1 << TIER_DEFAULT_IDX) =20 +#ifdef CONFIG_MEMCG +static inline int folio_tier_effective_mask(struct folio *folio) +{ + struct mem_cgroup *memcg =3D folio_memcg(folio); + + return memcg ? memcg->tier_effective_mask : TIER_ALL_MASK; +} +#else +static inline int folio_tier_effective_mask(struct folio *folio) +{ + return TIER_ALL_MASK; +} +#endif + /* Initialization and application */ void swap_tiers_init(void); ssize_t swap_tiers_sysfs_show(char *buf); =20 int swap_tiers_add(const char *name, int prio); -int swap_tiers_remove(const char *name); +int swap_tiers_remove(const char *name, int *mask); int swap_tiers_modify(const char *name, int prio); =20 void swap_tiers_save(struct swap_tier_save_ctx ctx[]); void swap_tiers_restore(struct swap_tier_save_ctx ctx[]); -bool swap_tiers_update(void); +bool swap_tiers_update(int mask); =20 /* Tier assignment */ void swap_tiers_assign_dev(struct swap_info_struct *swp); =20 +/* Memcg related functions */ +void swap_tiers_mask_show(struct seq_file *m, int mask); +void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg, + struct mem_cgroup *parent); +void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg); +void __swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg); + +/* Mask and tier lookup */ +int swap_tiers_mask_lookup(const char *name); + +/** + * swap_tiers_mask_test - Check if the tier mask is valid + * @tier_mask: The tier mask to check + * @mask: The mask to compare against + * + * Return: true if condition matches, false otherwise + */ +static inline bool swap_tiers_mask_test(int tier_mask, int mask) +{ + return tier_mask & mask; +} #endif /* _SWAP_TIER_H */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 4f8ce021c5bd..e04811e10431 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1348,10 +1348,14 @@ static bool swap_alloc_fast(struct folio *folio) static void swap_alloc_slow(struct folio *folio) { struct swap_info_struct *si, *next; + int mask =3D folio_tier_effective_mask(folio); =20 spin_lock(&swap_avail_lock); start_over: plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) { + if (!swap_tiers_mask_test(si->tier_mask, mask)) + continue; + /* Rotate the device and switch to a new cluster */ plist_requeue(&si->avail_list, &swap_avail_head); spin_unlock(&swap_avail_lock); --=20 2.34.1 From nobody Sun Feb 8 12:14:37 2026 Received: from lgeamrelo03.lge.com (lgeamrelo03.lge.com [156.147.51.102]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client 
certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F286C26529A for ; Sat, 31 Jan 2026 12:56:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=156.147.51.102 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769864190; cv=none; b=qbHvHoC1tqBhbCZwlOzI3O53cJGznzw3gTNkjSW+OX7/EQi4l/sWvSR7zfzmxAeCfUxirDiMXnEwQFiechJV1k0RGktWpu1Br/886yfs8SVBhEffNcJn/oc29/Ypi+YPJwScPIbcq/+EdcxDGE4pv9FkLRhJ3JYD50aGT6RM2aA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769864190; c=relaxed/simple; bh=6gZY8gbKKPjjKx2S+wB6FuCLbbWcMbcNPstuxpPQZmA=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=O2zalf3dibvbxisatcErpm5iX0RG3nB9R7f458jh4xpyOBSrJMGcd/vbeGAMERapiqNxopxAekSLnjSmUF+eQvy6DEMa9xlWZowczAjX/n+6dVXDhL4pyDUL9Y83QriCUlveb9+olddO51z2K+A3dH5gVTwkyvgPsfX/xsSCdbg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=lge.com; spf=pass smtp.mailfrom=lge.com; arc=none smtp.client-ip=156.147.51.102 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=lge.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=lge.com Received: from unknown (HELO yjaykim-PowerEdge-T330.lge.net) (10.177.112.156) by 156.147.51.102 with ESMTP; 31 Jan 2026 21:56:20 +0900 X-Original-SENDERIP: 10.177.112.156 X-Original-MAILFROM: youngjun.park@lge.com From: Youngjun Park To: akpm@linux-foundation.org Cc: chrisl@kernel.org, kasong@tencent.com, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, gunho.lee@lge.com, youngjun.park@lge.com, taejoon.song@lge.com Subject: [RFC PATCH v3 4/5] mm, swap: change back to use each swap device's percpu cluster Date: Sat, 31 Jan 2026 21:54:53 +0900 Message-Id: <20260131125454.3187546-5-youngjun.park@lge.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260131125454.3187546-1-youngjun.park@lge.com> References: <20260131125454.3187546-1-youngjun.park@lge.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as allocation fast path"). Because in the newly introduced swap tiers, the global percpu cluster will cause two issues: 1) it will cause caching oscillation in the same order of different si if two different memcg can only be allowed to access different si and both of them are swapping out. 2) It can cause priority inversion on swap devices. Imagine a case where there are two memcg, say memcg1 and memcg2. Memcg1 can access si A, B and A is higher priority device. While memcg2 can only access si B. Then memcg 2 could write the global percpu cluster with si B, then memcg1 take si B in fast path even though si A is not exhausted. Hence in order to support swap tier, revert commit to use each swap device's percpu cluster. 
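To illustrate the priority inversion more concretely, a sketch of the
interleaving on one CPU (si/memcg names as in the example above; the call
sequence follows the code removed below and is illustrative, not a captured
trace):

  task in memcg2 (tier allows only si B):
    folio_alloc_swap()
      -> swap_alloc_fast() misses, swap_alloc_slow() allocates from si B
      -> percpu_swap_cluster.si[order] now caches si B on this CPU

  task in memcg1 (tiers allow si A and si B, A has higher priority):
    folio_alloc_swap()
      -> swap_alloc_fast() hits the cached si B and succeeds,
         so the allocation lands on si B even though si A is not exhausted

With per-device percpu clusters, the cached cluster hint belongs to a single
device, so a cache populated for one memcg can no longer steer another
memcg's allocation to a lower priority device.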
Suggested-by: Kairui Song Co-developed-by: Baoquan He Signed-off-by: Baoquan He Signed-off-by: Youngjun Park --- include/linux/swap.h | 17 +++-- mm/swapfile.c | 149 +++++++++++++++---------------------------- 2 files changed, 62 insertions(+), 104 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 1e68c220a0e7..6921e22b14d3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -247,11 +247,18 @@ enum { #define SWAP_NR_ORDERS 1 #endif =20 -/* - * We keep using same cluster for rotational device so IO will be sequenti= al. - * The purpose is to optimize SWAP throughput on these device. - */ + /* + * We assign a cluster to each CPU, so each CPU can allocate swap entry f= rom + * its own cluster and swapout sequentially. The purpose is to optimize s= wapout + * throughput. + */ +struct percpu_cluster { + local_lock_t lock; /* Protect the percpu_cluster above */ + unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ +}; + struct swap_sequential_cluster { + spinlock_t lock; /* Serialize usage of global cluster */ unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ }; =20 @@ -277,8 +284,8 @@ struct swap_info_struct { /* list of cluster that are fragmented or contented */ unsigned int pages; /* total of usable pages of swap */ atomic_long_t inuse_pages; /* number of those currently in use */ + struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap locatio= n */ struct swap_sequential_cluster *global_cluster; /* Use one global cluster= for rotating device */ - spinlock_t global_cluster_lock; /* Serialize usage of global cluster */ struct rb_root swap_extent_root;/* root of the swap extent rbtree */ struct block_device *bdev; /* swap device or bdev of swap file */ struct file *swap_file; /* seldom referenced */ diff --git a/mm/swapfile.c b/mm/swapfile.c index e04811e10431..4708014c96c4 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -118,18 +118,6 @@ static atomic_t proc_poll_event =3D ATOMIC_INIT(0); =20 atomic_t nr_rotate_swap =3D ATOMIC_INIT(0); =20 -struct percpu_swap_cluster { - struct swap_info_struct *si[SWAP_NR_ORDERS]; - unsigned long offset[SWAP_NR_ORDERS]; - local_lock_t lock; -}; - -static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) =3D= { - .si =3D { NULL }, - .offset =3D { SWAP_ENTRY_INVALID }, - .lock =3D INIT_LOCAL_LOCK(), -}; - /* May return NULL on invalid type, caller must check for NULL return */ static struct swap_info_struct *swap_type_to_info(int type) { @@ -477,8 +465,10 @@ swap_cluster_alloc_table(struct swap_info_struct *si, * Swap allocator uses percpu clusters and holds the local lock. */ lockdep_assert_held(&ci->lock); - lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock); - + if (si->flags & SWP_SOLIDSTATE) + lockdep_assert_held(this_cpu_ptr(&si->percpu_cluster->lock)); + else + lockdep_assert_held(&si->global_cluster->lock); /* The cluster must be free and was just isolated from the free list. */ VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); =20 @@ -494,9 +484,10 @@ swap_cluster_alloc_table(struct swap_info_struct *si, * the potential recursive allocation is limited. 
*/ spin_unlock(&ci->lock); - if (!(si->flags & SWP_SOLIDSTATE)) - spin_unlock(&si->global_cluster_lock); - local_unlock(&percpu_swap_cluster.lock); + if (si->flags & SWP_SOLIDSTATE) + local_unlock(&si->percpu_cluster->lock); + else + spin_unlock(&si->global_cluster->lock); =20 table =3D swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL); =20 @@ -508,9 +499,9 @@ swap_cluster_alloc_table(struct swap_info_struct *si, * could happen with ignoring the percpu cluster is fragmentation, * which is acceptable since this fallback and race is rare. */ - local_lock(&percpu_swap_cluster.lock); + local_lock(&si->percpu_cluster->lock); if (!(si->flags & SWP_SOLIDSTATE)) - spin_lock(&si->global_cluster_lock); + spin_lock(&si->global_cluster->lock); spin_lock(&ci->lock); =20 /* Nothing except this helper should touch a dangling empty cluster. */ @@ -622,7 +613,7 @@ static bool swap_do_scheduled_discard(struct swap_info_= struct *si) ci =3D list_first_entry(&si->discard_clusters, struct swap_cluster_info,= list); /* * Delete the cluster from list to prepare for discard, but keep - * the CLUSTER_FLAG_DISCARD flag, percpu_swap_cluster could be + * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster * pointing to it, or ran into by relocate_cluster. */ list_del(&ci->list); @@ -953,12 +944,11 @@ static unsigned int alloc_swap_scan_cluster(struct sw= ap_info_struct *si, out: relocate_cluster(si, ci); swap_cluster_unlock(ci); - if (si->flags & SWP_SOLIDSTATE) { - this_cpu_write(percpu_swap_cluster.offset[order], next); - this_cpu_write(percpu_swap_cluster.si[order], si); - } else { + if (si->flags & SWP_SOLIDSTATE) + this_cpu_write(si->percpu_cluster->next[order], next); + else si->global_cluster->next[order] =3D next; - } + return found; } =20 @@ -1052,13 +1042,17 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, if (order && !(si->flags & SWP_BLKDEV)) return 0; =20 - if (!(si->flags & SWP_SOLIDSTATE)) { + if (si->flags & SWP_SOLIDSTATE) { + /* Fast path using per CPU cluster */ + local_lock(&si->percpu_cluster->lock); + offset =3D __this_cpu_read(si->percpu_cluster->next[order]); + } else { /* Serialize HDD SWAP allocation for each device. */ - spin_lock(&si->global_cluster_lock); + spin_lock(&si->global_cluster->lock); offset =3D si->global_cluster->next[order]; - if (offset =3D=3D SWAP_ENTRY_INVALID) - goto new_cluster; + } =20 + if (offset !=3D SWAP_ENTRY_INVALID) { ci =3D swap_cluster_lock(si, offset); /* Cluster could have been used by another order */ if (cluster_is_usable(ci, order)) { @@ -1072,7 +1066,6 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, goto done; } =20 -new_cluster: /* * If the device need discard, prefer new cluster over nonfull * to spread out the writes. @@ -1129,8 +1122,10 @@ static unsigned long cluster_alloc_swap_entry(struct= swap_info_struct *si, goto done; } done: - if (!(si->flags & SWP_SOLIDSTATE)) - spin_unlock(&si->global_cluster_lock); + if (si->flags & SWP_SOLIDSTATE) + local_unlock(&si->percpu_cluster->lock); + else + spin_unlock(&si->global_cluster->lock); =20 return found; } @@ -1311,41 +1306,8 @@ static bool get_swap_device_info(struct swap_info_st= ruct *si) return true; } =20 -/* - * Fast path try to get swap entries with specified order from current - * CPU's swap entry pool (a cluster). 
- */ -static bool swap_alloc_fast(struct folio *folio) -{ - unsigned int order =3D folio_order(folio); - struct swap_cluster_info *ci; - struct swap_info_struct *si; - unsigned int offset; - - /* - * Once allocated, swap_info_struct will never be completely freed, - * so checking it's liveness by get_swap_device_info is enough. - */ - si =3D this_cpu_read(percpu_swap_cluster.si[order]); - offset =3D this_cpu_read(percpu_swap_cluster.offset[order]); - if (!si || !offset || !get_swap_device_info(si)) - return false; - - ci =3D swap_cluster_lock(si, offset); - if (cluster_is_usable(ci, order)) { - if (cluster_is_empty(ci)) - offset =3D cluster_offset(si, ci); - alloc_swap_scan_cluster(si, ci, folio, offset); - } else { - swap_cluster_unlock(ci); - } - - put_swap_device(si); - return folio_test_swapcache(folio); -} - /* Rotate the device and switch to a new cluster */ -static void swap_alloc_slow(struct folio *folio) +static void swap_alloc_entry(struct folio *folio) { struct swap_info_struct *si, *next; int mask =3D folio_tier_effective_mask(folio); @@ -1362,6 +1324,7 @@ static void swap_alloc_slow(struct folio *folio) if (get_swap_device_info(si)) { cluster_alloc_swap_entry(si, folio); put_swap_device(si); + if (folio_test_swapcache(folio)) return; if (folio_test_large(folio)) @@ -1521,11 +1484,7 @@ int folio_alloc_swap(struct folio *folio) } =20 again: - local_lock(&percpu_swap_cluster.lock); - if (!swap_alloc_fast(folio)) - swap_alloc_slow(folio); - local_unlock(&percpu_swap_cluster.lock); - + swap_alloc_entry(folio); if (!order && unlikely(!folio_test_swapcache(folio))) { if (swap_sync_discard()) goto again; @@ -1944,9 +1903,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type) * Grab the local lock to be compliant * with swap table allocation. */ - local_lock(&percpu_swap_cluster.lock); offset =3D cluster_alloc_swap_entry(si, NULL); - local_unlock(&percpu_swap_cluster.lock); if (offset) entry =3D swp_entry(si->type, offset); } @@ -2750,28 +2707,6 @@ static void free_cluster_info(struct swap_cluster_in= fo *cluster_info, kvfree(cluster_info); } =20 -/* - * Called after swap device's reference count is dead, so - * neither scan nor allocation will use it. - */ -static void flush_percpu_swap_cluster(struct swap_info_struct *si) -{ - int cpu, i; - struct swap_info_struct **pcp_si; - - for_each_possible_cpu(cpu) { - pcp_si =3D per_cpu_ptr(percpu_swap_cluster.si, cpu); - /* - * Invalidate the percpu swap cluster cache, si->users - * is dead, so no new user will point to it, just flush - * any existing user. 
-                 */
-                for (i = 0; i < SWAP_NR_ORDERS; i++)
-                        cmpxchg(&pcp_si[i], si, NULL);
-        }
-}
-
-
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
         struct swap_info_struct *p = NULL;
@@ -2855,7 +2790,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)

         flush_work(&p->discard_work);
         flush_work(&p->reclaim_work);
-        flush_percpu_swap_cluster(p);

         destroy_swap_extents(p);
         if (p->flags & SWP_CONTINUED)
@@ -2884,6 +2818,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
         arch_swap_invalidate_area(p->type);
         zswap_swapoff(p->type);
         mutex_unlock(&swapon_mutex);
+        free_percpu(p->percpu_cluster);
+        p->percpu_cluster = NULL;
         kfree(p->global_cluster);
         p->global_cluster = NULL;
         vfree(swap_map);
@@ -3267,7 +3203,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 {
         unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
         struct swap_cluster_info *cluster_info;
-        int err = -ENOMEM;
+        int cpu, err = -ENOMEM;
         unsigned long i;

         cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
@@ -3277,14 +3213,27 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
         for (i = 0; i < nr_clusters; i++)
                 spin_lock_init(&cluster_info[i].lock);

-        if (!(si->flags & SWP_SOLIDSTATE)) {
+        if (si->flags & SWP_SOLIDSTATE) {
+                si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+                if (!si->percpu_cluster)
+                        goto err;
+
+                for_each_possible_cpu(cpu) {
+                        struct percpu_cluster *cluster;
+
+                        cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+                        for (i = 0; i < SWAP_NR_ORDERS; i++)
+                                cluster->next[i] = SWAP_ENTRY_INVALID;
+                        local_lock_init(&cluster->lock);
+                }
+        } else {
                 si->global_cluster = kmalloc(sizeof(*si->global_cluster),
                                              GFP_KERNEL);
                 if (!si->global_cluster)
                         goto err;
                 for (i = 0; i < SWAP_NR_ORDERS; i++)
                         si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
-                spin_lock_init(&si->global_cluster_lock);
+                spin_lock_init(&si->global_cluster->lock);
         }

         /*
@@ -3565,6 +3514,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
         inode_unlock(inode);
 bad_swap:
+        free_percpu(si->percpu_cluster);
+        si->percpu_cluster = NULL;
         kfree(si->global_cluster);
         si->global_cluster = NULL;
         inode = NULL;
-- 
2.34.1
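For illustration only (this is not part of the patch): the hunks above move the per-CPU "next cluster" hint out of the single global percpu_swap_cluster and into each swap device, kept per CPU and per allocation order. The following is a minimal userspace sketch of that layout; the names (swap_device_model, remember_next) are invented for the sketch, and plain arrays stand in for the kernel's percpu and local_lock machinery.

/*
 * Userspace model: each swap device keeps its own per-CPU, per-order
 * "next cluster" hint, so a hint cached for one device can never be
 * clobbered by allocations that happen to land on another device.
 */
#include <stdio.h>

#define NR_CPUS       4
#define NR_ORDERS     3     /* stand-in for SWAP_NR_ORDERS */
#define ENTRY_INVALID 0u    /* stand-in for SWAP_ENTRY_INVALID */

struct percpu_cluster_model {
        unsigned int next[NR_ORDERS];   /* likely next allocation offset */
};

struct swap_device_model {
        const char *name;
        struct percpu_cluster_model pcp[NR_CPUS]; /* one hint set per CPU */
};

static void device_init(struct swap_device_model *dev, const char *name)
{
        dev->name = name;
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                for (int o = 0; o < NR_ORDERS; o++)
                        dev->pcp[cpu].next[o] = ENTRY_INVALID;
}

/* Record where the next allocation of this order on this CPU should look. */
static void remember_next(struct swap_device_model *dev, int cpu, int order,
                          unsigned int next)
{
        dev->pcp[cpu].next[order] = next;
}

int main(void)
{
        struct swap_device_model nvme, hdd;

        device_init(&nvme, "nvme-swap");
        device_init(&hdd, "hdd-swap");

        /* CPU 2 keeps independent hints for each device and each order. */
        remember_next(&nvme, 2, 0, 512);
        remember_next(&hdd, 2, 0, 64);

        printf("%s cpu2 order0 next=%u\n", nvme.name, nvme.pcp[2].next[0]);
        printf("%s cpu2 order0 next=%u\n", hdd.name, hdd.pcp[2].next[0]);
        return 0;
}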
From nobody Sun Feb 8 12:14:37 2026
From: Youngjun Park
To: akpm@linux-foundation.org
Cc: chrisl@kernel.org, kasong@tencent.com, hannes@cmpxchg.org,
    mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
    muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com,
    bhe@redhat.com, baohua@kernel.org, cgroups@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, gunho.lee@lge.com,
    youngjun.park@lge.com, taejoon.song@lge.com
Subject: [RFC PATCH v3 5/5] mm, swap: introduce percpu swap device cache to avoid fragmentation
Date: Sat, 31 Jan 2026 21:54:54 +0900
Message-Id: <20260131125454.3187546-6-youngjun.park@lge.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260131125454.3187546-1-youngjun.park@lge.com>
References: <20260131125454.3187546-1-youngjun.park@lge.com>

In the previous commit that introduced per-device percpu clusters, the
allocation logic caused swap device rotation on every allocation when
multiple swap devices share the same priority. This led to cluster
fragmentation on every allocation attempt.

To address this issue, this patch introduces a per-CPU swap device cache,
restoring the allocation behavior to closely match the traditional
fast-path and slow-path flow.

With swap tiers, cluster fragmentation can still occur when a CPU's cached
swap device does not belong to the tier required for the current
allocation; this is the intended behavior for tier-based allocation.

With swap tiers and same-priority swap devices, the slow path triggers
device rotation and causes initial cluster fragmentation. However, once a
cluster is allocated, subsequent allocations keep using that cluster until
it is exhausted, preventing repeated fragmentation. While this may not be
severe, there is room for future optimization.

Signed-off-by: Youngjun Park
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 87 +++++++++++++++++++++++++++++++++++---------
 2 files changed, 69 insertions(+), 19 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6921e22b14d3..ac634a21683a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,7 +253,6 @@ enum {
  * throughput.
  */
 struct percpu_cluster {
-        local_lock_t lock; /* Protect the percpu_cluster above */
         unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4708014c96c4..fc1f64eaa8fe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -106,6 +106,16 @@ PLIST_HEAD(swap_active_head);
 static PLIST_HEAD(swap_avail_head);
 static DEFINE_SPINLOCK(swap_avail_lock);

+struct percpu_swap_device {
+        struct swap_info_struct *si[SWAP_NR_ORDERS];
+        local_lock_t lock;
+};
+
+static DEFINE_PER_CPU(struct percpu_swap_device, percpu_swap_device) = {
+        .si = { NULL },
+        .lock = INIT_LOCAL_LOCK(),
+};
+
 struct swap_info_struct *swap_info[MAX_SWAPFILES];

 static struct kmem_cache *swap_table_cachep;
@@ -465,10 +475,8 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
          * Swap allocator uses percpu clusters and holds the local lock.
          */
         lockdep_assert_held(&ci->lock);
-        if (si->flags & SWP_SOLIDSTATE)
-                lockdep_assert_held(this_cpu_ptr(&si->percpu_cluster->lock));
-        else
-                lockdep_assert_held(&si->global_cluster->lock);
+        lockdep_assert_held(this_cpu_ptr(&percpu_swap_device.lock));
+
         /* The cluster must be free and was just isolated from the free list. */
         VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci));

@@ -484,10 +492,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
          * the potential recursive allocation is limited.
          */
         spin_unlock(&ci->lock);
-        if (si->flags & SWP_SOLIDSTATE)
-                local_unlock(&si->percpu_cluster->lock);
-        else
-                spin_unlock(&si->global_cluster->lock);
+        local_unlock(&percpu_swap_device.lock);

         table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL);

@@ -499,7 +504,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
          * could happen with ignoring the percpu cluster is fragmentation,
          * which is acceptable since this fallback and race is rare.
          */
-        local_lock(&si->percpu_cluster->lock);
+        local_lock(&percpu_swap_device.lock);
         if (!(si->flags & SWP_SOLIDSTATE))
                 spin_lock(&si->global_cluster->lock);
         spin_lock(&ci->lock);
@@ -944,9 +949,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
         relocate_cluster(si, ci);
         swap_cluster_unlock(ci);
-        if (si->flags & SWP_SOLIDSTATE)
+        if (si->flags & SWP_SOLIDSTATE) {
                 this_cpu_write(si->percpu_cluster->next[order], next);
-        else
+                this_cpu_write(percpu_swap_device.si[order], si);
+        } else
                 si->global_cluster->next[order] = next;

         return found;
@@ -1044,7 +1050,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,

         if (si->flags & SWP_SOLIDSTATE) {
                 /* Fast path using per CPU cluster */
-                local_lock(&si->percpu_cluster->lock);
                 offset = __this_cpu_read(si->percpu_cluster->next[order]);
         } else {
                 /* Serialize HDD SWAP allocation for each device. */
@@ -1122,9 +1127,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
                 goto done;
         }
 done:
-        if (si->flags & SWP_SOLIDSTATE)
-                local_unlock(&si->percpu_cluster->lock);
-        else
+        if (!(si->flags & SWP_SOLIDSTATE))
                 spin_unlock(&si->global_cluster->lock);

         return found;
@@ -1306,8 +1309,29 @@ static bool get_swap_device_info(struct swap_info_struct *si)
         return true;
 }

+static bool swap_alloc_fast(struct folio *folio)
+{
+        unsigned int order = folio_order(folio);
+        struct swap_info_struct *si;
+        int mask = folio_tier_effective_mask(folio);
+
+        /*
+         * Once allocated, swap_info_struct will never be completely freed,
+         * so checking its liveness by get_swap_device_info is enough.
+         */
+        si = this_cpu_read(percpu_swap_device.si[order]);
+        if (!si || !swap_tiers_mask_test(si->tier_mask, mask) ||
+            !get_swap_device_info(si))
+                return false;
+
+        cluster_alloc_swap_entry(si, folio);
+        put_swap_device(si);
+
+        return folio_test_swapcache(folio);
+}
+
 /* Rotate the device and switch to a new cluster */
-static void swap_alloc_entry(struct folio *folio)
+static void swap_alloc_slow(struct folio *folio)
 {
         struct swap_info_struct *si, *next;
         int mask = folio_tier_effective_mask(folio);
@@ -1484,7 +1508,11 @@ int folio_alloc_swap(struct folio *folio)
         }

 again:
-        swap_alloc_entry(folio);
+        local_lock(&percpu_swap_device.lock);
+        if (!swap_alloc_fast(folio))
+                swap_alloc_slow(folio);
+        local_unlock(&percpu_swap_device.lock);
+
         if (!order && unlikely(!folio_test_swapcache(folio))) {
                 if (swap_sync_discard())
                         goto again;
@@ -1903,7 +1931,9 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
                          * Grab the local lock to be compliant
                          * with swap table allocation.
                          */
+                        local_lock(&percpu_swap_device.lock);
                         offset = cluster_alloc_swap_entry(si, NULL);
+                        local_unlock(&percpu_swap_device.lock);
                         if (offset)
                                 entry = swp_entry(si->type, offset);
                 }
@@ -2707,6 +2737,27 @@ static void free_cluster_info(struct swap_cluster_info *cluster_info,
         kvfree(cluster_info);
 }

+/*
+ * Called after swap device's reference count is dead, so
+ * neither scan nor allocation will use it.
+ */
+static void flush_percpu_swap_device(struct swap_info_struct *si)
+{
+        int cpu, i;
+        struct swap_info_struct **pcp_si;
+
+        for_each_possible_cpu(cpu) {
+                pcp_si = per_cpu_ptr(percpu_swap_device.si, cpu);
+                /*
+                 * Invalidate the percpu swap device cache, si->users
+                 * is dead, so no new user will point to it, just flush
+                 * any existing user.
+                 */
+                for (i = 0; i < SWAP_NR_ORDERS; i++)
+                        cmpxchg(&pcp_si[i], si, NULL);
+        }
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
         struct swap_info_struct *p = NULL;
@@ -2790,6 +2841,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)

         flush_work(&p->discard_work);
         flush_work(&p->reclaim_work);
+        flush_percpu_swap_device(p);

         destroy_swap_extents(p);
         if (p->flags & SWP_CONTINUED)
@@ -3224,7 +3276,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
                         cluster = per_cpu_ptr(si->percpu_cluster, cpu);
                         for (i = 0; i < SWAP_NR_ORDERS; i++)
                                 cluster->next[i] = SWAP_ENTRY_INVALID;
-                        local_lock_init(&cluster->lock);
                 }
         } else {
                 si->global_cluster = kmalloc(sizeof(*si->global_cluster),
-- 
2.34.1
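For illustration only (this is not part of the patch): a compact userspace model of the per-CPU swap device cache described above. The names here (device_cache, cache_lookup, cache_remember, cache_flush_device) are invented for the sketch, and C11 atomics stand in for this_cpu_read/this_cpu_write/cmpxchg; it only shows the fast-path lookup, the tier-mask check, and the swapoff-time invalidation.

/*
 * Userspace model: device_cache[cpu][order] remembers the last swap
 * device that served this CPU and order. The fast path reuses it when
 * it still matches the wanted tier mask; swapoff clears any slot still
 * pointing at the dying device with a compare-and-swap.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS   4
#define NR_ORDERS 3

struct swap_device {
        const char *name;
        int tier_mask;   /* which tiers this device belongs to */
};

/* Per-CPU, per-order cache of the last device used for an allocation. */
static _Atomic(struct swap_device *) device_cache[NR_CPUS][NR_ORDERS];

/* Fast path: reuse the cached device if it matches the wanted tiers. */
static struct swap_device *cache_lookup(int cpu, int order, int wanted_mask)
{
        struct swap_device *si = atomic_load(&device_cache[cpu][order]);

        if (si && (si->tier_mask & wanted_mask))
                return si;
        return NULL;   /* fall back to the slow path (device rotation) */
}

static void cache_remember(int cpu, int order, struct swap_device *si)
{
        atomic_store(&device_cache[cpu][order], si);
}

/* swapoff: flush every per-CPU slot still pointing at the dying device. */
static void cache_flush_device(struct swap_device *si)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                for (int o = 0; o < NR_ORDERS; o++) {
                        struct swap_device *expected = si;

                        atomic_compare_exchange_strong(&device_cache[cpu][o],
                                                       &expected, NULL);
                }
}

int main(void)
{
        struct swap_device nvme = { .name = "nvme-swap", .tier_mask = 0x1 };

        cache_remember(0, 0, &nvme);
        printf("hit: %s\n", cache_lookup(0, 0, 0x1) ? "yes" : "no");

        cache_flush_device(&nvme);
        printf("hit after flush: %s\n", cache_lookup(0, 0, 0x1) ? "yes" : "no");
        return 0;
}

The compare-and-swap style invalidation mirrors why the patch flushes the cache only after the device's reference count is dead: no new user can cache the device at that point, so clearing the stale per-CPU pointers is sufficient.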