From: Youngjun Park
To: Andrew Morton
Cc: Chris Li, Youngjun Park, linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, kasong@tencent.com, hannes@cmpxchg.org,
 mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
 muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com,
 bhe@redhat.com, baohua@kernel.org, gunho.lee@lge.com, taejoon.song@lge.com,
 hyungjun.cho@lge.com, mkoutny@suse.com
Subject: [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure
Date: Thu, 26 Mar 2026 02:54:50 +0900
Message-Id: <20260325175453.2523280-2-youngjun.park@lge.com>
In-Reply-To: <20260325175453.2523280-1-youngjun.park@lge.com>
References: <20260325175453.2523280-1-youngjun.park@lge.com>

Introduce the "swap tier" concept, an abstraction layer for managing
swap devices based on their performance characteristics (e.g., NVMe,
HDD, network swap).

Swap tiers are user-named groups, each representing a priority range.
Tier names must consist of alphanumeric characters and underscores.
Together, the tiers cover the entire priority space from -1
(`DEF_SWAP_PRIO`) to `SHRT_MAX`.

To configure tiers, expose a new sysfs interface at
/sys/kernel/mm/swap/tiers. The input parser evaluates commands from
left to right and supports batch input: a single write may add
("+name:prio") or remove ("-name") several tiers, with commands
separated by commas or whitespace.

Tier management enforces contiguous priority ranges, each anchored by
its start priority. Adding or removing a tier splits or merges ranges,
but an existing start priority can never be overwritten. On removal,
the next lower tier expands upwards so that the configured start
priorities are preserved. The exception is removing the tier that
starts at `DEF_SWAP_PRIO`: the next higher tier is then extended
downwards to `DEF_SWAP_PRIO`.

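For illustration only (this is not code from the patch), the range rule
described above can be sketched in userspace C. The sketch assumes tiers
are kept sorted by descending start priority, as the patch's active list
is. The names `tier_model`, `tier_model_end()` and
`tier_model_covers_all()` are hypothetical stand-ins:
`tier_model_end()` mirrors `TIER_END_PRIO()`, and
`tier_model_covers_all()` mirrors the check in `swap_tiers_validate()`.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for DEF_SWAP_PRIO. */
#define MODEL_DEF_PRIO (-1)

/* Hypothetical userspace model of a tier: a name and a start priority. */
struct tier_model {
	const char *name;
	short prio;
};

/*
 * End priority of tiers[i], where tiers[] is sorted by descending start
 * priority: the highest tier ends at SHRT_MAX, every other tier ends
 * just below the start of the tier above it.
 */
static int tier_model_end(const struct tier_model *tiers, int i)
{
	assert(i >= 0);
	return i == 0 ? SHRT_MAX : tiers[i - 1].prio - 1;
}

/*
 * An empty configuration is accepted; a non-empty one covers the whole
 * priority space only if its lowest tier starts at MODEL_DEF_PRIO.
 */
static bool tier_model_covers_all(const struct tier_model *tiers, int n)
{
	return n == 0 || tiers[n - 1].prio == MODEL_DEF_PRIO;
}
```

With tiers fast:100, mid:10 and slow:-1, this model yields the ranges
100..SHRT_MAX, 10..99 and -1..9. A configuration that only ever defined
fast and mid leaves -1..9 uncovered and fails the coverage check.
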
Suggested-by: Chris Li
Signed-off-by: Youngjun Park
---
 MAINTAINERS     |   2 +
 mm/Kconfig      |  12 ++
 mm/Makefile     |   2 +-
 mm/swap.h       |   4 +
 mm/swap_state.c |  74 +++++++++++++
 mm/swap_tier.c  | 285 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_tier.h  |  20 ++++
 mm/swapfile.c   |   8 +-
 8 files changed, 403 insertions(+), 4 deletions(-)
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 76431aa5efbe..f3b07f1fa38a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16916,6 +16916,8 @@ F:	mm/swap.c
 F:	mm/swap.h
 F:	mm/swap_table.h
 F:	mm/swap_state.c
+F:	mm/swap_tier.c
+F:	mm/swap_tier.h
 F:	mm/swapfile.c
 
 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
diff --git a/mm/Kconfig b/mm/Kconfig
index bd283958d675..b645e9430af5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,6 +19,18 @@ menuconfig SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer. If unsure say Y.
 
+config NR_SWAP_TIERS
+	int "Number of swap device tiers"
+	depends on SWAP
+	default 4
+	range 1 32
+	help
+	  Sets the number of swap device tiers. Swap devices are
+	  grouped into tiers based on their priority, allowing the
+	  system to prefer faster devices over slower ones.
+
+	  If unsure, say 4.
+
 config ZSWAP
 	bool "Compressed cache for swap pages"
 	depends on SWAP
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..db6449f84991 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
 endif
 
-obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o swap_tier.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
diff --git a/mm/swap.h b/mm/swap.h
index a77016f2423b..fda8363bee73 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -16,6 +16,10 @@ extern int page_cluster;
 #define swap_entry_order(order)	0
 #endif
 
+#define DEF_SWAP_PRIO		-1
+
+extern spinlock_t swap_lock;
+extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1415a5c54a43..bfdc0208e081 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,6 +25,7 @@
 #include "internal.h"
 #include "swap_table.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -924,8 +925,81 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
 }
 static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
 
+static ssize_t tiers_show(struct kobject *kobj,
+			  struct kobj_attribute *attr, char *buf)
+{
+	return swap_tiers_sysfs_show(buf);
+}
+
+static ssize_t tiers_store(struct kobject *kobj,
+			   struct kobj_attribute *attr,
+			   const char *buf, size_t count)
+{
+	char *p, *token, *name, *tmp;
+	int ret = 0;
+	short prio;
+
+	tmp = kstrdup(buf, GFP_KERNEL);
+	if (!tmp)
+		return -ENOMEM;
+
+	spin_lock(&swap_lock);
+	spin_lock(&swap_tier_lock);
+	swap_tiers_snapshot();
+
+	p = tmp;
+	while ((token = strsep(&p, ", \t\n")) != NULL) {
+		if (!*token)
+			continue;
+
+		switch (token[0]) {
+		case '+':
+			name = token + 1;
+			token = strchr(name, ':');
+			if (!token) {
+				ret = -EINVAL;
+				goto restore;
+			}
+			*token++ = '\0';
+			if (kstrtos16(token, 10, &prio)) {
+				ret = -EINVAL;
+				goto restore;
+			}
+			ret = swap_tiers_add(name, prio);
+			if (ret)
+				goto restore;
+			break;
+		case '-':
+			ret = swap_tiers_remove(token + 1);
+			if (ret)
+				goto restore;
+			break;
+		default:
+			ret = -EINVAL;
+			goto restore;
+		}
+	}
+
+	if (!swap_tiers_validate()) {
+		ret = -EINVAL;
+		goto restore;
+	}
+	goto out;
+
+restore:
+	swap_tiers_snapshot_restore();
+out:
+	spin_unlock(&swap_tier_lock);
+	spin_unlock(&swap_lock);
+	kfree(tmp);
+	return ret ? ret : count;
+}
+
+static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
+
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+	&tier_attr.attr,
 	NULL,
 };
 
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
new file mode 100644
index 000000000000..62b60fa8d3b7
--- /dev/null
+++ b/mm/swap_tier.c
@@ -0,0 +1,285 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include "memcontrol-v1.h"
+#include
+#include
+
+#include "swap.h"
+#include "swap_tier.h"
+
+#define MAX_SWAPTIER		CONFIG_NR_SWAP_TIERS
+#define MAX_TIERNAME		16
+
+/*
+ * struct swap_tier - structure representing a swap tier.
+ *
+ * @name: name of the swap_tier.
+ * @prio: starting value of priority.
+ * @list: linked list of tiers.
+ */
+static struct swap_tier {
+	char name[MAX_TIERNAME];
+	short prio;
+	struct list_head list;
+} swap_tiers[MAX_SWAPTIER];
+
+DEFINE_SPINLOCK(swap_tier_lock);
+/* active swap priority list, sorted in descending order */
+static LIST_HEAD(swap_tier_active_list);
+/* unused swap_tier object */
+static LIST_HEAD(swap_tier_inactive_list);
+
+#define TIER_IDX(tier)	((tier) - swap_tiers)
+#define TIER_MASK(tier)	(1 << TIER_IDX(tier))
+#define TIER_INACTIVE_PRIO (DEF_SWAP_PRIO - 1)
+#define TIER_IS_ACTIVE(tier) ((tier)->prio != TIER_INACTIVE_PRIO)
+#define TIER_END_PRIO(tier) \
+	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
+	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
+
+#define for_each_tier(tier, idx) \
+	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
+		idx++, tier = &swap_tiers[idx])
+
+#define for_each_active_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_active_list, list)
+
+#define for_each_inactive_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_inactive_list, list)
+
+/*
+ * Naming Convention:
+ *   swap_tiers_*() - Public/exported functions
+ *   swap_tier_*()  - Private/internal functions
+ */
+
+static bool swap_tier_is_active(void)
+{
+	return !list_empty(&swap_tier_active_list);
+}
+
+static struct swap_tier *swap_tier_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (!strcmp(tier->name, name))
+			return tier;
+	}
+
+	return NULL;
+}
+
+/* Insert new tier into the active list sorted by priority. */
+static void swap_tier_activate(struct swap_tier *new)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio <= new->prio)
+			break;
+	}
+
+	list_add_tail(&new->list, &tier->list);
+}
+
+static void swap_tier_inactivate(struct swap_tier *tier)
+{
+	list_move(&tier->list, &swap_tier_inactive_list);
+	tier->prio = TIER_INACTIVE_PRIO;
+}
+
+void swap_tiers_init(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+
+	for_each_tier(tier, idx) {
+		INIT_LIST_HEAD(&tier->list);
+		swap_tier_inactivate(tier);
+	}
+}
+
+ssize_t swap_tiers_sysfs_show(char *buf)
+{
+	struct swap_tier *tier;
+	ssize_t len = 0;
+
+	len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
+			     "Name", "Idx", "PrioStart", "PrioEnd");
+
+	spin_lock(&swap_tier_lock);
+	for_each_active_tier(tier) {
+		len += sysfs_emit_at(buf, len, "%-16s %-5td %-11d %-11d\n",
+				     tier->name,
+				     TIER_IDX(tier),
+				     tier->prio,
+				     TIER_END_PRIO(tier));
+	}
+	spin_unlock(&swap_tier_lock);
+
+	return len;
+}
+
+static struct swap_tier *swap_tier_prepare(const char *name, short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	if (prio < DEF_SWAP_PRIO)
+		return ERR_PTR(-EINVAL);
+
+	if (list_empty(&swap_tier_inactive_list))
+		return ERR_PTR(-ENOSPC);
+
+	tier = list_first_entry(&swap_tier_inactive_list,
+		struct swap_tier, list);
+
+	list_del_init(&tier->list);
+	strscpy(tier->name, name, MAX_TIERNAME);
+	tier->prio = prio;
+
+	return tier;
+}
+
+static int swap_tier_check_range(short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		/* No overwrite */
+		if (tier->prio == prio)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static bool swap_tier_validate_name(const char *name)
+{
+	if (!name || !*name)
+		return false;
+
+	while (*name) {
+		if (!isalnum(*name) && *name != '_')
+			return false;
+		name++;
+	}
+	return true;
+}
+
+int swap_tiers_add(const char *name, int prio)
+{
+	int ret;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	/* Duplicate check */
+	if (swap_tier_lookup(name))
+		return -EEXIST;
+
+	if (!swap_tier_validate_name(name))
+		return -EINVAL;
+
+	ret = swap_tier_check_range(prio);
+	if (ret)
+		return ret;
+
+	tier = swap_tier_prepare(name, prio);
+	if (IS_ERR(tier)) {
+		ret = PTR_ERR(tier);
+		return ret;
+	}
+
+	swap_tier_activate(tier);
+
+	return ret;
+}
+
+int swap_tiers_remove(const char *name)
+{
+	int ret = 0;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	tier = swap_tier_lookup(name);
+	if (!tier)
+		return -EINVAL;
+
+	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
+	if (!list_is_singular(&swap_tier_active_list)
+		&& tier->prio == DEF_SWAP_PRIO)
+		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
+
+	swap_tier_inactivate(tier);
+
+	return ret;
+}
+
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+/*
+ * XXX: When multiple operations (adds and removes) are submitted in a
+ * single write, reverting each individually on failure is complex and
+ * error-prone. Instead, snapshot the entire state beforehand and
+ * restore it wholesale if any operation fails.
+ */
+void swap_tiers_snapshot(void)
+{
+	BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers_snap, swap_tiers, sizeof(swap_tiers));
+}
+
+void swap_tiers_snapshot_restore(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers, swap_tiers_snap, sizeof(swap_tiers));
+
+	INIT_LIST_HEAD(&swap_tier_active_list);
+	INIT_LIST_HEAD(&swap_tier_inactive_list);
+
+	for_each_tier(tier, idx) {
+		if (TIER_IS_ACTIVE(tier))
+			swap_tier_activate(tier);
+		else
+			list_add(&tier->list, &swap_tier_inactive_list);
+	}
+}
+
+bool swap_tiers_validate(void)
+{
+	struct swap_tier *tier;
+
+	/*
+	 * Initial setting might not cover DEF_SWAP_PRIO.
+	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
+	 */
+	if (swap_tier_is_active()) {
+		tier = list_last_entry(&swap_tier_active_list,
+			struct swap_tier, list);
+
+		if (tier->prio != DEF_SWAP_PRIO)
+			return false;
+	}
+
+	return true;
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
new file mode 100644
index 000000000000..a1395ec02c24
--- /dev/null
+++ b/mm/swap_tier.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_TIER_H
+#define _SWAP_TIER_H
+
+#include
+#include
+
+extern spinlock_t swap_tier_lock;
+
+/* Initialization and application */
+void swap_tiers_init(void);
+ssize_t swap_tiers_sysfs_show(char *buf);
+
+int swap_tiers_add(const char *name, int prio);
+int swap_tiers_remove(const char *name);
+
+void swap_tiers_snapshot(void);
+void swap_tiers_snapshot_restore(void);
+bool swap_tiers_validate(void);
+#endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ff315b752afd..03bf2a0a42ac 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -49,6 +49,7 @@
 #include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
@@ -64,7 +65,8 @@ static void move_cluster(struct swap_info_struct *si,
  *
  * Also protects swap_active_head total_swap_pages, and the SWP_WRITEOK flag.
  */
-static DEFINE_SPINLOCK(swap_lock);
+DEFINE_SPINLOCK(swap_lock);
+
 static unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
 /*
@@ -75,7 +77,6 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-#define DEF_SWAP_PRIO -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -88,7 +89,7 @@ static const char Bad_offset[] = "Bad swap offset entry ";
  * all active swap_info_structs
  * protected with swap_lock, and ordered by priority.
  */
-static PLIST_HEAD(swap_active_head);
+PLIST_HEAD(swap_active_head);
 
 /*
  * all available (active, not full) swap_info_structs
@@ -3890,6 +3891,7 @@ static int __init swapfile_init(void)
 	swap_migration_ad_supported = true;
 #endif /* CONFIG_MIGRATION */
 
+	swap_tiers_init();
 	return 0;
 }
 subsys_initcall(swapfile_init);
-- 
2.34.1