From nobody Mon Jun 22 07:37:01 2026
From: Youngjun Park
To: akpm@linux-foundation.org
Cc: chrisl@kernel.org, youngjun.park@lge.com, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kasong@tencent.com,
 hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com,
 nphamcs@gmail.com, baoquan.he@linux.dev, baohua@kernel.org, yosry@kernel.org,
 gunho.lee@lge.com, taejoon.song@lge.com, hyungjun.cho@lge.com,
 mkoutny@suse.com, baver.bae@lge.com, matia.kim@lge.com
Subject: [PATCH v9 1/6] mm: swap: introduce swap tier infrastructure
Date: Sun, 21 Jun 2026 03:16:26 +0900
Message-ID: <20260620181635.299364-2-youngjun.park@lge.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20260620181635.299364-1-youngjun.park@lge.com>
References: <20260620181635.299364-1-youngjun.park@lge.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

This patch introduces the "Swap tier" concept, which serves as an
abstraction layer for managing swap devices based on their performance
characteristics (e.g., NVMe, HDD, Network swap).

Swap tiers are user-named groups representing priority ranges.
Tier names must consist of alphanumeric characters and underscores.
These tiers collectively cover the entire priority space from -1
(`DEF_SWAP_PRIO`) to `SHRT_MAX`.

To configure tiers, a new sysfs interface is exposed at
/sys/kernel/mm/swap/tiers. The input parser evaluates commands from
left to right and supports batch input, allowing users to add or remove
multiple tiers in a single write operation.

Tier management enforces continuous priority ranges anchored by start
priorities. Operations trigger range splitting or merging, but overwriting
start priorities is forbidden. Merging expands lower tiers upwards to
preserve configured start priorities, except when removing `DEF_SWAP_PRIO`,
which merges downwards.

Suggested-by: Chris Li
Reviewed-by: Baoquan He
Signed-off-by: Youngjun Park
---
 MAINTAINERS     |   2 +
 mm/Kconfig      |  12 ++
 mm/Makefile     |   2 +-
 mm/swap.h       |   4 +
 mm/swap_state.c |  74 ++++++++++++
 mm/swap_tier.c  | 302 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_tier.h  |  20 ++++
 mm/swapfile.c   |   8 +-
 8 files changed, 420 insertions(+), 4 deletions(-)
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 65bd4328fe05..d1bb3b4b1e1c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17060,6 +17060,8 @@ F:	mm/swap.c
 F:	mm/swap.h
 F:	mm/swap_table.h
 F:	mm/swap_state.c
+F:	mm/swap_tier.c
+F:	mm/swap_tier.h
 F:	mm/swapfile.c
 
 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
diff --git a/mm/Kconfig b/mm/Kconfig
index 776b67c66e82..5343937f3da9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,6 +19,18 @@ menuconfig SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer. If unsure say Y.
 
+config NR_SWAP_TIERS
+	int "Number of swap device tiers"
+	depends on SWAP
+	default 4
+	range 1 31
+	help
+	  Sets the number of swap device tiers. Swap devices are
+	  grouped into tiers based on their priority, allowing the
+	  system to prefer faster devices over slower ones.
+
+	  If unsure, say 4.
+
 config ZSWAP
 	bool "Compressed cache for swap pages"
 	depends on SWAP
diff --git a/mm/Makefile b/mm/Makefile
index eff9f9e7e061..29cb1e778285 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
 endif
 
-obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o swap_tier.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
diff --git a/mm/swap.h b/mm/swap.h
index 77d2d14eda42..d6c5f5d31f63 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -34,6 +34,10 @@ extern int page_cluster;
 #define swap_entry_order(order) 0
 #endif
 
+#define DEF_SWAP_PRIO -1
+
+extern spinlock_t swap_lock;
+extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 9c3a5cf99778..762d9ca6ad5a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,6 +25,7 @@
 #include "internal.h"
 #include "swap_table.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -1007,8 +1008,81 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
 }
 static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
 
+static ssize_t tiers_show(struct kobject *kobj,
+			  struct kobj_attribute *attr, char *buf)
+{
+	return swap_tiers_sysfs_show(buf);
+}
+
+static ssize_t tiers_store(struct kobject *kobj,
+			   struct kobj_attribute *attr,
+			   const char *buf, size_t count)
+{
+	char *p, *token, *name, *tmp;
+	int ret = 0;
+	short prio;
+
+	tmp = kstrdup(buf, GFP_KERNEL);
+	if (!tmp)
+		return -ENOMEM;
+
+	spin_lock(&swap_lock);
+	spin_lock(&swap_tier_lock);
+	swap_tiers_snapshot();
+
+	p = tmp;
+	while ((token = strsep(&p, ", \t\n")) != NULL) {
+		if (!*token)
+			continue;
+
+		switch (token[0]) {
+		case '+':
+			name = token + 1;
+			token = strchr(name, ':');
+			if (!token) {
+				ret = -EINVAL;
+				goto restore;
+			}
+			*token++ = '\0';
+			if (kstrtos16(token, 10, &prio)) {
+				ret = -EINVAL;
+				goto restore;
+			}
+			ret = swap_tiers_add(name, prio);
+			if (ret)
+				goto restore;
+			break;
+		case '-':
+			ret = swap_tiers_remove(token + 1);
+			if (ret)
+				goto restore;
+			break;
+		default:
+			ret = -EINVAL;
+			goto restore;
+		}
+	}
+
+	if (!swap_tiers_validate()) {
+		ret = -EINVAL;
+		goto restore;
+	}
+	goto out;
+
+restore:
+	swap_tiers_snapshot_restore();
+out:
+	spin_unlock(&swap_tier_lock);
+	spin_unlock(&swap_lock);
+	kfree(tmp);
+	return ret ? ret : count;
+}
+
+static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
+
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+	&tier_attr.attr,
 	NULL,
 };
 
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
new file mode 100644
index 000000000000..ac7a3c2a48cb
--- /dev/null
+++ b/mm/swap_tier.c
@@ -0,0 +1,302 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include "memcontrol-v1.h"
+#include
+#include
+
+#include "swap.h"
+#include "swap_tier.h"
+
+#define MAX_SWAPTIER CONFIG_NR_SWAP_TIERS
+#define MAX_TIERNAME 16
+
+/*
+ * struct swap_tier - structure representing a swap tier.
+ *
+ * @name: name of the swap_tier.
+ * @prio: starting value of priority.
+ * @list: linked list of tiers.
+ */
+static struct swap_tier {
+	char name[MAX_TIERNAME];
+	short prio;
+	struct list_head list;
+} swap_tiers[MAX_SWAPTIER];
+
+DEFINE_SPINLOCK(swap_tier_lock);
+/* active swap priority list, sorted in descending order */
+static LIST_HEAD(swap_tier_active_list);
+/* unused swap_tier object */
+static LIST_HEAD(swap_tier_inactive_list);
+
+#define TIER_IDX(tier) ((tier) - swap_tiers)
+#define TIER_MASK(tier) (1U << TIER_IDX(tier))
+#define TIER_INACTIVE_PRIO (DEF_SWAP_PRIO - 1)
+#define TIER_IS_ACTIVE(tier) ((tier->prio) != TIER_INACTIVE_PRIO)
+#define TIER_END_PRIO(tier) \
+	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
+	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
+
+#define for_each_tier(tier, idx) \
+	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
+		idx++, tier = &swap_tiers[idx])
+
+#define for_each_active_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_active_list, list)
+
+#define for_each_inactive_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_inactive_list, list)
+
+/*
+ * Naming Convention:
+ * swap_tiers_*() - Public/exported functions
+ * swap_tier_*() - Private/internal functions
+ */
+
+static bool swap_tier_is_active(void)
+{
+	return !list_empty(&swap_tier_active_list);
+}
+
+static struct swap_tier *swap_tier_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (!strcmp(tier->name, name))
+			return tier;
+	}
+
+	return NULL;
+}
+
+/* Insert new tier into the active list sorted by priority. */
+static void swap_tier_activate(struct swap_tier *new)
+{
+	struct list_head *pos = &swap_tier_active_list;
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio <= new->prio) {
+			pos = &tier->list;
+			break;
+		}
+	}
+
+	list_add_tail(&new->list, pos);
+}
+
+static void swap_tier_inactivate(struct swap_tier *tier)
+{
+	list_move(&tier->list, &swap_tier_inactive_list);
+	tier->prio = TIER_INACTIVE_PRIO;
+}
+
+void swap_tiers_init(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+
+	for_each_tier(tier, idx) {
+		INIT_LIST_HEAD(&tier->list);
+		swap_tier_inactivate(tier);
+	}
+}
+
+ssize_t swap_tiers_sysfs_show(char *buf)
+{
+	struct swap_tier *tier;
+	ssize_t len = 0;
+
+	len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
+			     "Name", "Idx", "PrioStart", "PrioEnd");
+
+	spin_lock(&swap_tier_lock);
+	for_each_active_tier(tier) {
+		len += sysfs_emit_at(buf, len, "%-16s %-5td %-11d %-11d\n",
+				     tier->name,
+				     TIER_IDX(tier),
+				     tier->prio,
+				     TIER_END_PRIO(tier));
+	}
+	spin_unlock(&swap_tier_lock);
+
+	return len;
+}
+
+static struct swap_tier *swap_tier_prepare(const char *name, short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	if (prio < DEF_SWAP_PRIO)
+		return ERR_PTR(-EINVAL);
+
+	if (list_empty(&swap_tier_inactive_list))
+		return ERR_PTR(-ENOSPC);
+
+	tier = list_first_entry(&swap_tier_inactive_list,
+				struct swap_tier, list);
+
+	list_del_init(&tier->list);
+	strscpy(tier->name, name, MAX_TIERNAME);
+	tier->prio = prio;
+
+	return tier;
+}
+
+static int swap_tier_check_range(short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		/* No overwrite */
+		if (tier->prio == prio)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static bool swap_tier_validate_name(const char *name)
+{
+	int len;
+
+	if (!name || !*name)
+		return false;
+
+	len = strlen(name);
+	if (len >= MAX_TIERNAME)
+		return false;
+
+	while (*name) {
+		if (!isalnum(*name) && *name != '_')
+			return false;
+		name++;
+	}
+	return true;
+}
+
+int swap_tiers_add(const char *name, int prio)
+{
+	int ret;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	/* Duplicate check */
+	if (swap_tier_lookup(name))
+		return -EEXIST;
+
+	if (!swap_tier_validate_name(name))
+		return -EINVAL;
+
+	ret = swap_tier_check_range(prio);
+	if (ret)
+		return ret;
+
+	tier = swap_tier_prepare(name, prio);
+	if (IS_ERR(tier)) {
+		ret = PTR_ERR(tier);
+		return ret;
+	}
+
+	swap_tier_activate(tier);
+
+	return ret;
+}
+
+int swap_tiers_remove(const char *name)
+{
+	int ret = 0;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	tier = swap_tier_lookup(name);
+	if (!tier)
+		return -EINVAL;
+
+	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
+	if (!list_is_singular(&swap_tier_active_list)
+		&& tier->prio == DEF_SWAP_PRIO)
+		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
+
+	swap_tier_inactivate(tier);
+
+	return ret;
+}
+
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+/*
+ * XXX: When multiple operations (adds and removes) are submitted in a
+ * single write, reverting each individually on failure is complex and
+ * error-prone. Instead, snapshot the entire state beforehand and
+ * restore it wholesale if any operation fails.
+ */
+void swap_tiers_snapshot(void)
+{
+	BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers_snap, swap_tiers, sizeof(swap_tiers));
+}
+
+void swap_tiers_snapshot_restore(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers, swap_tiers_snap, sizeof(swap_tiers));
+
+	INIT_LIST_HEAD(&swap_tier_active_list);
+	INIT_LIST_HEAD(&swap_tier_inactive_list);
+
+	/*
+	 * memcpy copied snapshot-time list pointers into each tier's
+	 * list_head. Those references are stale, so re-init every
+	 * tier before re-linking into the freshly initialised global
+	 * lists below.
+	 */
+	for_each_tier(tier, idx) {
+		INIT_LIST_HEAD(&tier->list);
+
+		if (TIER_IS_ACTIVE(tier))
+			swap_tier_activate(tier);
+		else
+			swap_tier_inactivate(tier);
+	}
+}
+
+bool swap_tiers_validate(void)
+{
+	struct swap_tier *tier;
+
+	/*
+	 * Initial setting might not cover DEF_SWAP_PRIO.
+	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
+	 */
+	if (swap_tier_is_active()) {
+		tier = list_last_entry(&swap_tier_active_list,
+				       struct swap_tier, list);
+
+		if (tier->prio != DEF_SWAP_PRIO)
+			return false;
+	}
+
+	return true;
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
new file mode 100644
index 000000000000..a1395ec02c24
--- /dev/null
+++ b/mm/swap_tier.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_TIER_H
+#define _SWAP_TIER_H
+
+#include
+#include
+
+extern spinlock_t swap_tier_lock;
+
+/* Initialization and application */
+void swap_tiers_init(void);
+ssize_t swap_tiers_sysfs_show(char *buf);
+
+int swap_tiers_add(const char *name, int prio);
+int swap_tiers_remove(const char *name);
+
+void swap_tiers_snapshot(void);
+void swap_tiers_snapshot_restore(void);
+bool swap_tiers_validate(void);
+#endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3d126602a1e..3f7225dbc6cd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -48,6 +48,7 @@
 #include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
@@ -63,7 +64,8 @@ static void move_cluster(struct swap_info_struct *si,
  *
  * Also protects swap_active_head total_swap_pages, and the SWP_WRITEOK flag.
  */
-static DEFINE_SPINLOCK(swap_lock);
+DEFINE_SPINLOCK(swap_lock);
+
 static unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
 /*
@@ -74,7 +76,6 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-#define DEF_SWAP_PRIO -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -87,7 +88,7 @@ static const char Bad_offset[] = "Bad swap offset entry ";
  * all active swap_info_structs
  * protected with swap_lock, and ordered by priority.
  */
-static PLIST_HEAD(swap_active_head);
+PLIST_HEAD(swap_active_head);
 
 /*
  * all available (active, not full) swap_info_structs
@@ -3988,6 +3989,7 @@ static int __init swapfile_init(void)
 	swap_migration_ad_supported = true;
 #endif /* CONFIG_MIGRATION */
 
+	swap_tiers_init();
 	return 0;
 }
 subsys_initcall(swapfile_init);
-- 
2.48.1

From nobody Mon Jun 22 07:37:01 2026
From: Youngjun Park
To: akpm@linux-foundation.org
Cc: chrisl@kernel.org, youngjun.park@lge.com, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kasong@tencent.com,
 hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com,
 nphamcs@gmail.com, baoquan.he@linux.dev, baohua@kernel.org, yosry@kernel.org,
 gunho.lee@lge.com, taejoon.song@lge.com, hyungjun.cho@lge.com,
 mkoutny@suse.com, baver.bae@lge.com, matia.kim@lge.com
Subject: [PATCH v9 2/6] mm: swap: associate swap devices with tiers
Date: Sun, 21 Jun 2026 03:16:27 +0900
Message-ID: <20260620181635.299364-3-youngjun.park@lge.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20260620181635.299364-1-youngjun.park@lge.com>
References: <20260620181635.299364-1-youngjun.park@lge.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

This patch connects swap devices to the swap tier infrastructure,
ensuring that devices are correctly
assigned to tiers based on their priority.

A `tier_mask` is added to identify the tier membership of swap devices.
Although tier-based allocation logic is not yet implemented, this mapping
is necessary to track which tier a device belongs to. Upon activation,
the device is assigned to a tier by matching its priority against the
configured tier ranges.

The infrastructure allows dynamic modification of tiers, such as
splitting or merging ranges. These operations are permitted provided
that the tier assignment of already configured swap devices remains
unchanged.

This patch also adds the documentation for the swap tier feature,
covering the core concepts, sysfs interface usage, and configuration
details.

Reviewed-by: Baoquan He
Signed-off-by: Youngjun Park
---
 Documentation/mm/index.rst     |   1 +
 Documentation/mm/swap-tier.rst | 150 +++++++++++++++++++++++++++++++++
 MAINTAINERS                    |   1 +
 include/linux/swap.h           |   1 +
 mm/swap_state.c                |   2 +-
 mm/swap_tier.c                 | 101 +++++++++++++++++++---
 mm/swap_tier.h                 |  13 ++-
 mm/swapfile.c                  |   2 +
 8 files changed, 257 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst

diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index 7aa2a8886908..a0d1447c5569 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -21,6 +21,7 @@ see the :doc:`admin guide <../admin-guide/mm/index>`.
    page_reclaim
    swap
    swap-table
+   swap-tier
    page_cache
    shmfs
    oom
diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
new file mode 100644
index 000000000000..0fb4a1153a67
--- /dev/null
+++ b/Documentation/mm/swap-tier.rst
@@ -0,0 +1,150 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li Youngjun Park
+
+==========
+Swap Tier
+==========
+
+Swap tier is a collection of user-named groups classified by priority ranges.
+It acts as a facilitation layer, allowing users to manage swap devices based
+on their speeds.
+
+Users are encouraged to assign swap device priorities according to device
+speed to fully utilize this feature. While the current implementation is
+integrated with cgroups, the concept is designed to be extensible for other
+subsystems in the future.
+
+Priority Range
+--------------
+
+The specified tiers must cover the entire priority range from -1
+(DEF_SWAP_PRIO) to SHRT_MAX.
+
+Consistency
+-----------
+
+Tier consistency is guaranteed with a focus on maximizing flexibility. When a
+swap device is activated within a tier range, the tier covering that device's
+priority is guaranteed not to disappear or change while the device remains
+active. Adding a new tier may split the range of an existing tier, but the
+active device's tier assignment remains unchanged.
+
+However, specifying a tier in a cgroup does not guarantee the tier's existence.
+Consequently, the corresponding tier can disappear at any time.
+
+Configuration Interface
+-----------------------
+
+The swap tiers can be configured via the following interface:
+
+/sys/kernel/mm/swap/tiers
+
+Operations can be performed using the following syntax:
+
+* Add: ``+"":""``
+* Remove: ``-""``
+
+Tier names must consist of alphanumeric characters and underscores. Multiple
+operations can be provided in a single write, separated by commas (",") or
+whitespace (spaces, tabs, newlines).
+
+When configuring tiers, the specified value represents the **start priority**
+of that tier. The end priority is automatically determined by the start
+priority of the next higher tier. Consequently, adding a tier
+automatically adjusts the ranges of adjacent tiers to ensure continuity.
+
+Examples
+--------
+
+**1. Initialization**
+
+A tier starting at -1 is mandatory to cover the entire priority range up to
+SHRT_MAX. In this example, 'HDD' starts at 50, and 'NET' covers the remaining
+lower range starting from -1.
+
+::
+
+    # echo "+HDD:50, +NET:-1" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    HDD              0     50          32767
+    NET              1     -1          49
+
+**2. Adding a New Tier (split)**
+
+A new tier 'SSD' is added at priority 100, splitting the existing 'HDD' tier.
+The ranges are automatically recalculated:
+
+* 'SSD' takes the top range (100 to SHRT_MAX).
+* 'HDD' is adjusted to the range between 'NET' and 'SSD' (50 to 99).
+* 'NET' remains unchanged (-1 to 49).
+
+::
+
+    # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              0     50          99
+    NET              1     -1          49
+
+**3. Removal (merge)**
+
+Tiers can be removed using the '-' prefix.
+::
+
+    # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+When a tier is removed, its priority range is merged into the adjacent
+tier. The merge direction is always upward (the tier below expands),
+except when the lowest tier is removed - in that case the tier above
+shifts its starting priority down to -1 to maintain full range coverage.
+
+::
+
+    Initial state:
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              1     50          99
+    NET              0     -1          49
+
+    # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              1     50          32767       <- merged with SSD's range
+    NET              0     -1          49
+
+    # echo "-NET" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              1     -1          32767       <- shifted down to -1
+
+**4. Interaction with Active Swap Devices**
+
+If a swap device is active (swapon), the tier covering that device's
+priority cannot be removed. Splitting the active tier's range is only
+allowed above the device's priority.
+
+Assume a swap device is active at priority 60 (inside 'HDD' tier).
+
+::
+
+    # swapon -p 60 /dev/zram0
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              0     50          32767
+    NET              1     -1          49
+
+    # echo "-HDD" > /sys/kernel/mm/swap/tiers
+    -bash: echo: write error: Device or resource busy
+
+    # echo "+SSD:60" > /sys/kernel/mm/swap/tiers
+    -bash: echo: write error: Device or resource busy
+
+    # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              0     50          99          <- device (prio 60) stays here
+    NET              1     -1          49
diff --git a/MAINTAINERS b/MAINTAINERS
index d1bb3b4b1e1c..4293048be1ab 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17052,6 +17052,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	Documentation/ABI/testing/sysfs-kernel-mm-swap
 F:	Documentation/mm/swap-table.rst
+F:	Documentation/mm/swap-tier.rst
 F:	include/linux/swap.h
 F:	include/linux/swapfile.h
 F:	include/linux/swapops.h
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6d72778e6cc3..21286945770a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,6 +250,7 @@ struct swap_info_struct {
 	struct percpu_ref users;	/* indicate and keep swap device valid. */
 	unsigned long	flags;		/* SWP_USED etc: see above */
 	signed short	prio;		/* swap priority of this type */
+	int tier_mask;			/* swap tier mask */
 	struct plist_node list;		/* entry in swap_active_head */
 	signed char	type;		/* strange name for an index */
 	unsigned int	max;		/* size of this swap device */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 762d9ca6ad5a..2f382d4dcbdc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -1063,7 +1063,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 		}
 	}
 
-	if (!swap_tiers_validate()) {
+	if (!swap_tiers_update()) {
 		ret = -EINVAL;
 		goto restore;
 	}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index ac7a3c2a48cb..6b57cadb3e95 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -38,6 +38,8 @@ static LIST_HEAD(swap_tier_inactive_list);
 	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
 	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
 
+#define MASK_TO_TIER(mask) (&swap_tiers[__ffs((mask))])
+
 #define for_each_tier(tier, idx) \
 	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
 		idx++, tier = &swap_tiers[idx])
@@ -59,6 +61,26 @@ static bool swap_tier_is_active(void)
 	return !list_empty(&swap_tier_active_list);
 }
 
+static bool swap_tier_prio_in_range(struct swap_tier *tier, short prio)
+{
+	if (tier->prio <= prio && TIER_END_PRIO(tier) >= prio)
+		return true;
+
+	return false;
+}
+
+static bool swap_tier_prio_is_used(short prio)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio == prio)
+			return true;
+	}
+
+	return false;
+}
+
 static struct swap_tier *swap_tier_lookup(const char *name)
 {
 	struct swap_tier *tier;
@@ -99,6 +121,7 @@ void swap_tiers_init(void)
 	int idx;
 
 	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+	BUILD_BUG_ON(MAX_SWAPTIER > TIER_DEFAULT_IDX);
 
 	for_each_tier(tier, idx) {
 		INIT_LIST_HEAD(&tier->list);
@@ -149,17 +172,29 @@ static struct swap_tier *swap_tier_prepare(const char *name, short prio)
 	return tier;
 }
 
-static int swap_tier_check_range(short prio)
+static int swap_tier_can_split_range(short new_prio)
 {
+	struct swap_info_struct *p;
 	struct swap_tier *tier;
 
 	lockdep_assert_held(&swap_lock);
 	lockdep_assert_held(&swap_tier_lock);
 
-	for_each_active_tier(tier) {
-		/* No overwrite */
-		if (tier->prio == prio)
-			return -EINVAL;
+	plist_for_each_entry(p, &swap_active_head, list) {
+		if (p->tier_mask == TIER_DEFAULT_MASK)
+			continue;
+
+		tier = MASK_TO_TIER(p->tier_mask);
+		if (!swap_tier_prio_in_range(tier, new_prio))
+			continue;
+
+		/*
+		 * Device sits in a tier that spans new_prio;
+		 * splitting here would reassign it to a
+		 * different tier.
+		 */
+		if (p->prio >= new_prio)
+			return -EBUSY;
 	}
 
 	return 0;
@@ -199,7 +234,11 @@ int swap_tiers_add(const char *name, int prio)
 	if (!swap_tier_validate_name(name))
 		return -EINVAL;
 
-	ret = swap_tier_check_range(prio);
+	/* No overwrite */
+	if (swap_tier_prio_is_used(prio))
+		return -EBUSY;
+
+	ret = swap_tier_can_split_range(prio);
 	if (ret)
 		return ret;
 
@@ -226,6 +265,11 @@ int swap_tiers_remove(const char *name)
 	if (!tier)
 		return -EINVAL;
 
+	/* Simulate adding a tier to check for conflicts */
+	ret = swap_tier_can_split_range(tier->prio);
+	if (ret)
+		return ret;
+
 	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
 	if (!list_is_singular(&swap_tier_active_list) &&
 	    tier->prio == DEF_SWAP_PRIO)
@@ -236,13 +280,15 @@ int swap_tiers_remove(const char *name)
 	return ret;
 }
 
-static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
 /*
- * XXX: When multiple operations (adds and removes) are submitted in a
- * single write, reverting each individually on failure is complex and
- * error-prone. Instead, snapshot the entire state beforehand and
- * restore it wholesale if any operation fails.
+ * XXX: Static global snapshot buffer for batch operations. Small
+ * and used once per write, so a static global is not bad.
+ * When multiple adds/removes are submitted in a single write,
+ * reverting each individually on failure is error-prone. Instead,
+ * snapshot beforehand and restore wholesale if any operation fails.
  */
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+
 void swap_tiers_snapshot(void)
 {
 	BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
@@ -282,10 +328,30 @@ void swap_tiers_snapshot_restore(void)
 	}
 }
 
-bool swap_tiers_validate(void)
+void swap_tiers_assign_dev(struct swap_info_struct *swp)
 {
 	struct swap_tier *tier;
 
+	lockdep_assert_held(&swap_lock);
+
+	for_each_active_tier(tier) {
+		if (swap_tier_prio_in_range(tier, swp->prio)) {
+			swp->tier_mask = TIER_MASK(tier);
+			return;
+		}
+	}
+
+	swp->tier_mask = TIER_DEFAULT_MASK;
+}
+
+bool swap_tiers_update(void)
+{
+	struct swap_tier *tier;
+	struct swap_info_struct *swp;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
 	/*
 	 * Initial setting might not cover DEF_SWAP_PRIO.
 	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
@@ -298,5 +364,16 @@ bool swap_tiers_validate(void)
 		return false;
 	}
 
+	/*
+	 * If applied initially, the swap tier_mask may change
+	 * from the default value.
+	 */
+	plist_for_each_entry(swp, &swap_active_head, list) {
+		/* Tier is already configured */
+		if (swp->tier_mask != TIER_DEFAULT_MASK)
+			break;
+		swap_tiers_assign_dev(swp);
+	}
+
 	return true;
 }
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index a1395ec02c24..3e355f857363 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -5,8 +5,15 @@
 #include
 #include
 
+/* Forward declarations */
+struct swap_info_struct;
+
 extern spinlock_t swap_tier_lock;
 
+#define TIER_ALL_MASK		(~0)
+#define TIER_DEFAULT_IDX	(31)
+#define TIER_DEFAULT_MASK	(1U << TIER_DEFAULT_IDX)
+
 /* Initialization and application */
 void swap_tiers_init(void);
 ssize_t swap_tiers_sysfs_show(char *buf);
@@ -16,5 +23,9 @@ int swap_tiers_remove(const char *name);
 
 void swap_tiers_snapshot(void);
 void swap_tiers_snapshot_restore(void);
-bool swap_tiers_validate(void);
+bool swap_tiers_update(void);
+
+/* Tier assignment */
+void swap_tiers_assign_dev(struct swap_info_struct *swp);
+
 #endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3f7225dbc6cd..9a86ebe992f4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3036,6 +3036,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 
 	/* Add back to available list */
 	add_to_avail_list(si, true);
+
+	swap_tiers_assign_dev(si);
 }
 
 /*
-- 
2.48.1

From nobody Mon Jun 22 07:37:01 2026
Received: from mail-pl1-f176.google.com (mail-pl1-f176.google.com [209.85.214.176])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id DB50B345CAF
	for ; Sat, 20 Jun 2026 18:17:04 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.176
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781979426; cv=none;
b=B9BRwcsKCv4HPOq4tbQZH/2r534W3mxc9CCrb7jd3ZCpujtUu0jNdKMDE+sWnCu4ELyi4zhFwrG1Ylt5bENfIGAk4sTp535mRUSoK1c1i1QzFjeRb4kZHAU5Bj4OsM6vcXImmolV0sy7kpF5205zw9wo7nJ8HCqM5y5wFDSt67c= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781979426; c=relaxed/simple; bh=ENr6YAPp19BxIaCtJXBCCOLDywZxLaNSRUcJ1CCqy2k=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=MIjNirc19fLIoXYHl5qZfvrFS+DldMBdjowGfWFQ45XiRuzPNiMepJZil/bVBcR12Ng7WlrLbz/QtVNNYPXfrFkxWZmt8+EZgUOx/9hPDtzxXw3/fvpmuVpkcr6CRnFUtdZXOEtJeoKejF1HKa5kJ0Ls/0zLKtnIAsxVjcPrWRI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=sv646IAs; arc=none smtp.client-ip=209.85.214.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="sv646IAs" Received: by mail-pl1-f176.google.com with SMTP id d9443c01a7336-2c40397e3caso29426725ad.2 for ; Sat, 20 Jun 2026 11:17:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1781979424; x=1782584224; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=XJdnGcYqrN2PZHNQjPn0cMT+HfsR3sWO/K33y7osfm0=; b=sv646IAs7JfFif9oT2DxdubR9EqQ1IvNTdLOFGJrkr2aymstSh4cc/rvv8RnRiapGw QJXEHxXe5aDfCg4eBYtmx8Y4Z7KpvKqDjQfUWccpEdrdGc6ON1LM3p/6z/BPPooBKTxs evivdNpbu/pUQxLpnvx2EMYpxaX9pWMf1vOnFmKJC/uMN6TvFA81uM9uph3NpNDpCb5I haDaIqQAktbUqTrZFoYxtylFgVChodI+aV92qTYb79zzHLYaTOu5O/AFJJJMGrM6zkR3 2ebU/Fs3zrNM4rwi4oWBHRTD9hPISj5/3tqYi05broeoA4XTkpNd1Q/778CUREYFCAoI 9a7w== 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1781979424; x=1782584224; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=XJdnGcYqrN2PZHNQjPn0cMT+HfsR3sWO/K33y7osfm0=; b=ZjbOJpaphRFp7uyC8A9btPioBpkbOKdERYzWy3Hgb/8YSvT+PWWHi9nfJdpeXGgVLe +vNypsoyvsMjnUszHflwmnpbL9ai0BEmhKTjS5M56N92vagoQEHZktpjwSzMG2wJeeeX GYVdm4L1EZGFJ4be0Cl6ZTy784B12T6r+vJXthM9uexUdGztcAeQVSqZn7AVtp83KMpv 3wAhk/TK69NQa8d5JatM16+cMiHq502Z3e542y9x5CSFyu71nOWAMwjWQYz8pxSuFd1b TIgbm8MlDcCwBpI9ud9dNM3SqHcX4v81tjYZVkOZc+TERXYQ3FBnzzSiRaldXKiqcAV1 tRGw== X-Forwarded-Encrypted: i=1; AHgh+RrTR5vM4nDaGLQdXecu/8DxJyzDEY97FWygkkOAAZnDOcfoEw+uz3uAeNollUOnYAuXeEEY3Ch/Q4Zukgg=@vger.kernel.org X-Gm-Message-State: AOJu0Yygyvwq0QeAVGIh3GBCuCFKeiciHcvuc/o7D47On19yC2SYBSse hSjikbZWtPbAjVurMYmhf84xinIG3ZsLH+LeHHYSvm/f0s4ysf9xfcsl X-Gm-Gg: AfdE7cl0iO0zv+PTP61CfS83vbJZY952B8DoXpXz/FoZrjsz/w+idHDPXGvIvLpYfoF nbgGFj7Pjlm7dJJLJ9X8NaXJmubmkCtlSvViTg1npJAXBrnQHcPeaW2M/pwu5cGvc5+Yh8IFstF 1XxXhzcKbUu7LNKFcmXUNwG4Q4fnRy8W3pMjifam+YEQYLb5Oj6HczMKz9EkO5Jpp9BHetwZBSL xgokDN8xTgM4VgbeXmgo+ZqfOnXLRHk+P92OVCS/s6yRdjRgP34VZyFF4IMTrYTFkgPncYmbDsN aIcukDeZSle7Ag/uwlwX4b38Z1feQsjoKPFPh7sQPvIQvk16ditcjFvY17IO5EUSRKsqqUNg7ZP hG2d4ZdtHSjcLG4SNjfMoipkoa9PTjvBfbAtBqYZ8ik0ZshlBS8DCmz6x4KjUR/BrS/G2kpQRJJ 4lnUHQ0p78zR2yFVId46Kr6oyMV89QOHrTukAT/L3woYw/hkn06iNIfzKbMYsHLMkherPAvVsJJ A== X-Received: by 2002:a17:903:1a44:b0:2ba:6518:a6d4 with SMTP id d9443c01a7336-2c725d7ee92mr81603135ad.20.1781979424175; Sat, 20 Jun 2026 11:17:04 -0700 (PDT) Received: from localhost.localdomain ([220.85.166.190]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2c7436af6d9sm30339465ad.4.2026.06.20.11.16.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 20 Jun 2026 11:17:03 -0700 (PDT) From: Youngjun Park X-Google-Original-From: Youngjun Park To: akpm@linux-foundation.org Cc: 
 chrisl@kernel.org, youngjun.park@lge.com, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kasong@tencent.com,
 hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com,
 nphamcs@gmail.com, baoquan.he@linux.dev, baohua@kernel.org, yosry@kernel.org,
 gunho.lee@lge.com, taejoon.song@lge.com, hyungjun.cho@lge.com,
 mkoutny@suse.com, baver.bae@lge.com, matia.kim@lge.com
Subject: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
Date: Sun, 21 Jun 2026 03:16:28 +0900
Message-ID: <20260620181635.299364-4-youngjun.park@lge.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20260620181635.299364-1-youngjun.park@lge.com>
References: <20260620181635.299364-1-youngjun.park@lge.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Introduce memory.swap.tiers.max, a flat-keyed file listing each tier
defined in /sys/kernel/mm/swap/tiers with its state, "max" (allowed,
the default) or "0" (disabled). A tier is one bit in the cgroup's tier
mask, so writing a tier name followed by "max" or "0" sets or clears
that bit.

The current use case needs no per-tier amount control, so only "max"
(on) and "0" (off) are supported. Per-tier swap usage is therefore not
tracked; enforcement relies on a fast runtime bitmask check instead.

We maintain both `mask` and `effective_mask`. The `effective_mask` is
strictly bounded by the parent (e.g., if a parent is "0", the child's
effective state is "0" even if its `mask` is "max"). Maintaining this
separately avoids costly cgroup tree traversals to check ancestors at
runtime.
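
The relationship between `mask` and `effective_mask` described above can
be illustrated with a small userspace sketch. This is not the kernel
code: `struct demo_cgroup` and the `demo_*` helpers are hypothetical
names made up for illustration, and locking, READ_ONCE/WRITE_ONCE and
the subtree walk are left out. It only shows the rule that the effective
mask is the parent's effective mask ANDed with the group's own mask, and
that the allocation-time check is a single AND.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TIER_ALL_MASK (~0)

/* Hypothetical stand-in for the two mask fields added to struct mem_cgroup. */
struct demo_cgroup {
	struct demo_cgroup *parent;	/* NULL for a top-level group */
	int tier_mask;			/* what this group requested */
	int tier_effective_mask;	/* what it may actually use */
};

/* effective mask = parent's effective mask AND the group's own mask. */
static void demo_inherit_mask(struct demo_cgroup *cg)
{
	int parent_mask = cg->parent ? cg->parent->tier_effective_mask
				     : DEMO_TIER_ALL_MASK;

	cg->tier_effective_mask = parent_mask & cg->tier_mask;
}

/* Enable ("max") or disable ("0") one tier bit, then recompute this group. */
static void demo_set_tier(struct demo_cgroup *cg, int tier_bit, int enable)
{
	assert(tier_bit != 0);

	if (enable)
		cg->tier_mask |= tier_bit;
	else
		cg->tier_mask &= ~tier_bit;
	demo_inherit_mask(cg);
}

/* Allocation-time check: one AND, no walk up the hierarchy. */
static int demo_tier_allowed(const struct demo_cgroup *cg, int tier_bit)
{
	return (cg->tier_effective_mask & tier_bit) != 0;
}
```

In this sketch the caller has to call demo_inherit_mask() on each
descendant, parents first, after changing an ancestor. A child's own
`tier_mask` is never modified by the parent, so re-enabling a tier in
the parent restores the child's earlier choice.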

Suggested-by: Shakeel Butt
Suggested-by: Yosry Ahmed
Signed-off-by: Youngjun Park
---
 Documentation/admin-guide/cgroup-v2.rst |  20 +++++
 Documentation/mm/swap-tier.rst          |   9 +++
 include/linux/memcontrol.h              |   5 ++
 mm/memcontrol.c                         |  67 ++++++++++++++++
 mm/swap_state.c                         |   5 +-
 mm/swap_tier.c                          | 102 +++++++++++++++++++++++-
 mm/swap_tier.h                          |  57 +++++++++++--
 7 files changed, 255 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..4843ffcfd110 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1850,6 +1850,26 @@ The following nested keys are defined.
 	Swap usage hard limit. If a cgroup's swap usage reaches this
 	limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.swap.tiers.max
+	A read-write flat-keyed file which exists on non-root
+	cgroups. The default is "max" for every tier.
+
+	Limits the swap tiers this cgroup may swap to. Tiers are
+	defined globally in /sys/kernel/mm/swap/tiers and listed here,
+	one per line. When read, the values are displayed in descending
+	order of the tiers (highest tier first)::
+
+	  max
+	  0
+	  ...
+
+	Currently, only "max" and "0" are supported. "max" allows the
+	tier, "0" disables it. Each write sets a single " max"
+	or " 0" pair.
+
+	A child may only narrow what its parent allows. A tier an
+	ancestor disabled stays disabled regardless of the value here.
+
   memory.swap.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined. Unless specified
diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
index 0fb4a1153a67..addbc495de8c 100644
--- a/Documentation/mm/swap-tier.rst
+++ b/Documentation/mm/swap-tier.rst
@@ -15,6 +15,15 @@ speed to fully utilize this feature. While the current implementation is
 integrated with cgroups, the concept is designed to be extensible for other
 subsystems in the future.
 
+Use case
+---------
+
+Users can perform selective swapping by choosing a swap tier assigned according
+to speed within a cgroup.
+
+For more information on cgroup v2, please refer to
+``Documentation/admin-guide/cgroup-v2.rst``.
+
 Priority Range
 --------------
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1f46a0016fc..d53826c68562 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -283,6 +283,11 @@ struct mem_cgroup {
 	struct lru_gen_mm_list mm_list;
 #endif
 
+#ifdef CONFIG_SWAP
+	int tier_mask;
+	int tier_effective_mask;
+#endif
+
 #ifdef CONFIG_MEMCG_V1
 	/* Legacy consumer-oriented counters */
 	struct page_counter kmem;		/* v1 only */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af08232..63259576792a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -68,6 +68,7 @@
 #include
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "swap_tier.h"
 
 #include
 
@@ -4244,6 +4245,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
 
+	swap_tiers_memcg_inherit_mask(memcg);
+
 	/*
 	 * Ensure mem_cgroup_from_private_id() works once we're fully online.
 	 *
@@ -5785,6 +5788,64 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int swap_tier_max_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	swap_tiers_mask_show(m, memcg);
+	return 0;
+}
+
+static ssize_t swap_tier_max_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	char *pos, *name, *val;
+	bool enable;
+	int mask;
+	int ret = 0;
+
+	pos = strstrip(buf);
+	name = strsep(&pos, " \t\n");
+	if (!name || !*name)
+		return -EINVAL;
+	if (pos)
+		pos = skip_spaces(pos);
+	val = strsep(&pos, " \t\n");
+	if (!val || !*val)
+		return -EINVAL;
+	if (pos && *skip_spaces(pos))
+		return -EINVAL;
+
+	if (!strcmp(val, "max"))
+		enable = true;
+	else if (!strcmp(val, "0"))
+		enable = false;
+	else
+		return -EINVAL;
+
+	spin_lock(&swap_tier_lock);
+	mask = swap_tiers_mask_lookup(name);
+	if (!mask) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * tier_mask is set per memcg here; the effective mask is clamped
+	 * to the parent's in swap_tiers_memcg_sync_mask().
+	 */
+	if (enable)
+		WRITE_ONCE(memcg->tier_mask, memcg->tier_mask | mask);
+	else
+		WRITE_ONCE(memcg->tier_mask, memcg->tier_mask & ~mask);
+
+	swap_tiers_memcg_sync_mask(memcg);
+out:
+	spin_unlock(&swap_tier_lock);
+	return ret ? ret : nbytes;
+}
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -5817,6 +5878,12 @@ static struct cftype swap_files[] = {
 		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
 		.seq_show = swap_events_show,
 	},
+	{
+		.name = "swap.tiers.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_tier_max_show,
+		.write = swap_tier_max_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2f382d4dcbdc..712b225509cc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -1021,6 +1021,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 	char *p, *token, *name, *tmp;
 	int ret = 0;
 	short prio;
+	int mask = 0;
 
 	tmp = kstrdup(buf, GFP_KERNEL);
 	if (!tmp)
@@ -1053,7 +1054,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 				goto restore;
 			break;
 		case '-':
-			ret = swap_tiers_remove(token + 1);
+			ret = swap_tiers_remove(token + 1, &mask);
 			if (ret)
 				goto restore;
 			break;
@@ -1063,7 +1064,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 		}
 	}
 
-	if (!swap_tiers_update()) {
+	if (!swap_tiers_update(mask)) {
 		ret = -EINVAL;
 		goto restore;
 	}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index 6b57cadb3e95..98bfee760b8d 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -253,7 +253,7 @@ int swap_tiers_add(const char *name, int prio)
 	return ret;
 }
 
-int swap_tiers_remove(const char *name)
+int swap_tiers_remove(const char *name, int *mask)
 {
 	int ret = 0;
 	struct swap_tier *tier;
@@ -276,6 +276,7 @@ int swap_tiers_remove(const char *name)
 		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
 
 	swap_tier_inactivate(tier);
+	*mask |= TIER_MASK(tier);
 
 	return ret;
 }
@@ -344,7 +345,24 @@ void swap_tiers_assign_dev(struct swap_info_struct *swp)
 	swp->tier_mask = TIER_DEFAULT_MASK;
 }
 
-bool swap_tiers_update(void)
+#ifdef CONFIG_MEMCG
+static void swap_tier_memcg_propagate(int mask)
+{
+	struct mem_cgroup *child;
+
+	for_each_mem_cgroup_tree(child, root_mem_cgroup) {
+		WRITE_ONCE(child->tier_mask, child->tier_mask | mask);
+		WRITE_ONCE(child->tier_effective_mask,
+			   child->tier_effective_mask | mask);
+	}
+}
+#else
+static void swap_tier_memcg_propagate(int mask)
+{
+}
+#endif
+
+bool swap_tiers_update(int mask)
 {
 	struct swap_tier *tier;
 	struct swap_info_struct *swp;
@@ -375,5 +393,85 @@ bool swap_tiers_update(void)
 		swap_tiers_assign_dev(swp);
 	}
 
+	/*
+	 * When a tier is removed, its index (bit position in the mask) becomes
+	 * free for reassignment to a future tier. If a memcg had previously
+	 * disabled this tier (cleared the bit in its swap.tiers.max file), the
+	 * effective mask would keep that bit clear -- meaning the new tier at
+	 * the same index would be silently unavailable, an invisible cgroup
+	 * constraint left behind by a tier that no longer exists.
+	 *
+	 * To prevent this, OR the removed tier's mask bit into every memcg's
+	 * tier_mask and tier_effective_mask. This resets the bit so the new
+	 * tier is accessible by default; users who want to restrict it must
+	 * explicitly disable it after the tier is re-created.
+	 */
+	if (mask)
+		swap_tier_memcg_propagate(mask);
+
 	return true;
 }
+
+#ifdef CONFIG_MEMCG
+void swap_tiers_mask_show(struct seq_file *m, struct mem_cgroup *memcg)
+{
+	struct swap_tier *tier;
+	int mask;
+
+	spin_lock(&swap_tier_lock);
+	mask = READ_ONCE(memcg->tier_mask);
+
+	for_each_active_tier(tier)
+		seq_printf(m, "%s %s\n", tier->name,
+			   (mask & TIER_MASK(tier)) ? "max" : "0");
+	spin_unlock(&swap_tier_lock);
+}
+
+int swap_tiers_mask_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		if (!strcmp(name, tier->name))
+			return TIER_MASK(tier);
+	}
+
+	return 0;
+}
+
+static void __swap_tier_memcg_inherit_mask(struct mem_cgroup *memcg,
+					   struct mem_cgroup *parent)
+{
+	int parent_mask = parent
+		? READ_ONCE(parent->tier_effective_mask)
+		: TIER_ALL_MASK;
+
+	WRITE_ONCE(memcg->tier_effective_mask,
+		   parent_mask & READ_ONCE(memcg->tier_mask));
+}
+
+/* Computes the initial effective mask from the parent's effective mask. */
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg)
+{
+	spin_lock(&swap_tier_lock);
+	memcg->tier_mask = TIER_ALL_MASK;
+	__swap_tier_memcg_inherit_mask(memcg, parent_mem_cgroup(memcg));
+	spin_unlock(&swap_tier_lock);
+}
+
+/*
+ * Called when a memcg's tier_mask is modified. Walks the subtree
+ * and recomputes each descendant's effective mask against its parent.
+ */
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *child;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_mem_cgroup_tree(child, memcg)
+		__swap_tier_memcg_inherit_mask(child, parent_mem_cgroup(child));
+}
+#endif
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index 3e355f857363..e2f0cf32035b 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -10,22 +10,67 @@ struct swap_info_struct;
 
 extern spinlock_t swap_tier_lock;
 
-#define TIER_ALL_MASK		(~0)
-#define TIER_DEFAULT_IDX	(31)
-#define TIER_DEFAULT_MASK	(1U << TIER_DEFAULT_IDX)
-
 /* Initialization and application */
 void swap_tiers_init(void);
 ssize_t swap_tiers_sysfs_show(char *buf);
 
 int swap_tiers_add(const char *name, int prio);
-int swap_tiers_remove(const char *name);
+int swap_tiers_remove(const char *name, int *mask);
 
 void swap_tiers_snapshot(void);
 void swap_tiers_snapshot_restore(void);
-bool swap_tiers_update(void);
+bool swap_tiers_update(int mask);
 
 /* Tier assignment */
 void swap_tiers_assign_dev(struct swap_info_struct *swp);
 
+#define TIER_ALL_MASK		(~0)
+#define TIER_DEFAULT_IDX	(31)
+#define TIER_DEFAULT_MASK	(1U << TIER_DEFAULT_IDX)
+
+#if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG)
+/* Memcg related functions */
+void swap_tiers_mask_show(struct seq_file *m, struct mem_cgroup *memcg);
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg);
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg);
+int swap_tiers_mask_lookup(const char *name);
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+	struct mem_cgroup *memcg;
+	int mask = TIER_ALL_MASK;
+
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	if (memcg)
+		mask = READ_ONCE(memcg->tier_effective_mask);
+	rcu_read_unlock();
+
+	return mask;
+}
+#else
+static inline void swap_tiers_mask_show(struct seq_file *m,
+					struct mem_cgroup *memcg) {}
+static inline void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg) {}
+static inline void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) {}
+static inline int swap_tiers_mask_lookup(const char *name)
+{
+	return 0;
+}
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+	return TIER_ALL_MASK;
+}
+#endif
+
+/**
+ * swap_tiers_mask_test - Check if the tier mask is valid
+ * @tier_mask: The tier mask to check
+ * @mask: The mask to compare against
+ *
+ * Return: true if condition matches, false otherwise
+ */
+static inline bool swap_tiers_mask_test(int tier_mask, int mask)
+{
+	return tier_mask & mask;
+}
 #endif /* _SWAP_TIER_H */
-- 
2.48.1

From nobody Mon Jun 22 07:37:01 2026
Received: from mail-pl1-f180.google.com (mail-pl1-f180.google.com [209.85.214.180])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 13CE23438A6
	for ; Sat, 20 Jun 2026 18:17:10 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.180
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781979431; cv=none;
	b=XCdCrIechnFEG0tElWG1Vqypz3oaJtxEjhAf7lXmEDAnog/Ge+4MXvc5/ln+yIvS9oUjzJVt4OXnGXS5RWUeKuc+y7RqMP4umxq80fdyu0jxzQuxk2GEC2W6pGUyn7cos3j83OU6TuHuBUgojSgDtTz+sIEwrucRTSejhrRvUxU=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781979431; c=relaxed/simple;
bh=pNZz6I4aoMMaZf0YjkzTitFAZSWSIRpbW4EBzjpsk58=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=k6jDGqfMGrQ4/vzivhayY73R+IM0OW3MZS/5ULuBraPxXtKRUpGVH8QmlfknDXfyeD3QUmN1kSHJqqkw0g3lxUS/ViPM2cpFcmPGWKB2eOq1TJPJm1cYo9MtI8hIHq6z5kqsZvWRPyEkoXIXVopuCQM0H0KZs5C3TtME87a+3Bs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=BHdFa90z; arc=none smtp.client-ip=209.85.214.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="BHdFa90z" Received: by mail-pl1-f180.google.com with SMTP id d9443c01a7336-2c40397e3caso29427005ad.2 for ; Sat, 20 Jun 2026 11:17:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1781979429; x=1782584229; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=GYeylIml7s8sCOsaJDbGPnepJyRivCDRMtCf2+mDIaE=; b=BHdFa90zDBclBbApum0Qi05Ka1yzsIznYnkCXMELg67yXA3dKrAJH75SvJ4FwZM4P9 gJMSPvN+8xS3UrJcx/8lb+iCyprUkcouX99i0BW/WGOjfv5lR9y6XKamXyi6CRLgj0Ci zVovixSVBIHdokoDPo1g36dF9KCoRfb34CNGDhOKxqEVkQn/X/cVnjUknibkYOHU7qYu 0CR7CsArxbE7ns3DameBvR4W2HojLmt2S3vzdrX+WHIHVzzlTcjSae8Ea3H4GqpGel5N zCXncyp7bPofdiSNY5DzJu+z50JQBPVcfJR3cDTVmSt/6RcLgYXUDjHFeFHJYfs3frhy q4MA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1781979429; x=1782584229; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; 
bh=GYeylIml7s8sCOsaJDbGPnepJyRivCDRMtCf2+mDIaE=; b=UBGboMWK1jEH20/Sj9Y41vS+/IOSk2xmjTfxIcY7o9rp7PWsNScKv1bOGAgLRN+6wm 0zoWIHb1hQHM967FB7TdYa+hWw4LC0xK/wjH9ipSbDQArEADo5RnvG1Y+AcRj8Hbh3pl eN3GCO8Xcqo9MO04rY3u+PIW0FyDwwco8tHTEgaWzNsnploFbAVCVSnvp+CUw8/67Jr4 EZVd3/+SEFVbsdCJYAe+/Yjk2NXhwBJZIB18XY55eAVjo9khrztlwxhiZEX+Mfzbl9GK tRyG05qXqKxuK/GwtUJ+lSnsqVWS0OFGSki6nPcTOcJFzwJqwe7LwwZrniFiYJOFYAsN zkqQ== X-Forwarded-Encrypted: i=1; AHgh+Rok8aqrozDMObAVhdxJsoQUQRIkpvMNfTwwMkU6tIVPmNJTNsrNNW/l27n/1qG6+jm4hpZoZtaOQT5Id/I=@vger.kernel.org X-Gm-Message-State: AOJu0Yxy145euNtOBqG4XCX+HNadj6zlun/2z4FspPMwenrAfuUb5yWy bo7iZxs0EWjrczFUwN083c2X/RxF5B7W4CUS5P+lNzV5LLnke5f+I2QT X-Gm-Gg: AfdE7cmBXoNZh2I6/9F4bm96bPf28vxUQvWVA/ITlwoH8Ay12GMXu4QeyWp9/OxBJ2k mQIp+p6CuGi6DLaFSBP3fYeTrwZJcFW0Zxmey3uxJJ9BgHTl8EnS7mM0Mad7A4A7Km1aeWiT7w4 hoj2Etr3u8EfgLXYl2ZjKz0f4sAlkjDGO+qYjCtdcsaD047nehYqW7HoAZDPFP5HMypcP/7NeI8 pGuNVD4GGCrFD2COhZ2l6bPS9Gu99HI4d7pigKlM5QqDZYDMVisPWpffKGuRrLSIoqkND5+aevm 5t9VQANambCXoOfltHa4k4J+Fgduheza+RHPH9ZjhtFdx/KAXFO2w4zgrKLDfqyH0XzURluaj2I sIQbdUNkITrHnxOa1ujk2YA1R2iHQcR3QKLz6JkgYTQFbM4VbzAo2qKATrEHX6YxFHXD1ty03i9 mi/4o/M89bDkLsZsQGI4BXfsUgMEYoz4S9doZU5T/1sx3bMaG/oUaepRhmU/JbdXnR8qXaqev+s w== X-Received: by 2002:a17:902:d586:b0:2c2:27be:39aa with SMTP id d9443c01a7336-2c725d7d58dmr82330625ad.17.1781979429395; Sat, 20 Jun 2026 11:17:09 -0700 (PDT) Received: from localhost.localdomain ([220.85.166.190]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2c7436af6d9sm30339465ad.4.2026.06.20.11.17.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 20 Jun 2026 11:17:08 -0700 (PDT) From: Youngjun Park X-Google-Original-From: Youngjun Park To: akpm@linux-foundation.org Cc: chrisl@kernel.org, youngjun.park@lge.com, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kasong@tencent.com, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com, 
 baoquan.he@linux.dev, baohua@kernel.org, yosry@kernel.org,
 gunho.lee@lge.com, taejoon.song@lge.com, hyungjun.cho@lge.com,
 mkoutny@suse.com, baver.bae@lge.com, matia.kim@lge.com
Subject: [PATCH v9 4/6] mm: swap: filter swap allocation by memcg tier mask
Date: Sun, 21 Jun 2026 03:16:29 +0900
Message-ID: <20260620181635.299364-5-youngjun.park@lge.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20260620181635.299364-1-youngjun.park@lge.com>
References: <20260620181635.299364-1-youngjun.park@lge.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Apply the memcg's effective tier mask during swap slot allocation to
enforce per-cgroup swap tier restrictions.

The folio's effective mask is computed once and passed to the fast,
slow and discard paths as a parameter, so all of them act on the same
mask even if the memcg's mask changes concurrently.

In the fast path, check the percpu cached swap_info's tier_mask
against the folio's effective mask. If it does not match, fall through
to the slow path.

In the slow path, skip swap devices whose tier_mask is not covered by
the folio's effective mask. The discard fallback honors the mask too:
otherwise it would drain the discard clusters of a device outside the
folio's tiers and then loop back to allocate from a tier the memcg is
not allowed to use.

This works correctly when there is only one non-rotational device in
the system and no devices share the same priority. However, there are
known limitations:

- When non-rotational devices are distributed across multiple tiers,
  and different memcgs are configured to use those distinct tiers,
  they may constantly overwrite the shared percpu swap cache. This
  cache thrashing leads to frequent fast path misses.

- Combined with the above issue, if same-priority devices exist
  among them, a percpu cache miss (overwritten by another memcg)
  forces the allocator to round-robin to the next device prematurely,
  even if the current cluster is not fully exhausted.

These edge cases do not affect the primary use case of directing swap
traffic per cgroup. Further optimization is planned for future work.

Signed-off-by: Youngjun Park
---
 mm/swapfile.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9a86ebe992f4..624d1ba93fd9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1359,7 +1359,7 @@ static bool get_swap_device_info(struct swap_info_struct *si)
  * Fast path try to get swap entries with specified order from current
  * CPU's swap entry pool (a cluster).
  */
-static bool swap_alloc_fast(struct folio *folio)
+static bool swap_alloc_fast(struct folio *folio, int mask)
 {
 	unsigned int order = folio_order(folio);
 	struct swap_cluster_info *ci;
@@ -1371,8 +1371,11 @@ static bool swap_alloc_fast(struct folio *folio)
 	 * so checking it's liveness by get_swap_device_info is enough.
 	 */
 	si = this_cpu_read(percpu_swap_cluster.si[order]);
+	if (!si || !swap_tiers_mask_test(si->tier_mask, mask))
+		return false;
+
 	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
-	if (!si || !offset || !get_swap_device_info(si))
+	if (!offset || !get_swap_device_info(si))
 		return false;
 
 	ci = swap_cluster_lock(si, offset);
@@ -1389,13 +1392,16 @@ static bool swap_alloc_fast(struct folio *folio)
 }
 
 /* Rotate the device and switch to a new cluster */
-static void swap_alloc_slow(struct folio *folio)
+static void swap_alloc_slow(struct folio *folio, int mask)
 {
 	struct swap_info_struct *si, *next;
 
 	spin_lock(&swap_avail_lock);
 start_over:
 	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
+		if (!swap_tiers_mask_test(si->tier_mask, mask))
+			continue;
+
 		/* Rotate the device and switch to a new cluster */
 		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
@@ -1429,7 +1435,7 @@ static void swap_alloc_slow(struct folio *folio)
  * Discard pending clusters in a synchronized way when under high pressure.
  * Return: true if any cluster is discarded.
*/ -static bool swap_sync_discard(void) +static bool swap_sync_discard(int mask) { bool ret =3D false; struct swap_info_struct *si, *next; @@ -1437,6 +1443,8 @@ static bool swap_sync_discard(void) spin_lock(&swap_lock); start_over: plist_for_each_entry_safe(si, next, &swap_active_head, list) { + if (!swap_tiers_mask_test(si->tier_mask, mask)) + continue; spin_unlock(&swap_lock); if (get_swap_device_info(si)) { if (si->flags & SWP_PAGE_DISCARD) @@ -1736,6 +1744,7 @@ int folio_alloc_swap(struct folio *folio) { unsigned int order =3D folio_order(folio); unsigned int size =3D 1 << order; + int mask; =20 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio); @@ -1759,13 +1768,14 @@ int folio_alloc_swap(struct folio *folio) } =20 again: + mask =3D folio_tier_effective_mask(folio); local_lock(&percpu_swap_cluster.lock); - if (!swap_alloc_fast(folio)) - swap_alloc_slow(folio); + if (!swap_alloc_fast(folio, mask)) + swap_alloc_slow(folio, mask); local_unlock(&percpu_swap_cluster.lock); =20 if (!order && unlikely(!folio_test_swapcache(folio))) { - if (swap_sync_discard()) + if (swap_sync_discard(mask)) goto again; } =20 --=20 2.48.1 From nobody Mon Jun 22 07:37:01 2026 Received: from mail-pl1-f180.google.com (mail-pl1-f180.google.com [209.85.214.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 27DCD237180 for ; Sat, 20 Jun 2026 18:17:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.180 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781979438; cv=none; b=sPfdQ6AJDzvxg7q7uoaMrefwYOzDrvNrvQCanfT8+3VybSV0mbWwtPkZvYaKfMLeKajsIq5km6aNj3gs1fhNSaFbFTNUYbECQQZtREXoQmU+g0GxzqWZeunT+dw/FlWnleri42sL+NYEuJ9gETP0RzJDQK8JgU2arFk7+BYWyEk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781979438; c=relaxed/simple; 
bh=S5sYyvW9QyY6clCsvttkMWXxewus1Gyh5/QzlK0Cpuc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=sTOV9rcgK5Te5t0ti3i684k3/AIgYTq7yL/eZi5t8bJv6lqWPSDasGDwPU9ZyLepuS2wmud5XoosEz8v4iJ7AJ4XMIgRjJa7CSDEi+xyCJ0gOxFehi5fr3BJ+8mHw6xQ9BUuG0rj+czHGVIu0tjo5llNPiJuEqjW2RG1CuABjo0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=AOTF/UVk; arc=none smtp.client-ip=209.85.214.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="AOTF/UVk" Received: by mail-pl1-f180.google.com with SMTP id d9443c01a7336-2c74383c93cso8512175ad.1 for ; Sat, 20 Jun 2026 11:17:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1781979435; x=1782584235; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=RgH4Dj17obnxSLq/3NpwgVl7Hll2I/0OEirSKJI2pMo=; b=AOTF/UVkJymF10Vt4pjUsPBw/z9L/WN6j+D6VqpHpWVTLD8kWJ6e9WoR5ZP2vCbKIB J6/4X1BRDoC1lJg2sPHIhF+YY5UhC7fT9dbCWd9xKhySq4S3kuh4mH84kmPEGLBEQi3L A/vfB6cvJkb2eppyiPqqPcW9Pnw8PSQK7ic6+f0RHUnN3jB4Y7hNa2qqpP1TiteOx/JD xMtAqWngnHA9AdQksIgSA3sdaMNn2qmeIyCvZgAMx27BesSOT1//UqDznWBhvBvkoA7V ShVYYPqzoUWACOjNs3DmtsQhRRapIFedAVnd3timextaM4YDh/T6pDNFZ6oB/7uJ168Y znVg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1781979435; x=1782584235; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; 
bh=RgH4Dj17obnxSLq/3NpwgVl7Hll2I/0OEirSKJI2pMo=; b=DuIF8sbwi96Fp3wBpAE0vfPhsfZq0pa9KRKh97S9W/NrUjfwxo61RSAn6eRMOTkkzw o/08AvYpvs5BEoShdobrJRBOqDHD4Zx7MWOf9+/YAC8Q4kHKf8DlkutFLVkr7YSnJwDR UnCcXuG2PmS169SLHfZxjVf+ENn64QsZoM8uZS5tr4LzsvbvKp2dBdWrJUFRilNa39z/ FbqCWuzHou838cCQ1GmdqR1FzYUlfw7WpgX8QBxO+uYYDUkyB9FjB3Mwzsf4SCP4dcd6 NcDpubXzCE05pQE5s116BT/Wakz6ee/u06JiSW86qnubAnSJ92oDf2JGq7Ibe7V+oBpC e4DQ== X-Forwarded-Encrypted: i=1; AHgh+RqKbKDTIq9SsVy0hSU6UNS6CWU0qGtEBx28RJ5vC7gCY2Kc8T94TdJfbiMh+1v+WLcynUObfstbkOhNIQU=@vger.kernel.org X-Gm-Message-State: AOJu0Yys4Na2l1/EFC9WG0/EckAPfQrcMP74bfn3QvRrZpB2L8PSTOWO qj+wEgNksIOxxr3wl669KMAHCN7/6db8o1BQzTSL2/zlujZYIo0IsQDP X-Gm-Gg: AfdE7cmGAVOHdPYuxQdr8IhKQQY8BZrZ/PITVo6vpkFBMgWnfZp5ccl+yXiVfZFR8IY dkqPcw8nAbfgRjfJFkrdylUfHOfQqRbziNb+Nid/din24SEDH7D6+PEPVJ45lKMXA67pcy23jZ9 +yRJFVDyva9efimvgXM4uPXQgop64kiVuF34V8/INUveN65nYloJccaNoZ5nmLm1dIUIs3V8U6E n8D3FAzZLGi6nAMShzNsoCdBYU3/mRkC0hpX6ROZ0VAPvJQynjZgfY0swTIQZJFHLimV4Qx/kY0 boXEahmNAOGxVI+CgrcSfNqlM+zFdPWG6TyxiLyFstFYXoNCcr4wUoSu4Rjer21K99qow5THU5G lpMI0aO07ig/aAZQQx4/Ncb5OLxJsrroZPrWWkl8oEYUYYqFBvYVNPC0DJBs5zz1w+atxJmiZS+ HocIUPoY4g/ruaB5i4QSi2QCuSsTUjMQMi2VTetiLp1UM41BrXKpTz46eJs0rFOHg57uVNVR8ps g== X-Received: by 2002:a17:902:cf03:b0:2c1:f262:4962 with SMTP id d9443c01a7336-2c718f03795mr93845965ad.20.1781979435385; Sat, 20 Jun 2026 11:17:15 -0700 (PDT) Received: from localhost.localdomain ([220.85.166.190]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2c7436af6d9sm30339465ad.4.2026.06.20.11.17.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 20 Jun 2026 11:17:14 -0700 (PDT) From: Youngjun Park X-Google-Original-From: Youngjun Park To: akpm@linux-foundation.org Cc: chrisl@kernel.org, youngjun.park@lge.com, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kasong@tencent.com, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com, 
baoquan.he@linux.dev, baohua@kernel.org, yosry@kernel.org, gunho.lee@lge.com, taejoon.song@lge.com, hyungjun.cho@lge.com, mkoutny@suse.com, baver.bae@lge.com, matia.kim@lge.com Subject: [PATCH v9 5/6] selftests/mm: add a swap tier configuration test Date: Sun, 21 Jun 2026 03:16:30 +0900 Message-ID: <20260620181635.299364-6-youngjun.park@lge.com> X-Mailer: git-send-email 2.48.1 In-Reply-To: <20260620181635.299364-1-youngjun.park@lge.com> References: <20260620181635.299364-1-youngjun.park@lge.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This commit adds a test program for the global swap tier interface at /sys/kernel/mm/swap/tiers. It checks the add, split and remove operations and the documented error and batch atomicity rules. It also checks that a tier with an active swap device cannot be removed until the device is swapped off. That device is a zram device, and the check is skipped when zram is not available. 
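For illustration only (this is not part of the patch), the range rule the test checks can be modelled in user space: a tier ends one below the start of the next higher tier, and the top tier ends at SHRT_MAX. The struct and function names below are hypothetical and do not come from the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical user-space model of one tier; not a kernel structure. */
struct tier {
	const char *name;
	int start;	/* lowest priority covered by this tier */
	int end;	/* highest priority covered, derived below */
};

/* Order tiers by ascending start priority. */
static int cmp_start(const void *a, const void *b)
{
	const struct tier *x = a, *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/*
 * Derive each tier's end from the start of the next higher tier.
 * The highest tier extends up to SHRT_MAX.
 */
static void derive_ranges(struct tier *t, size_t n)
{
	size_t i;

	assert(n > 0);
	qsort(t, n, sizeof(*t), cmp_start);
	for (i = 0; i < n; i++)
		t[i].end = (i + 1 < n) ? t[i + 1].start - 1 : SHRT_MAX;
}
```

With starts -1, 50 and 100 this gives [-1, 49], [50, 99] and [100, SHRT_MAX], which is the layout test_split() expects after "+mid:100".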
Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Youngjun Park --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/config | 2 + tools/testing/selftests/mm/run_vmtests.sh | 5 + tools/testing/selftests/mm/swap_tier.c | 323 ++++++++++++++++++++++ 5 files changed, 332 insertions(+) create mode 100644 tools/testing/selftests/mm/swap_tier.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftest= s/mm/.gitignore index 9ccd9e1447e6..a6e588c7979e 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -46,6 +46,7 @@ hmm-tests memfd_secret soft-dirty split_huge_page_test +swap_tier ksm_tests local_config.h local_config.mk diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/= mm/Makefile index e6df968f0971..1836127df092 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -105,6 +105,7 @@ TEST_GEN_FILES +=3D guard-regions TEST_GEN_FILES +=3D merge TEST_GEN_FILES +=3D rmap TEST_GEN_FILES +=3D folio_split_race_test +TEST_GEN_FILES +=3D swap_tier =20 ifneq ($(ARCH),arm64) TEST_GEN_FILES +=3D soft-dirty diff --git a/tools/testing/selftests/mm/config b/tools/testing/selftests/mm= /config index 06f78bd232e2..de3752e1bbd2 100644 --- a/tools/testing/selftests/mm/config +++ b/tools/testing/selftests/mm/config @@ -14,3 +14,5 @@ CONFIG_UPROBES=3Dy CONFIG_MEMORY_FAILURE=3Dy CONFIG_HWPOISON_INJECT=3Dm CONFIG_PROC_MEM_ALWAYS_FORCE=3Dy +CONFIG_SWAP=3Dy +CONFIG_ZRAM=3Dy diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/self= tests/mm/run_vmtests.sh index 8c296dedf047..1b0b8ec185a9 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -71,6 +71,8 @@ separated by spaces: tests for VM_PFNMAP handling - process_madv test for process_madv +- swap_tier + test the /sys/kernel/mm/swap/tiers configuration interface - cow test 
copy-on-write semantics - thp @@ -353,6 +355,9 @@ CATEGORY=3D"process_madv" run_test ./process_madv =20 CATEGORY=3D"vma_merge" run_test ./merge =20 +# swap tier configuration interface (/sys/kernel/mm/swap/tiers) +CATEGORY=3D"swap_tier" run_test ./swap_tier + if [ -x ./memfd_secret ] then if [ -f /proc/sys/kernel/yama/ptrace_scope ]; then diff --git a/tools/testing/selftests/mm/swap_tier.c b/tools/testing/selftes= ts/mm/swap_tier.c new file mode 100644 index 000000000000..b4fe21b0eb5d --- /dev/null +++ b/tools/testing/selftests/mm/swap_tier.c @@ -0,0 +1,323 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include + +#include "kselftest.h" + +#define TIERS_PATH "/sys/kernel/mm/swap/tiers" + +static int tiers_write(const char *cmd) +{ + int fd, ret =3D 0; + + fd =3D open(TIERS_PATH, O_WRONLY); + if (fd < 0) + return -errno; + if (write(fd, cmd, strlen(cmd)) < 0) + ret =3D -errno; + close(fd); + return ret; +} + +static int tier_range(const char *name, int *start, int *end) +{ + char buf[4096], *line, *save; + int fd; + ssize_t n; + + fd =3D open(TIERS_PATH, O_RDONLY); + if (fd < 0) + return -1; + n =3D read(fd, buf, sizeof(buf) - 1); + close(fd); + if (n < 0) + return -1; + buf[n] =3D '\0'; + + for (line =3D strtok_r(buf, "\n", &save); line; + line =3D strtok_r(NULL, "\n", &save)) { + char tname[64]; + int idx, s, e; + + /* The header line has no integer columns, so sscanf skips it. 
 */
+		if (sscanf(line, "%63s %d %d %d", tname, &idx, &s, &e) != 4)
+			continue;
+		if (!strcmp(tname, name)) {
+			*start = s;
+			*end = e;
+			return 0;
+		}
+	}
+	return -1;
+}
+
+static bool tier_exists(const char *name)
+{
+	int s, e;
+
+	return tier_range(name, &s, &e) == 0;
+}
+
+static bool range_is(const char *name, int start, int end)
+{
+	int s, e;
+
+	if (tier_range(name, &s, &e))
+		return false;
+	return s == start && e == end;
+}
+
+static int tier_count(void)
+{
+	char buf[4096], *line, *save;
+	int fd, count = 0;
+	ssize_t n;
+
+	fd = open(TIERS_PATH, O_RDONLY);
+	if (fd < 0)
+		return -1;
+	n = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+	if (n < 0)
+		return -1;
+	buf[n] = '\0';
+
+	for (line = strtok_r(buf, "\n", &save); line;
+	     line = strtok_r(NULL, "\n", &save)) {
+		char tname[64];
+		int idx, s, e;
+
+		if (sscanf(line, "%63s %d %d %d", tname, &idx, &s, &e) == 4)
+			count++;
+	}
+	return count;
+}
+
+/*
+ * A single add at a priority above -1, from the empty set, leaves the range
+ * below it uncovered and must be rejected. The set stays empty.
+ */
+static int test_coverage(void)
+{
+	if (tiers_write("+orphan:100") != -EINVAL)
+		return KSFT_FAIL;
+	if (tier_exists("orphan"))
+		return KSFT_FAIL;
+	return KSFT_PASS;
+}
+
+/*
+ * Add two tiers covering the full range. The end priority of a tier is the
+ * start of the next higher tier minus one.
+ */
+static int test_add(void)
+{
+	if (tiers_write("+lo:-1 +hi:50"))
+		return KSFT_FAIL;
+	if (!range_is("hi", 50, SHRT_MAX) || !range_is("lo", -1, 49))
+		return KSFT_FAIL;
+	return KSFT_PASS;
+}
+
+/* Adding a tier inside an existing range splits it. The lower part shrinks. */
+static int test_split(void)
+{
+	if (tiers_write("+mid:100"))
+		return KSFT_FAIL;
+	if (!range_is("mid", 100, SHRT_MAX) ||
+	    !range_is("hi", 50, 99) ||
+	    !range_is("lo", -1, 49))
+		return KSFT_FAIL;
+	return KSFT_PASS;
+}
+
+/* Removing a tier merges its range into the adjacent (lower) tier.
 */
+static int test_remove(void)
+{
+	/* Remove the top tier: 'hi' re-expands upward to SHRT_MAX. */
+	if (tiers_write("-mid"))
+		return KSFT_FAIL;
+	if (tier_exists("mid") || !range_is("hi", 50, SHRT_MAX))
+		return KSFT_FAIL;
+
+	/* Remove the lowest tier: 'hi' shifts its start down to -1. */
+	if (tiers_write("-lo"))
+		return KSFT_FAIL;
+	if (tier_exists("lo") || !range_is("hi", -1, SHRT_MAX))
+		return KSFT_FAIL;
+
+	return KSFT_PASS;
+}
+
+/* Each invalid operation must fail with its documented errno. State: hi[-1,MAX]. */
+static int test_errors(void)
+{
+	if (tiers_write("+hi:100") != -EEXIST)	/* duplicate name */
+		return KSFT_FAIL;
+	if (tiers_write("+bad.name:100") != -EINVAL)	/* illegal name */
+		return KSFT_FAIL;
+	if (tiers_write("+dup:-1") != -EBUSY)	/* priority in use */
+		return KSFT_FAIL;
+	if (tiers_write("+low:-2") != -EINVAL)	/* prio < DEF_SWAP_PRIO */
+		return KSFT_FAIL;
+	return KSFT_PASS;
+}
+
+/*
+ * A write carrying several operations is atomic: if any operation fails, the
+ * whole batch is rolled back. The second '+a' duplicates the first and fails,
+ * so neither must take effect. State before/after: hi[-1,MAX].
+ */
+static int test_atomic(void)
+{
+	if (tiers_write("+a:50 +a:60") != -EEXIST)
+		return KSFT_FAIL;
+	if (tier_exists("a") || !range_is("hi", -1, SHRT_MAX))
+		return KSFT_FAIL;
+	return KSFT_PASS;
+}
+
+static int zram_add(long size)
+{
+	char path[128], val[64];
+	ssize_t n;
+	int idx, fd;
+
+	fd = open("/sys/class/zram-control/hot_add", O_RDONLY);
+	if (fd < 0)
+		return -1;
+	n = read(fd, val, sizeof(val) - 1);
+	close(fd);
+	if (n <= 0)
+		return -1;
+	val[n] = '\0';
+	idx = atoi(val);
+
+	snprintf(path, sizeof(path), "/sys/block/zram%d/disksize", idx);
+	fd = open(path, O_WRONLY);
+	if (fd < 0)
+		return -1;
+	snprintf(val, sizeof(val), "%ld", size);
+	n = write(fd, val, strlen(val));
+	close(fd);
+	return n < 0 ? -1 : idx;
+}
+
+static void zram_remove(int idx)
+{
+	char val[16];
+	int fd;
+
+	fd = open("/sys/class/zram-control/hot_remove", O_WRONLY);
+	if (fd < 0)
+		return;
+	snprintf(val, sizeof(val), "%d", idx);
+	if (write(fd, val, strlen(val)) < 0)
+		;	/* ignore: best-effort cleanup */
+	close(fd);
+}
+
+static int swap_setup(const char *dev, int prio)
+{
+	char cmd[128];
+
+	snprintf(cmd, sizeof(cmd), "mkswap %s >/dev/null 2>&1", dev);
+	if (system(cmd))
+		return -1;
+	return swapon(dev, SWAP_FLAG_PREFER | (prio & SWAP_FLAG_PRIO_MASK));
+}
+
+/* A tier holding an active swap device can't be removed until swapoff. */
+static int test_device_pins_tier(void)
+{
+	char dev[32];
+	int zidx, ret = KSFT_FAIL;
+
+	if (tiers_write("+top:50"))
+		return KSFT_FAIL;
+
+	zidx = zram_add(64 << 20);
+	if (zidx < 0) {
+		ret = KSFT_SKIP;
+		goto out_tier;
+	}
+	snprintf(dev, sizeof(dev), "/dev/zram%d", zidx);
+	if (swap_setup(dev, 50)) {
+		ret = KSFT_SKIP;
+		goto out_zram;
+	}
+
+	if (tiers_write("-top") == -EBUSY) {	/* blocked while active */
+		swapoff(dev);
+		if (!tiers_write("-top"))	/* removable after swapoff */
+			ret = KSFT_PASS;
+	} else {
+		swapoff(dev);
+	}
+out_zram:
+	zram_remove(zidx);
+out_tier:
+	tiers_write("-top");
+	return ret;
+}
+
+/* Remove all remaining tiers, so a mid-test failure still leaves them empty.
 */
+static void tiers_clear(void)
+{
+	char buf[4096], *line, *save;
+	int fd;
+	ssize_t n;
+
+	fd = open(TIERS_PATH, O_RDONLY);
+	if (fd < 0)
+		return;
+	n = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+	if (n < 0)
+		return;
+	buf[n] = '\0';
+
+	for (line = strtok_r(buf, "\n", &save); line;
+	     line = strtok_r(NULL, "\n", &save)) {
+		char name[64], cmd[80];
+		int idx, s, e;
+
+		if (sscanf(line, "%63s %d %d %d", name, &idx, &s, &e) != 4)
+			continue;
+		snprintf(cmd, sizeof(cmd), "-%s", name);
+		tiers_write(cmd);
+	}
+}
+
+int main(void)
+{
+	ksft_print_header();
+	ksft_set_plan(7);
+
+	if (geteuid() != 0)
+		ksft_exit_skip("test requires root\n");
+	if (access(TIERS_PATH, F_OK))
+		ksft_exit_skip("%s not present (CONFIG_SWAP/tiers)\n", TIERS_PATH);
+	if (tier_count() != 0)
+		ksft_exit_skip("swap tiers already configured; run on a clean system\n");
+
+	ksft_test_result(test_coverage() == KSFT_PASS, "coverage rule\n");
+	ksft_test_result(test_add() == KSFT_PASS, "add tiers\n");
+	ksft_test_result(test_split() == KSFT_PASS, "split tier\n");
+	ksft_test_result(test_remove() == KSFT_PASS, "remove and merge\n");
+	ksft_test_result(test_errors() == KSFT_PASS, "invalid operations\n");
+	ksft_test_result(test_atomic() == KSFT_PASS, "batch atomicity\n");
+
+	ksft_test_result_code(test_device_pins_tier(), "device pins its tier", NULL);
+
+	tiers_clear();
+
+	ksft_finished();
+}
-- 
2.48.1

From nobody Mon Jun 22 07:37:01 2026
Received: from mail-pg1-f180.google.com (mail-pg1-f180.google.com [209.85.215.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B239D348C5E for ; Sat, 20 Jun 2026 18:17:21 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.180
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781979443; cv=none;
b=r1+GhTx2SarDbdZW0iuglbykLxsONFwD3mr/rd5CH7akSLVTaEC8le17HuopoajFGxGgTtH/d3kGIcK36E3U7fDokm3yNg6VqnS1YL9X0YRgfVX+8Vz+vjimqftdg1iOXRc7gyCBlr3SA+I0y+xDEX2jdssiTHFluGf7cOtSp+o= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781979443; c=relaxed/simple; bh=qBJUy0V7WEPdRQARhiTv8C7JgaiMP82scW2vnf9sc74=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=GesF5aYNHO8z+HGDNJvMDK/0L+yEioeAmh0lV7NQTUfFSrxh/7eonfT9gisHg+rJxO6U6hKSMce/jCw69Hl2rpbkFof19WwIOZJ9BL2SkNcr6Lp5H8GWczUQFgMAKzLw6qmVYWyuTgOPi7+b+YrBWyKFJ5Mdo3pCoCRhHwWi38Q= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=XivMFJ7S; arc=none smtp.client-ip=209.85.215.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="XivMFJ7S" Received: by mail-pg1-f180.google.com with SMTP id 41be03b00d2f7-c88eb5e7713so1203038a12.1 for ; Sat, 20 Jun 2026 11:17:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1781979441; x=1782584241; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=tG1goLxilAMOHGDfAUrbCLR/sxZCH1YlTcgqW/QVggE=; b=XivMFJ7SOE708BUOU4w6L1ZUzqSrbOrIKtoX/Lc7b4v0ALv0RunpiYWf7FbWmC++Au ZaZgBd/vJ8ibrSfMFmvJXsyQqXIQTiZ2xEEvBZzHSQY4oq1+h/5TnB9TF87SuK05eNpR VuPiUfC3sGhgcwJ7DdD1acyAs8UnMzlBhW6wCbMSjnBbTOohDwxOh4tE+7fqX8skI7Dj yA6GVQBRqWI+ChoeY0DnqmADd45R+E2UUnu7EMmIEkOgKSXS7a8ZHzzVSwo4AglrTQ4N EbhKx1wHHsdlPiUY3mBjrXybH0MR1bW7MJwQez4hOYWhWUPWu/fHKb4kqjzJQQiSXByA eWfQ== 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1781979441; x=1782584241; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=tG1goLxilAMOHGDfAUrbCLR/sxZCH1YlTcgqW/QVggE=; b=C2FfXNAteERCNrFtf31c5OKO2/qy/0Z8AJAgdkcTnpmF2UoU2JWL7rFMLEGv0UqHsO Pb0K6UuIzrCaopbVZZFbJTYoXqiEn5y6CFDRZ1QvPquKIx2baINppvN6rfh/JyefojVq GqdllRXTVBG13QGA5BkfaXtgBJXFpNnLyyTuZszLlqoG94brqRL+0Dr6+TXVvN8dYGjD FACBiODitCqLvv9sj6NlBRqFfQH0ALoGAqCN+jHrMKAFc4zuNWkOdPKJTZBPEGVh+IQ4 1jmD4buqck7isn+Ztg05FPtljZ9Yq8GrfXDThZFRjEqFQ0ZRXgWt/2HRMBLVoZgMYwbV gI1g== X-Forwarded-Encrypted: i=1; AHgh+RqfZ9eBMr4QluCDXvGFwI7vdlDUttibCTVdraU02+lBeNM41ipF1C5Z+QTx/D03ReztiWls+U5lEIT87hw=@vger.kernel.org X-Gm-Message-State: AOJu0YzU0XsINRT/dyGDUaTXxA1MylkWtEyP9GxbAQFTMmxpLDwsktr3 8D08P6jmZG1TteoABDHKF1CWGGS53ogkAnnJSv0+H74qn5zy6wFrq+zn X-Gm-Gg: AfdE7ckQwNQ2HbYs3VTb+eq/hszjGgmf8uaBiPyFxmaRM1dTnArpDMBvKizQgMzuOD0 3kBmy9XlEqcBEAmTM1/t0qqahYddnUNWpn3S8opZ7LQe9AumveMBIEDfekDvrMESEeRAeqxn0wV G87CyspTvbPk+AknzkNM4z0ILKnUdJlqUge7p8SbxUtnDngFzTwYKR1iV5qZuw95a7J6da5qSFw stlcsrT96t/ZZQYgoKDQ/YpnotPbXGEdUWYBj68Q4CMDLppdDevMhdAY+uDFTqFoH63Toa7Fu8b w8wyMdP5eUFN5lukhLOJd8bwwGkBsaaC4bebL7ParAiQJpEgw7YHYKQCepv7HBpaFD5eyFf7WRA O3Uamc3hAS/GfQJIp/AIkJCP/+G2hz/X7V9Du8qSRmeUwZe1TxE+7s0d7QJPo/izU8Cf0K+2AHm lxnNv51JDZoDWIWZy3+//0u7FyIeLk0TbYy2aGRbqqSy8G3bg14LWZYJZNVE1W29xravTOFdVN1 8Tstahoj0xp X-Received: by 2002:a17:902:e805:b0:2b2:4b4e:e4d2 with SMTP id d9443c01a7336-2c725b96249mr73239185ad.15.1781979440730; Sat, 20 Jun 2026 11:17:20 -0700 (PDT) Received: from localhost.localdomain ([220.85.166.190]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2c7436af6d9sm30339465ad.4.2026.06.20.11.17.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 20 Jun 2026 11:17:20 -0700 (PDT) From: Youngjun Park X-Google-Original-From: Youngjun Park To: akpm@linux-foundation.org Cc: 
chrisl@kernel.org, youngjun.park@lge.com, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kasong@tencent.com, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com, baoquan.he@linux.dev, baohua@kernel.org, yosry@kernel.org, gunho.lee@lge.com, taejoon.song@lge.com, hyungjun.cho@lge.com, mkoutny@suse.com, baver.bae@lge.com, matia.kim@lge.com Subject: [PATCH v9 6/6] selftests/cgroup: add a swap tier routing test Date: Sun, 21 Jun 2026 03:16:31 +0900 Message-ID: <20260620181635.299364-7-youngjun.park@lge.com> X-Mailer: git-send-email 2.48.1 In-Reply-To: <20260620181635.299364-1-youngjun.park@lge.com> References: <20260620181635.299364-1-youngjun.park@lge.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This commit adds a test program for the per-cgroup swap tier control memory.swap.tiers.max. It checks the default mask, toggling a tier, rejection of invalid input, and that recreating a tier resets the mask. It also checks that a cgroup's pages swap only to an allowed tier, including across the parent and child hierarchy. The routing check uses two zram devices placed in different tiers. 
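For illustration only (this is not part of the patch), the hierarchy rule the routing tests rely on can be modelled as a bitmask intersection: a tier is usable by a cgroup only if the cgroup and every ancestor allow it. The struct, macros and function below are hypothetical user-space stand-ins, not the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tier bits for the two tiers used by the test. */
#define TIER_SLOW	(1u << 0)
#define TIER_FAST	(1u << 1)
#define TIER_ALL	(TIER_SLOW | TIER_FAST)

/* Hypothetical model of a cgroup: its own mask and a link to its parent. */
struct cg_model {
	const struct cg_model *parent;
	unsigned int mask;	/* tiers this cgroup itself allows */
};

/* Intersect the cgroup's own mask with the masks of all its ancestors. */
static unsigned int effective_mask(const struct cg_model *cg)
{
	unsigned int mask = TIER_ALL;

	assert(cg != NULL);
	for (; cg; cg = cg->parent)
		mask &= cg->mask;
	return mask;
}
```

Under this model a parent that cleared the 'fast' bit keeps it cleared for a child that sets "fast max", and a child can still clear a bit its parent left set. Both cases end with only the 'slow' tier usable, which is what the two hierarchy routing cases expect.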
Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Youngjun Park --- tools/testing/selftests/cgroup/.gitignore | 1 + tools/testing/selftests/cgroup/Makefile | 2 + tools/testing/selftests/cgroup/config | 2 + .../selftests/cgroup/test_swap_tiers.c | 500 ++++++++++++++++++ 4 files changed, 505 insertions(+) create mode 100644 tools/testing/selftests/cgroup/test_swap_tiers.c diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/self= tests/cgroup/.gitignore index 952e4448bf07..77b8e6c3e592 100644 --- a/tools/testing/selftests/cgroup/.gitignore +++ b/tools/testing/selftests/cgroup/.gitignore @@ -8,5 +8,6 @@ test_kill test_kmem test_memcontrol test_pids +test_swap_tiers test_zswap wait_inotify diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selfte= sts/cgroup/Makefile index e01584c2189a..a98e3c414cd5 100644 --- a/tools/testing/selftests/cgroup/Makefile +++ b/tools/testing/selftests/cgroup/Makefile @@ -16,6 +16,7 @@ TEST_GEN_PROGS +=3D test_kill TEST_GEN_PROGS +=3D test_kmem TEST_GEN_PROGS +=3D test_memcontrol TEST_GEN_PROGS +=3D test_pids +TEST_GEN_PROGS +=3D test_swap_tiers TEST_GEN_PROGS +=3D test_zswap =20 LOCAL_HDRS +=3D $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidf= d.h @@ -32,4 +33,5 @@ $(OUTPUT)/test_kill: $(LIBCGROUP_O) $(OUTPUT)/test_kmem: $(LIBCGROUP_O) $(OUTPUT)/test_memcontrol: $(LIBCGROUP_O) $(OUTPUT)/test_pids: $(LIBCGROUP_O) +$(OUTPUT)/test_swap_tiers: $(LIBCGROUP_O) $(OUTPUT)/test_zswap: $(LIBCGROUP_O) diff --git a/tools/testing/selftests/cgroup/config b/tools/testing/selftest= s/cgroup/config index 39f979690dd3..6086bb5bba97 100644 --- a/tools/testing/selftests/cgroup/config +++ b/tools/testing/selftests/cgroup/config @@ -4,3 +4,5 @@ CONFIG_CGROUP_FREEZER=3Dy CONFIG_CGROUP_SCHED=3Dy CONFIG_MEMCG=3Dy CONFIG_PAGE_COUNTER=3Dy +CONFIG_SWAP=3Dy +CONFIG_ZRAM=3Dy diff --git a/tools/testing/selftests/cgroup/test_swap_tiers.c b/tools/testi= ng/selftests/cgroup/test_swap_tiers.c new file mode 100644 index 
000000000000..24420c1ef398 --- /dev/null +++ b/tools/testing/selftests/cgroup/test_swap_tiers.c @@ -0,0 +1,500 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "kselftest.h" +#include "cgroup_util.h" + +#ifndef MADV_PAGEOUT +#define MADV_PAGEOUT 21 +#endif + +#define TIERS_PATH "/sys/kernel/mm/swap/tiers" +#define TIERS_MAX "memory.swap.tiers.max" + +static int tiers_write(const char *cmd) +{ + int fd, ret =3D 0; + + fd =3D open(TIERS_PATH, O_WRONLY); + if (fd < 0) + return -errno; + if (write(fd, cmd, strlen(cmd)) < 0) + ret =3D -errno; + close(fd); + return ret; +} + +static int tier_count(void) +{ + char buf[4096], *line, *save; + int fd, count =3D 0; + ssize_t n; + + fd =3D open(TIERS_PATH, O_RDONLY); + if (fd < 0) + return -1; + n =3D read(fd, buf, sizeof(buf) - 1); + close(fd); + if (n < 0) + return -1; + buf[n] =3D '\0'; + + for (line =3D strtok_r(buf, "\n", &save); line; + line =3D strtok_r(NULL, "\n", &save)) { + char name[64]; + int idx, s, e; + + if (sscanf(line, "%63s %d %d %d", name, &idx, &s, &e) =3D=3D 4) + count++; + } + return count; +} + +static long swap_used_kb(const char *dev) +{ + char line[256]; + long used =3D -1; + FILE *f; + + f =3D fopen("/proc/swaps", "r"); + if (!f) + return -1; + while (fgets(line, sizeof(line), f)) { + char name[128], type[64]; + long size, u, prio; + + if (sscanf(line, "%127s %63s %ld %ld %ld", + name, type, &size, &u, &prio) >=3D 4 && + !strcmp(name, dev)) { + used =3D u; + break; + } + } + fclose(f); + return used; +} + +static int swap_active_count(void) +{ + char line[256]; + int n =3D 0; + FILE *f; + + f =3D fopen("/proc/swaps", "r"); + if (!f) + return -1; + if (fgets(line, sizeof(line), f)) /* header */ + while (fgets(line, sizeof(line), f)) + n++; + fclose(f); + return n; +} + +static int zram_add(long size) +{ + char path[128], val[64]; + ssize_t n; + int idx, fd; + + fd =3D 
open("/sys/class/zram-control/hot_add", O_RDONLY); + if (fd < 0) + return -1; + n =3D read(fd, val, sizeof(val) - 1); + close(fd); + if (n <=3D 0) + return -1; + val[n] =3D '\0'; + idx =3D atoi(val); + + snprintf(path, sizeof(path), "/sys/block/zram%d/disksize", idx); + fd =3D open(path, O_WRONLY); + if (fd < 0) + return -1; + snprintf(val, sizeof(val), "%ld", size); + n =3D write(fd, val, strlen(val)); + close(fd); + return n < 0 ? -1 : idx; +} + +static void zram_remove(int idx) +{ + char val[16]; + int fd; + + fd =3D open("/sys/class/zram-control/hot_remove", O_WRONLY); + if (fd < 0) + return; + snprintf(val, sizeof(val), "%d", idx); + if (write(fd, val, strlen(val)) < 0) + ; /* ignore: best-effort cleanup */ + close(fd); +} + +static int swap_setup(const char *dev, int prio) +{ + char cmd[128]; + + snprintf(cmd, sizeof(cmd), "mkswap %s >/dev/null 2>&1", dev); + if (system(cmd)) + return -1; + return swapon(dev, SWAP_FLAG_PREFER | (prio & SWAP_FLAG_PRIO_MASK)); +} + +/* A new cgroup may use every tier ("max"). */ +static int test_default(const char *root) +{ + char *cg =3D cg_name(root, "swaptier_default"); + int ret =3D KSFT_FAIL; + + if (!cg || cg_create(cg)) + goto out; + if (!cg_read_strstr(cg, TIERS_MAX, "fast max") && + !cg_read_strstr(cg, TIERS_MAX, "slow max")) + ret =3D KSFT_PASS; +out: + if (cg) { + cg_destroy(cg); + free(cg); + } + return ret; +} + +/* A tier can be disabled and re-enabled, and the change reads back. */ +static int test_toggle(const char *root) +{ + char *cg =3D cg_name(root, "swaptier_toggle"); + int ret =3D KSFT_FAIL; + + if (!cg || cg_create(cg)) + goto out; + if (cg_write(cg, TIERS_MAX, "fast 0")) + goto out; + if (cg_read_strstr(cg, TIERS_MAX, "fast 0")) + goto out; + if (cg_write(cg, TIERS_MAX, "fast max")) + goto out; + if (cg_read_strstr(cg, TIERS_MAX, "fast max")) + goto out; + ret =3D KSFT_PASS; +out: + if (cg) { + cg_destroy(cg); + free(cg); + } + return ret; +} + +/* An unknown tier name or a bad value must be rejected. 
*/ +static int test_invalid(const char *root) +{ + char *cg =3D cg_name(root, "swaptier_invalid"); + int ret =3D KSFT_FAIL; + + if (!cg || cg_create(cg)) + goto out; + if (!cg_write(cg, TIERS_MAX, "nosuchtier 0")) + goto out; + if (!cg_write(cg, TIERS_MAX, "fast bogus")) + goto out; + ret =3D KSFT_PASS; +out: + if (cg) { + cg_destroy(cg); + free(cg); + } + return ret; +} + +/* A tier recreated by the same name is allowed again, even if disabled be= fore. */ +static int test_recreate(const char *root) +{ + char *cg =3D cg_name(root, "swaptier_recreate"); + int ret =3D KSFT_FAIL; + + if (!cg || cg_create(cg)) + goto out; + if (cg_write(cg, TIERS_MAX, "fast 0")) + goto out; + if (cg_read_strstr(cg, TIERS_MAX, "fast 0")) + goto out; + if (tiers_write("-fast") || tiers_write("+fast:10")) + goto out; + if (cg_read_strstr(cg, TIERS_MAX, "fast max")) + goto out; + ret =3D KSFT_PASS; +out: + if (cg) { + cg_destroy(cg); + free(cg); + } + return ret; +} + +/* Map anon memory, fault it in, push it to swap, then wait to be killed. = */ +static int swapout_child(const char *cgroup, void *arg) +{ + size_t size =3D (size_t)arg; + char *mem; + size_t i; + int page_size; + + mem =3D mmap(NULL, size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mem =3D=3D MAP_FAILED) + return -1; + + page_size =3D sysconf(_SC_PAGE_SIZE); + for (i =3D 0; i < size; i +=3D page_size) + mem[i] =3D 'x'; + if (madvise(mem, size, MADV_PAGEOUT)) + return -1; + /* Hold the swap entries while the parent inspects /proc/swaps. */ + pause(); + return 0; +} + +static int run_routing_case(const char *cg) +{ + char fast_dev[32], slow_dev[32]; + int zfast =3D -1, zslow =3D -1; + long used_fast, used_slow; + int ret =3D KSFT_SKIP; + pid_t pid =3D -1; + int i; + + /* Only our devices must be present, so usage is unambiguous. 
 */
+	if (swap_active_count() != 0)
+		return KSFT_SKIP;
+
+	zfast = zram_add(MB(128));
+	zslow = zram_add(MB(128));
+	snprintf(fast_dev, sizeof(fast_dev), "/dev/zram%d", zfast);
+	snprintf(slow_dev, sizeof(slow_dev), "/dev/zram%d", zslow);
+	if (zfast < 0 || zslow < 0)
+		goto out;
+
+	/* prio 10 -> 'fast' tier [10, MAX]; prio 0 -> 'slow' tier [-1, 9]. */
+	if (swap_setup(fast_dev, 10) || swap_setup(slow_dev, 0))
+		goto out;
+
+	ret = KSFT_FAIL;
+
+	pid = cg_run_nowait(cg, swapout_child, (void *)MB(64));
+	if (pid < 0)
+		goto out;
+
+	for (i = 0; i < 50; i++) {	/* up to ~5s for pageout */
+		if (swap_used_kb(slow_dev) > 0)
+			break;
+		usleep(100000);
+	}
+
+	used_fast = swap_used_kb(fast_dev);
+	used_slow = swap_used_kb(slow_dev);
+	if (used_slow > 0 && used_fast == 0)
+		ret = KSFT_PASS;
+	else
+		ksft_print_msg("routing[%s]: fast=%ldKB slow=%ldKB (want fast=0, slow>0)\n",
+			       cg, used_fast, used_slow);
+out:
+	if (pid > 0) {
+		kill(pid, SIGKILL);
+		waitpid(pid, NULL, 0);
+	}
+	if (zfast >= 0) {
+		swapoff(fast_dev);
+		zram_remove(zfast);
+	}
+	if (zslow >= 0) {
+		swapoff(slow_dev);
+		zram_remove(zslow);
+	}
+	return ret;
+}
+
+/*
+ * A cgroup that disabled the high-priority 'fast' tier must swap only to the
+ * 'slow' tier's device; the fast device must stay untouched.
+ */
+static int test_routing(const char *root)
+{
+	char *cg = cg_name(root, "swaptier_routing");
+	int ret = KSFT_FAIL;
+
+	if (!cg || cg_create(cg))
+		goto out;
+	if (cg_write(cg, TIERS_MAX, "fast 0"))
+		goto out;
+	ret = run_routing_case(cg);
+out:
+	if (cg) {
+		cg_destroy(cg);
+		free(cg);
+	}
+	return ret;
+}
+
+/* Create @name under @root and delegate the memory controller to its children.
*/
+static char *make_parent(const char *root, const char *name)
+{
+	char *cg = cg_name(root, name);
+
+	if (cg && !cg_create(cg) &&
+	    !cg_write(cg, "cgroup.subtree_control", "+memory"))
+		return cg;
+
+	if (cg) {
+		cg_destroy(cg);
+		free(cg);
+	}
+	return NULL;
+}
+
+/*
+ * The effective mask is the parent's intersected with the child's, so a tier
+ * the parent disabled stays disabled for the child even if the child re-enables
+ * it. Parent disables 'fast', child sets 'fast max' -> child still swaps slow.
+ */
+static int test_routing_parent_wins(const char *root)
+{
+	char *parent = make_parent(root, "swaptier_pwins");
+	char *child = NULL;
+	int ret = KSFT_FAIL;
+
+	if (!parent)
+		goto out;
+	if (cg_write(parent, TIERS_MAX, "fast 0"))
+		goto out;
+
+	child = cg_name(parent, "child");
+	if (!child || cg_create(child))
+		goto out;
+	if (cg_write(child, TIERS_MAX, "fast max")) /* child tries to re-enable */
+		goto out;
+
+	ret = run_routing_case(child);
+out:
+	if (child) {
+		cg_destroy(child);
+		free(child);
+	}
+	if (parent) {
+		cg_destroy(parent);
+		free(parent);
+	}
+	return ret;
+}
+
+/*
+ * A child can restrict below its parent: the parent leaves all tiers enabled,
+ * the child disables 'fast' on its own -> the child swaps only to slow.
+ */
+static int test_routing_child_restricts(const char *root)
+{
+	char *parent = make_parent(root, "swaptier_crestr");
+	char *child = NULL;
+	int ret = KSFT_FAIL;
+
+	if (!parent)
+		goto out;
+
+	child = cg_name(parent, "child");
+	if (!child || cg_create(child))
+		goto out;
+	if (cg_write(child, TIERS_MAX, "fast 0"))
+		goto out;
+
+	ret = run_routing_case(child);
+out:
+	if (child) {
+		cg_destroy(child);
+		free(child);
+	}
+	if (parent) {
+		cg_destroy(parent);
+		free(parent);
+	}
+	return ret;
+}
+
+/* Remove all remaining tiers, so a mid-test failure still leaves them empty.
*/
+static void tiers_clear(void)
+{
+	char buf[4096], *line, *save;
+	int fd;
+	ssize_t n;
+
+	fd = open(TIERS_PATH, O_RDONLY);
+	if (fd < 0)
+		return;
+	n = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+	if (n < 0)
+		return;
+	buf[n] = '\0';
+
+	for (line = strtok_r(buf, "\n", &save); line;
+	     line = strtok_r(NULL, "\n", &save)) {
+		char name[64], cmd[80];
+		int idx, s, e;
+
+		if (sscanf(line, "%63s %d %d %d", name, &idx, &s, &e) != 4)
+			continue;
+		snprintf(cmd, sizeof(cmd), "-%s", name);
+		tiers_write(cmd);
+	}
+}
+
+int main(void)
+{
+	char root[PATH_MAX];
+
+	ksft_print_header();
+	ksft_set_plan(7);
+
+	if (geteuid() != 0)
+		ksft_exit_skip("test requires root\n");
+	if (cg_find_unified_root(root, sizeof(root), NULL))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+	if (cg_read_strstr(root, "cgroup.controllers", "memory"))
+		ksft_exit_skip("memory controller isn't available\n");
+	if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+		if (cg_write(root, "cgroup.subtree_control", "+memory"))
+			ksft_exit_skip("failed to enable memory controller\n");
+	if (access(TIERS_PATH, F_OK))
+		ksft_exit_skip("swap tiers interface not present\n");
+	if (tier_count() != 0)
+		ksft_exit_skip("swap tiers already configured; run on a clean system\n");
+
+	/* Two tiers: fast = [10, MAX], slow = [-1, 9].
*/
+	if (tiers_write("+slow:-1 +fast:10"))
+		ksft_exit_skip("failed to configure swap tiers\n");
+
+	ksft_test_result(test_default(root) == KSFT_PASS, "default mask is max\n");
+	ksft_test_result(test_toggle(root) == KSFT_PASS, "enable/disable tier\n");
+	ksft_test_result(test_invalid(root) == KSFT_PASS, "invalid input rejected\n");
+	ksft_test_result(test_recreate(root) == KSFT_PASS,
+			 "recreated tier resets cgroup mask\n");
+
+	ksft_test_result_code(test_routing(root),
+			      "swapout honors tier mask", NULL);
+	ksft_test_result_code(test_routing_parent_wins(root),
+			      "child cannot re-enable a parent-disabled tier", NULL);
+	ksft_test_result_code(test_routing_child_restricts(root),
+			      "child can restrict tiers below its parent", NULL);
+
+	tiers_clear();
+
+	ksft_finished();
+}
-- 
2.48.1