From nobody Mon Oct 6 21:02:11 2025
From: Youngjun Park <youngjun.park@lge.com>
To: akpm@linux-foundation.org, hannes@cmpxchg.org
Cc: mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev, shikemeng@huaweicloud.com, kasong@tencent.com,
	nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	gunho.lee@lge.com, iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
	Michal Koutný <mkoutny@suse.com>
Subject: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
Date: Thu, 17 Jul 2025 05:20:03 +0900
Message-Id: <20250716202006.3640584-2-youngjun.park@lge.com>
In-Reply-To: <20250716202006.3640584-1-youngjun.park@lge.com>
References: <20250716202006.3640584-1-youngjun.park@lge.com>

In resource-constrained environments with limited RAM and storage, it is
often desirable to utilize remote or heterogeneous storage devices as
swap targets. To maximize responsiveness under memory pressure,
particularly for latency-critical applications, it is important to
control which cgroups use which swap devices.

This patch introduces a mechanism for assigning swap device priorities
on a per-cgroup basis. By allowing cgroups to customize the relative
priority of available swap devices, faster local swap can be reserved
for critical workloads, while background tasks can be directed to
slower or remote swap.

This commit provides the base infrastructure for priority tracking:

- Introduces `memory.swap.priority`, a new cgroup2 interface that allows
  setting per-device priorities using `<id> <priority>` pairs. The swap
  device ID corresponds to the identifier in `/proc/swaps`.
- Internally, priorities are tracked with `struct swap_cgroup_priority`,
  which holds dynamically allocated pnode structures
  (`struct swap_cgroup_priority_pnode`) per device.
- Objects are created on demand when the cgroup interface is written to,
  and automatically freed when:
  - The configured priorities match the global system defaults
  - The memory cgroup is removed
- Swapon and swapoff propagation is supported:
  - When a new swap device is activated, default values (e.g.,
    `default none`, `default disabled`) determine how the cgroup treats
    that device
  - When a swap device is removed via `swapoff`, it is cleared from all
    affected cgroups
- Priority semantics follow the global swap rules:
  - Higher values are preferred
  - Equal values round-robin
  - Negative values follow NUMA-aware fallback

The default value mechanism (`default none`, `default disabled`) was
proposed by Michal Koutný and integrated into the design to better
support swapon propagation and reduce configuration overhead. The
general design, including how to track priorities and manage per-cgroup
objects, was refined through internal discussions with Joonsoo Kim.

Enforcement logic within the swap allocator is introduced in the next
patch.
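A minimal usage sketch of the interface described above (the device IDs
and priority values are hypothetical, and the `Id` column of
`/proc/swaps` is only added later in this series):

  # cat /proc/swaps                           # look up device IDs
  # echo "1 100" > memory.swap.priority       # prefer device 1
  # echo "2 disabled" > memory.swap.priority  # never swap to device 2
  # echo "1 none" > memory.swap.priority      # revert device 1 to global
  # echo "default disabled" > memory.swap.priority
                                # unconfigured devices are not used
  # cat memory.swap.priority    # shows entries that differ from global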
Suggested-by: Michal Koutný <mkoutny@suse.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  62 ++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   3 +
 include/linux/swap.h                    |   3 +
 mm/Kconfig                              |  14 +
 mm/Makefile                             |   1 +
 mm/memcontrol.c                         |  91 ++-
 mm/swap_cgroup_priority.c               | 739 ++++++++++++++++++++++++
 mm/swap_cgroup_priority.h               |  86 +++
 mm/swapfile.c                           |  17 +-
 10 files changed, 1009 insertions(+), 9 deletions(-)
 create mode 100644 mm/swap_cgroup_priority.c
 create mode 100644 mm/swap_cgroup_priority.h

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index bd98ea3175ec..35fb9677f0d6 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1839,6 +1839,68 @@ The following nested keys are defined.
 	  higher than the limit for an extended period of time. This
 	  reduces the impact on the workload and memory management.

+  memory.swap.priority
+	A read-write flat-keyed file which exists on non-root cgroups.
+	This interface allows you to set per-swap-device priorities for
+	the current cgroup and to define how they differ from the global
+	swap system.
+
+	To assign priorities or define specific behaviors for swap
+	devices in the current cgroup, write one or more lines in the
+	following formats:
+
+	  - <id> <priority>
+	  - <id> disabled
+	  - <id> none
+	  - default none
+	  - default disabled
+
+	Each <id> refers to a unique swap device registered in the
+	system. You can check the ID, device path, and current priority
+	of active swap devices through the `/proc/swaps` file. This
+	provides a clear mapping between swap devices and the IDs used
+	in this interface.
+
+	The 'default' keyword sets the fallback priority behavior rule
+	for this cgroup. If no specific entry matches a swap device,
+	this default applies.
+
+	  * 'default none': This is the default if no configuration is
+	    explicitly written. Swap devices follow the system-wide
+	    swap priorities.
+
+	  * 'default disabled': All swap devices are excluded from this
+	    cgroup's swap priority list and will not be used by this
+	    cgroup.
+
+	The priority semantics are consistent with the global swap
+	system:
+
+	  - Higher numerical values indicate higher preference.
+	  - See Documentation/admin-guide/mm/swap_numa.rst for details
+	    on swap NUMA autobinding and negative priority rules.
+
+	The handling of negative priorities in this cgroup interface
+	has specific behaviors for assignment and restoration:
+
+	  * Negative Priority Assignment
+	    This interface allows you to explicitly override priorities
+	    with negative values. When you do so, the total number of
+	    negative slots and their order may shift depending on how
+	    the new value compares to existing ones:
+
+	    - If you override an existing priority (whether originally
+	      positive or negative) with a smaller (more negative)
+	      number, it may push other negative priorities upward
+	      (toward zero).
+
+	    - If you override an existing negative priority with a
+	      larger (less negative) number, it may push other negative
+	      priorities downward (more negative).
+
+	  * Negative Priority Restoration with 'none'
+	    When restoring a device's priority to its global value using
+	    'none', if the original priority was negative, it might not
+	    revert to the exact same global negative value if the total
+	    number of negative priorities in the cgroup has decreased.
+	    In such cases, you may need to adjust
+	    other negative priorities to restore the same ordering as the
+	    global swap configuration.
+
   memory.zswap.current
 	A read-only single value file which exists on non-root cgroups.
diff --git a/MAINTAINERS b/MAINTAINERS
index 60bba48f5479..d51ddc2272a7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6169,6 +6169,8 @@ F:	mm/memcontrol.c
 F:	mm/memcontrol-v1.c
 F:	mm/memcontrol-v1.h
 F:	mm/swap_cgroup.c
+F:	mm/swap_cgroup_priority.c
+F:	mm/swap_cgroup_priority.h
 F:	samples/cgroup/*
 F:	tools/testing/selftests/cgroup/memcg_protection.m
 F:	tools/testing/selftests/cgroup/test_hugetlb_memcg.c
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..625e59f9ecd2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -218,6 +218,9 @@ struct mem_cgroup {
 	bool zswap_writeback;
 #endif

+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	struct swap_cgroup_priority *swap_priority;
+#endif
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;

diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..bfddbec2ee28 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -339,6 +339,9 @@ struct swap_info_struct {
 	struct work_struct discard_work; /* discard worker */
 	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	u64 id;
+#endif
 	struct plist_node avail_lists[]; /*
					   * entries in swap_avail_heads, one
					   * entry per node.
					   */
diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..43751e8d0bc4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -190,6 +190,20 @@ config ZSMALLOC_CHAIN_SIZE

	  For more information, see zsmalloc documentation.

+config SWAP_CGROUP_PRIORITY
+	bool "Per cgroup swap priority (EXPERIMENTAL)"
+	depends on SWAP && CGROUPS
+	default n
+	help
+	  Enable per-cgroup swap device priority control.
+
+	  This option allows configuring swap device priorities on a
+	  per-cgroup basis, and makes it possible to exclude specific swap
+	  devices from use by a cgroup.
+
+	  If no configuration is set for a cgroup, it falls back to the
+	  system-wide swap device priorities defined at swapon time.
+ menu "Slab allocator options" =20 config SLUB diff --git a/mm/Makefile b/mm/Makefile index 1a7a11d4933d..dde27ee58a8d 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -76,6 +76,7 @@ ifdef CONFIG_MMU endif =20 obj-$(CONFIG_SWAP) +=3D page_io.o swap_state.o swapfile.o +obj-$(CONFIG_SWAP_CGROUP_PRIORITY) +=3D swap_cgroup_priority.o obj-$(CONFIG_ZSWAP) +=3D zswap.o obj-$(CONFIG_HAS_DMA) +=3D dmapool.o obj-$(CONFIG_HUGETLBFS) +=3D hugetlb.o diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 70fdeda1120b..ea207d498ad6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -69,6 +69,8 @@ #include #include "slab.h" #include "memcontrol-v1.h" +#include "swap.h" +#include "swap_cgroup_priority.h" =20 #include =20 @@ -3700,6 +3702,9 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) { lru_gen_exit_memcg(memcg); memcg_wb_domain_exit(memcg); +#ifdef CONFIG_SWAP_CGROUP_PRIORITY + delete_swap_cgroup_priority(memcg); +#endif __mem_cgroup_free(memcg); } =20 @@ -3793,6 +3798,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *pare= nt_css) =20 page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX); memcg1_soft_limit_reset(memcg); + #ifdef CONFIG_ZSWAP memcg->zswap_max =3D PAGE_COUNTER_MAX; WRITE_ONCE(memcg->zswap_writeback, true); @@ -3800,7 +3806,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *pare= nt_css) page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); if (parent) { WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent)); - page_counter_init(&memcg->memory, &parent->memory, memcg_on_dfl); page_counter_init(&memcg->swap, &parent->swap, false); #ifdef CONFIG_MEMCG_V1 @@ -5401,6 +5406,82 @@ static int swap_events_show(struct seq_file *m, void= *v) return 0; } =20 +#ifdef CONFIG_SWAP_CGROUP_PRIORITY +static ssize_t swap_cgroup_priority_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg =3D mem_cgroup_from_css(of_css(of)); + u64 id; + int prio; + int ret; + char first_token[32]; + char second_token[32]; + char dummy[2]; + char *stripped_buf; + int num_parsed; + + stripped_buf =3D strstrip(buf); + num_parsed =3D sscanf(stripped_buf, "%31s %31s %1s", first_token, + second_token, dummy); + if (num_parsed =3D=3D 2) { + if (strcmp(first_token, "default") =3D=3D 0) { + if (strcmp(second_token, "none") =3D=3D 0) + ret =3D apply_swap_cgroup_priority( + memcg, DEFAULT_ID, SWAP_PRIORITY_GLOBAL); + else if (strcmp(second_token, "disabled") =3D=3D 0) + ret =3D apply_swap_cgroup_priority( + memcg, DEFAULT_ID, SWAP_PRIORITY_DISABLE); + else + ret =3D -EINVAL; + } else { + ret =3D kstrtoull(first_token, 10, &id); + if (ret) + return -EINVAL; + + if (strcmp(second_token, "none") =3D=3D 0) { + ret =3D apply_swap_cgroup_priority( + memcg, id, SWAP_PRIORITY_GLOBAL); + } else if (strcmp(second_token, "disabled") =3D=3D 0) { + ret =3D apply_swap_cgroup_priority( + memcg, id, SWAP_PRIORITY_DISABLE); + } else { + ret =3D kstrtoint(second_token, 10, &prio); + if (ret) + return -EINVAL; + if (prio =3D=3D -1) + return -EINVAL; + else if (prio > SHRT_MAX || prio < SHRT_MIN) + return -EINVAL; + ret =3D apply_swap_cgroup_priority(memcg, id, + prio); + } + } + } else if (num_parsed =3D=3D 1) { + if (strcmp(first_token, "none") =3D=3D 0) + ret =3D apply_swap_cgroup_priority( + memcg, id, SWAP_PRIORITY_GLOBAL); + else if (strcmp(first_token, "disabled") =3D=3D 0) + ret =3D apply_swap_cgroup_priority( + memcg, id, SWAP_PRIORITY_DISABLE); + else + ret =3D -EINVAL; + } else { + return -EINVAL; + } + + if (ret) + return ret; + + return nbytes; +} + +static int 
+swap_cgroup_priority_show(struct seq_file *m, void *v)
+{
+	show_swap_cgroup_priority(m);
+	return 0;
+}
+#endif
+
 static struct cftype swap_files[] = {
	{
		.name = "swap.current",
@@ -5433,6 +5514,14 @@ static struct cftype swap_files[] = {
		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
		.seq_show = swap_events_show,
	},
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	{
+		.name = "swap.priority",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_cgroup_priority_show,
+		.write = swap_cgroup_priority_write,
+	},
+#endif
	{ }	/* terminate */
 };

diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
new file mode 100644
index 000000000000..abbefa6de63a
--- /dev/null
+++ b/mm/swap_cgroup_priority.c
@@ -0,0 +1,739 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2025 LG Electronics Inc.
+ *
+ * This file is part of the Linux kernel and implements per-cgroup
+ * swap device priority control.
+ *
+ * This feature allows configuring the preference and exclusion of
+ * swap devices on a per-cgroup basis.
+ *
+ * If no configuration is set, the system-wide swap priorities
+ * assigned at swapon time will apply.
+ *
+ * Author: Youngjun Park
+ */
+#include
+#include
+#include
+#include
+#include "swap.h"
+#include "swap_cgroup_priority.h"
+#include "memcontrol-v1.h"
+
+static LIST_HEAD(swap_cgroup_priority_list);
+
+/*
+ * struct swap_cgroup_priority
+ *
+ * This structure is RCU protected. Its lifecycle is determined by its
+ * owning memcg or when its 'distance' reaches zero. The 'distance' field
+ * tracks priority differences from global swap. If zero, and its default_prio
+ * follows global swap priority(SWAP_PRIORITY_GLOBAL), the object is destroyed.
+ *
+ * pnode - Array of pointers to swap device priority nodes.
+ * owner - The owning memory cgroup.
+ * rcu - RCU free callback.
+ * link - Global linked list entry.
+ * least_priority - Current lowest priority.
+ * distance - Priority difference from the global swap priority.
+ * default_prio - Default priority for this cgroup.
+ * plist - Priority list head.
+ */
+struct swap_cgroup_priority {
+	struct swap_cgroup_priority_pnode *pnode[MAX_SWAPFILES];
+	struct mem_cgroup *owner;
+
+	union {
+		struct rcu_head rcu;
+		struct list_head link;
+	};
+
+	int least_priority;
+	s8 distance;
+	int default_prio;
+	struct plist_head plist[];
+};
+
+/*
+ * struct swap_cgroup_priority_pnode
+ *
+ * This structure represents a priority node for a specific swap device
+ * within a cgroup.
+ *
+ * swap - Pointer to the associated swap device.
+ * id - Unique identifier for the swap device.
+ * prio - Configured priority for this device.
+ * avail_lists - Connections to various priority lists.
+ */
+struct swap_cgroup_priority_pnode {
+	struct swap_info_struct *swap;
+	u64 id;
+	signed short prio;
+	struct plist_node avail_lists[];
+};
+
+/*
+ * Even with a zero distance, a swap device isn't assigned if it doesn't
+ * meet global swap priority conditions; thus, we don't clear it.
+ */
+static bool should_clear_swap_cgroup_priority(
+	struct swap_cgroup_priority *swap_priority)
+{
+	WARN_ON_ONCE(swap_priority->distance < 0 ||
+		     swap_priority->distance > MAX_SWAPFILES);
+
+	if (swap_priority->distance == 0 &&
+	    swap_priority->default_prio == SWAP_PRIORITY_GLOBAL)
+		return true;
+
+	return false;
+}
+
+/*
+ * swapdev_id
+ *
+ * A unique identifier for a swap device.
+ *
+ * This ID ensures stable identification for users and crucial
+ * synchronization for swap cgroup priority settings. It provides a
+ * reliable reference even if device paths or numbers change.
+ */
+static atomic64_t swapdev_id_counter;
+
+void get_swapdev_id(struct swap_info_struct *si)
+{
+	si->id = atomic64_inc_return(&swapdev_id_counter);
+}
+
+static struct swap_cgroup_priority *get_swap_cgroup_priority(
+	struct mem_cgroup *memcg)
+{
+	if (!memcg)
+		return NULL;
+
+	return rcu_dereference(memcg->swap_priority);
+}
+
+static struct swap_cgroup_priority_pnode *alloc_swap_cgroup_priority_pnode(
+	gfp_t gfp)
+{
+	struct swap_cgroup_priority_pnode *pnode;
+
+	pnode = kvzalloc(struct_size(pnode, avail_lists, nr_node_ids), gfp);
+
+	return pnode;
+}
+
+static void free_swap_cgroup_priority_pnode(
+	struct swap_cgroup_priority_pnode *pnode)
+{
+	if (pnode)
+		kvfree(pnode);
+}
+
+static void free_swap_cgroup_priority(
+	struct swap_cgroup_priority *swap_priority)
+{
+	for (int i = 0; i < MAX_SWAPFILES; i++)
+		free_swap_cgroup_priority_pnode(swap_priority->pnode[i]);
+
+	kvfree(swap_priority);
+}
+
+static struct swap_cgroup_priority *alloc_swap_cgroup_priority(void)
+{
+	struct swap_cgroup_priority *swap_priority;
+
+	swap_priority = kvzalloc(struct_size(swap_priority, plist, nr_node_ids),
+				 GFP_KERNEL);
+	if (!swap_priority)
+		return NULL;
+
+	/*
+	 * Pre-allocates the pnode array up to nr_swapfiles at init.
+	 * Individual pnodes are assigned on swapon, but not freed
+	 * on swapoff. This avoids complex ref-counting, simplifying
+	 * the structure for swap cgroup priority management.
+	 */
+	for (int i = 0; i < nr_swapfiles; i++) {
+		swap_priority->pnode[i] = alloc_swap_cgroup_priority_pnode(
+			GFP_KERNEL);
+		if (!swap_priority->pnode[i]) {
+			free_swap_cgroup_priority(swap_priority);
+			return NULL;
+		}
+	}
+
+	return swap_priority;
+}
+
+static void rcu_free_swap_cgroup_priority(struct rcu_head *rcu)
+{
+	struct swap_cgroup_priority *swap_priority
+		= container_of(rcu, struct swap_cgroup_priority, rcu);
+
+	free_swap_cgroup_priority(swap_priority);
+}
+
+void show_swap_cgroup_priority(struct seq_file *m)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+	struct swap_cgroup_priority *swap_priority;
+
+	spin_lock(&swap_lock);
+	swap_priority = memcg->swap_priority;
+	if (!swap_priority || swap_priority->owner != memcg) {
+		spin_unlock(&swap_lock);
+		return;
+	}
+
+	if (swap_priority->default_prio != SWAP_PRIORITY_GLOBAL)
+		seq_printf(m, "default disabled\n");
+
+	for (int i = 0; i < nr_swapfiles; i++) {
+		struct swap_info_struct *si = swap_info[i];
+		struct swap_cgroup_priority_pnode *pnode;
+		signed short prio;
+
+		if (!(si->flags & SWP_USED) || !(si->flags & SWP_WRITEOK))
+			continue;
+
+		pnode = swap_priority->pnode[i];
+		if (WARN_ON_ONCE(!pnode))
+			continue;
+
+		prio = pnode->prio;
+		if (prio == si->prio)
+			continue;
+
+		seq_printf(m, "%lld", si->id);
+		if (prio != SWAP_PRIORITY_DISABLE)
+			seq_printf(m, " %d\n", prio);
+		else
+			seq_printf(m, " disabled\n");
+	}
+
+	spin_unlock(&swap_lock);
+}
+
+static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg);
+void purge_swap_cgroup_priority(void)
+{
+	struct swap_cgroup_priority *swap_priority, *tmp;
+
+	spin_lock(&swap_avail_lock);
+	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list,
				 link) {
+		if (should_clear_swap_cgroup_priority(swap_priority))
+			__delete_swap_cgroup_priority(swap_priority->owner);
+	}
+	spin_unlock(&swap_avail_lock);
+}
+
+bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
+				swp_entry_t *entry, int order)
+{
+	struct swap_cgroup_priority *swap_priority;
+	struct swap_cgroup_priority_pnode *pnode, *next;
+	struct swap_info_struct *si;
+	unsigned long offset;
+	int node;
+
+	/* TODO
+	 * Per-cpu swapdev cache can't be used directly as cgroup-specific
+	 * priorities may select different devices.
+	 */
+	spin_lock(&swap_avail_lock);
+	node = numa_node_id();
+
+	swap_priority = get_swap_cgroup_priority(memcg);
+swap_priority_check:
+	if (!swap_priority) {
+		spin_unlock(&swap_avail_lock);
+		return false;
+	}
+
+start_over:
+	plist_for_each_entry_safe(pnode, next, &swap_priority->plist[node],
+				  avail_lists[node]) {
+		si = pnode->swap;
+		plist_requeue(&pnode->avail_lists[node],
+			      &swap_priority->plist[node]);
+		spin_unlock(&swap_avail_lock);
+		if (get_swap_device_info(si)) {
+			offset = cluster_alloc_swap_entry(si, order,
+							  SWAP_HAS_CACHE);
+			put_swap_device(si);
+			if (offset) {
+				*entry = swp_entry(si->type, offset);
+				return true;
+			}
+
+			if (order)
+				return false;
+		}
+
+		spin_lock(&swap_avail_lock);
+		/*
+		 * If 'swap_cgroup_priority' changes while we're holding a lock,
+		 * we must verify its state to ensure memory validness.
+		 */
+		if (memcg->swap_priority != swap_priority)
+			goto swap_priority_check;
+
+		if (plist_node_empty(&next->avail_lists[node]))
+			goto start_over;
+	}
+	spin_unlock(&swap_avail_lock);
+
+	return false;
+}
+
+/* add_to_avail_list (swapon / swap usage > 0) */
+void activate_swap_cgroup_priority(struct swap_info_struct *swp,
+				   bool swapon)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int i;
+
+	list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
+		struct swap_cgroup_priority_pnode *pnode =
+			swap_priority->pnode[swp->type];
+
+		if (WARN_ON_ONCE(!pnode))
+			continue;
+
+		/* Exclude reinsert */
+		if (swapon && pnode->id != swp->id) {
+			pnode->swap = swp;
+			if (swap_priority->default_prio == SWAP_PRIORITY_GLOBAL) {
+				if (swp->prio >= 0)
+					pnode->prio = swp->prio;
+				else
+					pnode->prio =
+						--swap_priority->least_priority;
+			} else {
+				pnode->prio = SWAP_PRIORITY_DISABLE;
+				swap_priority->distance++;
+			}
+		}
+
+		/* NUMA priority handling */
+		for_each_node(i) {
+			if (swapon) {
+				if (pnode->prio < 0 && swap_node(swp) == i) {
+					plist_node_init(
+						&pnode->avail_lists[i],
+						1);
+				} else {
+					plist_node_init(
+						&pnode->avail_lists[i],
+						-pnode->prio);
+				}
+			}
+
+			if (pnode->prio != SWAP_PRIORITY_DISABLE)
+				plist_add(&pnode->avail_lists[i],
+					  &swap_priority->plist[i]);
+		}
+	}
+}
+
+/* del_from_avail_list (swapoff / swap usage <= 0) */
+void deactivate_swap_cgroup_priority(struct swap_info_struct *swp,
+				     bool swapoff)
+{
+	struct swap_cgroup_priority *swap_priority, *tmp;
+	int nid, i;
+
+	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list,
+				 link) {
+		struct swap_cgroup_priority_pnode *pnode =
+			swap_priority->pnode[swp->type];
+
+		if (WARN_ON_ONCE(!pnode))
+			continue;
+
+		if (swapoff) {
+			if (pnode->prio != swp->prio)
+				swap_priority->distance--;
+		}
+
+		if (pnode->prio == SWAP_PRIORITY_DISABLE)
+			continue;
+
+		if (swapoff && pnode->prio < 0) {
+			struct swap_cgroup_priority_pnode *p;
+
+			/*
+			 * NUMA priority handling:
+			 * mimic swapoff prio adjustment without plist.
+			 */
+			for (int i = 0; i < nr_swapfiles; i++) {
+				p = swap_priority->pnode[i];
+				if (!p || p->prio > pnode->prio ||
+				    p->swap == swp)
+					continue;
+				p->prio++;
+				for_each_node(nid) {
+					if (p->avail_lists[nid].prio != 1)
+						p->avail_lists[nid].prio--;
+				}
+			}
+
+			swap_priority->least_priority++;
+		}
+
+		for_each_node(i)
+			plist_del(&pnode->avail_lists[i],
+				  &swap_priority->plist[i]);
+	}
+}
+
+static void apply_swap_cgroup_priority_pnode(
+	struct swap_cgroup_priority *swap_priority,
+	struct swap_cgroup_priority_pnode *pnode,
+	int prio,
+	bool clear)
+{
+	int nid;
+
+	if (clear && pnode->prio != SWAP_PRIORITY_DISABLE) {
+		for_each_node(nid) {
+			plist_del(&pnode->avail_lists[nid],
+				  &swap_priority->plist[nid]);
+		}
+	}
+
+	if (pnode->swap->prio != prio && pnode->swap->prio == pnode->prio)
+		swap_priority->distance++;
+	else if (pnode->swap->prio == prio && pnode->swap->prio != pnode->prio)
+		swap_priority->distance--;
+
+	pnode->prio = prio;
+	for_each_node(nid) {
+		if (pnode->prio >= 0) {
+			plist_node_init(&pnode->avail_lists[nid],
+					-pnode->prio);
+		} else {
+			if (swap_node(pnode->swap) == nid)
+				plist_node_init(
+					&pnode->avail_lists[nid],
+					1);
+			else
+				plist_node_init(
+					&pnode->avail_lists[nid],
+					-pnode->prio);
+		}
+
+		/*
+		 * Check SWP_WRITEOK for skipping
+		 * 1. reinsert case when swapoff fails
+		 * 2. on-going swapon before adding to the avail list
+		 */
+		if (pnode->prio != SWAP_PRIORITY_DISABLE &&
+		    (pnode->swap->flags & SWP_WRITEOK))
+			plist_add(&pnode->avail_lists[nid],
+				  &swap_priority->plist[nid]);
+	}
+}
+
+static int __apply_swap_cgroup_priority(
+	struct swap_cgroup_priority *swap_priority, u64 id, int prio, bool new)
+{
+	struct swap_cgroup_priority_pnode *pnode;
+	struct swap_info_struct *si;
+	int old_prio;
+	int type;
+
+	if (new)
+		swap_priority->least_priority = least_priority;
+
+	if (id == DEFAULT_ID) {
+		swap_priority->default_prio = prio;
+		if (new)
+			goto assign_prio;
+
+		goto out;
+	}
+
+	for (type = 0; type < nr_swapfiles; type++) {
+		si = swap_info[type];
+		if (id == si->id)
+			break;
+		si = NULL;
+	}
+
+	if (!si)
+		return -EIO;
+
+	if (!(si->flags & SWP_USED) || !(si->flags & SWP_WRITEOK))
+		return -EFAULT;
+
+	if (si->id != id)
+		return -EINVAL;
+
+	if (prio == SWAP_PRIORITY_GLOBAL)
+		prio = si->prio;
+
+	pnode = swap_priority->pnode[type];
+	/* Assigning the same priority has no effect. */
+	if (!new && pnode && pnode->prio == prio)
+		return 0;
+	else if (new && si->prio == prio)
+		return 0;
+
+	if (new) {
+		pnode->id = id;
+		pnode->swap = si;
+		pnode->prio = si->prio;
+	}
+	old_prio = pnode->prio;
+
+	/*
+	 * When a new negative priority is added, least_priority decreases.
+	 * When a negative priority is deleted, least_priority increases.
+	 */
+	if (prio < SWAP_PRIORITY_DISABLE && old_prio >= SWAP_PRIORITY_DISABLE)
+		swap_priority->least_priority--;
+	else if (prio >= SWAP_PRIORITY_DISABLE &&
+		 old_prio < SWAP_PRIORITY_DISABLE)
+		swap_priority->least_priority++;
+
+	if (prio < swap_priority->least_priority)
+		prio = swap_priority->least_priority;
+
+	apply_swap_cgroup_priority_pnode(swap_priority, pnode, prio, !new);
+
+	/*
+	 * This logic adjusts priorities according to the global swap on/off
+	 * rule. Priorities at or above SWAP_PRIORITY_DISABLE don't affect
+	 * other swap device priorities. However, negative priorities below
+	 * this threshold influence each other based on their values.
+	 * Adjustments are made if a swap device's priority becomes negative
+	 * and starts influencing others, or if it moves out of the negative
+	 * range and stops influencing them.
+	 */
+assign_prio:
+	for (int i = 0; i < nr_swapfiles; i++) {
+		int changed_prio;
+
+		si = swap_info[i];
+		/*
+		 * nr_swapfiles may have increased after the initial alloc
+		 * because swap_lock was not held.
+		 */
+		if (!(pnode = swap_priority->pnode[si->type])) {
+			pnode = alloc_swap_cgroup_priority_pnode(GFP_ATOMIC);
+			if (!pnode)
+				return -ENOMEM;
+			swap_priority->pnode[si->type] = pnode;
+		}
+
+		/*
+		 * Does not check SWP_WRITEOK; the device could be reinserted.
+		 * Ensure si->swap_map is valid before proceeding. This
+		 * prevents missed swapon failures where the SWP_USED state
+		 * persists unexpectedly.
+		 */
+		if (!(si->flags & SWP_USED) || !si->swap_map)
+			continue;
+
+		if (si->id == id)
+			continue;
+
+		if (si->id != pnode->id) {
+			pnode->id = si->id;
+			pnode->prio = si->prio;
+			pnode->swap = si;
+		}
+
+		changed_prio = pnode->prio;
+
+		/*
+		 * A new negative value is added,
+		 * so all values lower than it are shifted backward by one.
+		 */
+		if (old_prio >= SWAP_PRIORITY_DISABLE &&
+		    prio < SWAP_PRIORITY_DISABLE &&
+		    (pnode->prio < SWAP_PRIORITY_DISABLE &&
+		     pnode->prio <= prio)) {
+			changed_prio--;
+		/*
+		 * One negative value is removed,
+		 * so all higher values are shifted forward by one.
+		 */
+		} else if (old_prio < SWAP_PRIORITY_DISABLE &&
+			   prio >= SWAP_PRIORITY_DISABLE &&
+			   (pnode->prio < SWAP_PRIORITY_DISABLE &&
+			    pnode->prio <= old_prio)) {
+			changed_prio++;
+		} else if (old_prio < SWAP_PRIORITY_DISABLE &&
+			   prio < SWAP_PRIORITY_DISABLE) {
+			/*
+			 * If it was already negative but becomes smaller,
+			 * shift all values in range backward by one.
+			 */
+			if (old_prio > prio &&
+			    (prio <= pnode->prio && old_prio >= pnode->prio)) {
+				changed_prio++;
+			/*
+			 * If it was already negative but becomes larger,
+			 * shift all values in range forward by one.
+			 */
+			} else if (old_prio < prio &&
+				   (prio >= pnode->prio &&
+				    old_prio <= pnode->prio)) {
+				changed_prio--;
+			}
+		}
+
+		if (!new && changed_prio == pnode->prio)
+			continue;
+
+		apply_swap_cgroup_priority_pnode(
+			swap_priority, pnode, changed_prio, !new);
+	}
+
+out:
+	if (should_clear_swap_cgroup_priority(swap_priority))
+		return 1;
+
+	return 0;
+}
+
+int prepare_swap_cgroup_priority(int type)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int err = 0;
+
+	spin_lock(&swap_avail_lock);
+	list_for_each_entry_rcu(swap_priority,
+				&swap_cgroup_priority_list, link) {
+		if (!swap_priority->pnode[type]) {
+			swap_priority->pnode[type] =
+				alloc_swap_cgroup_priority_pnode(GFP_ATOMIC);
+
+			if (!swap_priority->pnode[type]) {
+				err = -ENOMEM;
+				break;
+			}
+		}
+
+	}
+	spin_unlock(&swap_avail_lock);
+
+	return err;
+}
+
+int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int nid;
+	bool new = false;
+	int err = 0;
+
+	rcu_read_lock();
+	swap_priority = rcu_dereference(memcg->swap_priority);
+	if (swap_priority && swap_priority->owner == memcg) {
+		rcu_read_unlock();
+		goto prio_set;
+	}
+	rcu_read_unlock();
+	new = true;
+
+	/* No need to define "global swap priority" for a new cgroup. */
+	if (new && prio == SWAP_PRIORITY_GLOBAL)
+		return 0;
+
+	swap_priority = alloc_swap_cgroup_priority();
+	if (!swap_priority)
+		return -ENOMEM;
+
+	/* Just initialized; may change in __apply_swap_cgroup_priority(). */
+	swap_priority->default_prio = SWAP_PRIORITY_GLOBAL;
+	INIT_LIST_HEAD(&swap_priority->link);
+	for_each_node(nid)
+		plist_head_init(&swap_priority->plist[nid]);
+
+prio_set:
+	spin_lock(&swap_lock);
+	spin_lock(&swap_avail_lock);
+
+	/* Simultaneous calls to the same interface. */
+	if (new && memcg->swap_priority &&
+	    memcg->swap_priority->owner == memcg) {
+		new = false;
+		free_swap_cgroup_priority(swap_priority);
+		swap_priority = memcg->swap_priority;
+	}
+
+	err = __apply_swap_cgroup_priority(swap_priority, id, prio, new);
+	if (err) {
+		/*
+		 * The difference with the global swap priority is now zero.
+		 * Remove the swap priority.
+		 */
+		if (err == 1) {
+			err = 0;
+			__delete_swap_cgroup_priority(memcg);
+		}
+
+		goto error_locked;
+	}
+
+	if (new) {
+		swap_priority->owner = memcg;
+		list_add_rcu(&swap_priority->link, &swap_cgroup_priority_list);
+		memcg->swap_priority = swap_priority;
+
+		for (int i = 0; i < nr_swapfiles; i++) {
+			if (!swap_priority->pnode[i]->swap) {
+				free_swap_cgroup_priority_pnode(
+					swap_priority->pnode[i]);
+				swap_priority->pnode[i] = NULL;
+			}
+		}
+	}
+
+	spin_unlock(&swap_avail_lock);
+	spin_unlock(&swap_lock);

+	return 0;
+
+error_locked:
+	spin_unlock(&swap_avail_lock);
+	spin_unlock(&swap_lock);
+
+	if (!new)
+		return err;
+
+	free_swap_cgroup_priority(swap_priority);
+	return err;
+}
+
+static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg)
+{
+	struct swap_cgroup_priority *swap_priority = memcg->swap_priority;
+
+	lockdep_assert_held(&swap_avail_lock);
+
+	if (!swap_priority)
+		return;
+
+	/* If using a cached swap_priority, there is no need to remove it. */
+	if (swap_priority->owner != memcg)
+		return;
+
+	rcu_assign_pointer(memcg->swap_priority, NULL);
+	list_del_rcu(&swap_priority->link);
+	call_rcu(&swap_priority->rcu, rcu_free_swap_cgroup_priority);
+}
+
+void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
+{
+	spin_lock(&swap_avail_lock);
+	__delete_swap_cgroup_priority(memcg);
+	spin_unlock(&swap_avail_lock);
+}
diff --git a/mm/swap_cgroup_priority.h b/mm/swap_cgroup_priority.h
new file mode 100644
index 000000000000..253e95623270
--- /dev/null
+++ b/mm/swap_cgroup_priority.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_CGROUP_PRIORITY_H
+#define _SWAP_CGROUP_PRIORITY_H
+#include
+
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+#include
+/*
+ * A priority of -1 is not assigned to global swap entries,
+ * based on the kernel's specific negative priority assignment rules.
+ */
+#define SWAP_PRIORITY_DISABLE	-1
+/*
+ * (SHRT_MAX + 1) exceeds the maximum 'prio' value for a signed short.
+ * This marks it as an invalid or special priority state, not for
+ * standard use.
+ */
+#define SWAP_PRIORITY_GLOBAL	(SHRT_MAX + 1)
+/*
+ * ID 0 is reserved/unused in kernel swap management, allowing its use
+ * for special internal states or flags, as swap IDs typically start
+ * from 1.
+ */
+#define DEFAULT_ID	0
+
+/* linux/mm/swapfile.c */
+extern spinlock_t swap_lock;
+extern int least_priority;
+extern unsigned int nr_swapfiles;
+extern spinlock_t swap_avail_lock;
+extern struct swap_info_struct *swap_info[MAX_SWAPFILES];
+int swap_node(struct swap_info_struct *si);
+unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+				       unsigned char usage);
+bool get_swap_device_info(struct swap_info_struct *si);
+
+/* linux/mm/swap_cgroup_priority.c */
+int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio);
+void activate_swap_cgroup_priority(struct swap_info_struct *swp, bool swapon);
+void deactivate_swap_cgroup_priority(struct swap_info_struct *swp,
+				     bool swapoff);
+int prepare_swap_cgroup_priority(int type);
+void show_swap_cgroup_priority(struct seq_file *m);
+void get_swapdev_id(struct swap_info_struct *si);
+void purge_swap_cgroup_priority(void);
+bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg, swp_entry_t *entry,
+				int order);
+void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
+#else
+int swap_node(struct swap_info_struct *si);
+unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+				       unsigned char usage);
+bool get_swap_device_info(struct swap_info_struct *si);
+
+static inline int apply_swap_cgroup_priority(struct mem_cgroup *memcg,
+					     u64 id, int prio)
+{
+	return 0;
+}
+static inline void activate_swap_cgroup_priority(struct swap_info_struct *swp,
+						 bool swapon)
+{
+}
+static inline void deactivate_swap_cgroup_priority(struct swap_info_struct *swp,
+						   bool swapoff)
+{
+}
+static inline int prepare_swap_cgroup_priority(int type)
+{
+	return 0;
+}
+static inline void get_swapdev_id(struct swap_info_struct *si)
+{
+}
+static inline void purge_swap_cgroup_priority(void)
+{
+}
+static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
+					      swp_entry_t *entry, int order)
+{
+	return false;
+}
+static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
+{
+}
+#endif
+#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 68ce283e84be..4b56f117b2b0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -48,6 +48,7 @@
 #include
 #include "internal.h"
 #include "swap.h"
+#include "swap_cgroup_priority.h"

 static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
				 unsigned char);
@@ -62,8 +63,8 @@ static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
					      unsigned long offset);
 static inline void unlock_cluster(struct swap_cluster_info *ci);

-static DEFINE_SPINLOCK(swap_lock);
-static unsigned int nr_swapfiles;
+DEFINE_SPINLOCK(swap_lock);
+unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
 /*
  * Some modules use swappable objects and may try to swap them out under
@@ -73,7 +74,7 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-static int least_priority = -1;
+int least_priority = -1;
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -103,9 +104,9 @@ static PLIST_HEAD(swap_active_head);
  * before any swap_info_struct->lock.
  */
 static struct plist_head *swap_avail_heads;
-static DEFINE_SPINLOCK(swap_avail_lock);
+DEFINE_SPINLOCK(swap_avail_lock);

-static struct swap_info_struct *swap_info[MAX_SWAPFILES];
+struct swap_info_struct *swap_info[MAX_SWAPFILES];

 static DEFINE_MUTEX(swapon_mutex);

@@ -878,7 +879,7 @@ static void swap_reclaim_work(struct work_struct *work)
  * Try to allocate swap entries with specified order and try set a new
  * cluster for current CPU too.
  */
-static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
					      unsigned char usage)
 {
	struct swap_cluster_info *ci;
@@ -1156,7 +1157,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
	swap_usage_sub(si, nr_entries);
 }

-static bool get_swap_device_info(struct swap_info_struct *si)
+bool get_swap_device_info(struct swap_info_struct *si)
 {
	if (!percpu_ref_tryget_live(&si->users))
		return false;
@@ -2536,7 +2537,7 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
	return generic_swapfile_activate(sis, swap_file, span);
 }

-static int swap_node(struct swap_info_struct *si)
+int swap_node(struct swap_info_struct *si)
 {
	struct block_device *bdev;

--
2.34.1

From nobody Mon Oct 6 21:02:11 2025
From: Youngjun Park <youngjun.park@lge.com>
To: akpm@linux-foundation.org, hannes@cmpxchg.org
Cc: mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev, shikemeng@huaweicloud.com, kasong@tencent.com,
	nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	gunho.lee@lge.com, iamjoonsoo.kim@lge.com, taejoon.song@lge.com
Subject: [PATCH 2/4] mm: swap: Apply per-cgroup swap priority mechanism to swap layer
Date: Thu, 17 Jul 2025 05:20:04 +0900
Message-Id: <20250716202006.3640584-3-youngjun.park@lge.com>
In-Reply-To: <20250716202006.3640584-1-youngjun.park@lge.com>
References: <20250716202006.3640584-1-youngjun.park@lge.com>

This patch applies the per-cgroup swap priority mechanism to the swap
layer. It implements:

- Swap device ID assignment based on the cgroup's effective priority
- Swap device selection respecting cgroup-specific priorities
- Swap on/off propagation logic that updates per-cgroup settings
  accordingly

Currently, the per-CPU swap cluster cache is bypassed, since different
cgroups may select different devices based on their configured
priorities.
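With this patch, /proc/swaps gains an Id column under
CONFIG_SWAP_CGROUP_PRIORITY. A hypothetical transcript (device paths,
sizes, usage, priorities, and IDs below are illustrative only):

  # cat /proc/swaps
  Filename                                Type            Size    Used    Priority        Id
  /dev/zram0                              partition       524284  1024    100             1
  /swapfile                               file            524284  0       -2              2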
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 mm/swap_cgroup_priority.c |  6 ++---
 mm/swapfile.c             | 46 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
index abbefa6de63a..979bc18d2eed 100644
--- a/mm/swap_cgroup_priority.c
+++ b/mm/swap_cgroup_priority.c
@@ -243,9 +243,9 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
	unsigned long offset;
	int node;

-	/* TODO
-	 * Per-cpu swapdev cache can't be used directly as cgroup-specific
-	 * priorities may select different devices.
+	/*
+	 * TODO: Per-cpu swap cluster cache can't be used directly
+	 * as cgroup-specific priorities may select different devices.
	 */
	spin_lock(&swap_avail_lock);
	node = numa_node_id();
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4b56f117b2b0..bfd0532ad250 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1029,6 +1029,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
	for_each_node(nid)
		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);

+	deactivate_swap_cgroup_priority(si, swapoff);
 skip:
	spin_unlock(&swap_avail_lock);
 }
@@ -1072,6 +1073,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
	for_each_node(nid)
		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);

+	activate_swap_cgroup_priority(si, swapon);
 skip:
	spin_unlock(&swap_avail_lock);
 }
@@ -1292,8 +1294,10 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
	}

	local_lock(&percpu_swap_cluster.lock);
-	if (!swap_alloc_fast(&entry, order))
-		swap_alloc_slow(&entry, order);
+	if (!swap_alloc_cgroup_priority(folio_memcg(folio), &entry, order)) {
+		if (!swap_alloc_fast(&entry, order))
+			swap_alloc_slow(&entry, order);
+	}
	local_unlock(&percpu_swap_cluster.lock);

	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL.
	 */
@@ -2778,6 +2782,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
	if (!p->bdev || !bdev_nonrot(p->bdev))
		atomic_dec(&nr_rotate_swap);

+	purge_swap_cgroup_priority();
	mutex_lock(&swapon_mutex);
	spin_lock(&swap_lock);
	spin_lock(&p->lock);
@@ -2895,6 +2900,8 @@ static void swap_stop(struct seq_file *swap, void *v)
	mutex_unlock(&swapon_mutex);
 }

+
+#ifndef CONFIG_SWAP_CGROUP_PRIORITY
 static int swap_show(struct seq_file *swap, void *v)
 {
	struct swap_info_struct *si = v;
@@ -2921,6 +2928,34 @@ static int swap_show(struct seq_file *swap, void *v)
		   si->prio);
	return 0;
 }
+#else
+static int swap_show(struct seq_file *swap, void *v)
+{
+	struct swap_info_struct *si = v;
+	struct file *file;
+	int len;
+	unsigned long bytes, inuse;
+
+	if (si == SEQ_START_TOKEN) {
+		seq_puts(swap, "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\t\tId\n");
+		return 0;
+	}
+
+	bytes = K(si->pages);
+	inuse = K(swap_usage_in_pages(si));
+
+	file = si->swap_file;
+	len = seq_file_path(swap, file, " \t\n\\");
+	seq_printf(swap, "%*s%s\t%lu\t%s%lu\t%s%d\t\t\t%llu\n",
+		   len < 40 ? 40 - len : 1, " ",
+		   S_ISBLK(file_inode(file)->i_mode) ?
+			"partition" : "file\t",
+		   bytes, bytes < 10000000 ? "\t" : "",
+		   inuse, inuse < 10000000 ? "\t" : "",
+		   si->prio, si->id);
+	return 0;
+}
+#endif

 static const struct seq_operations swaps_op = {
	.start	= swap_start,
@@ -3463,6 +3498,13 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
		goto free_swap_zswap;
	}

+	error = prepare_swap_cgroup_priority(si->type);
+	if (error) {
+		inode->i_flags &= ~S_SWAPFILE;
+		goto free_swap_zswap;
+	}
+	get_swapdev_id(si);
+
	mutex_lock(&swapon_mutex);
	prio = -1;
	if (swap_flags & SWAP_FLAG_PREFER)
--
2.34.1

From nobody Mon Oct 6 21:02:11 2025
From: Youngjun Park <youngjun.park@lge.com>
To: akpm@linux-foundation.org, hannes@cmpxchg.org
Cc: mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev, shikemeng@huaweicloud.com, kasong@tencent.com,
	nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	gunho.lee@lge.com, iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
	Michal Koutný <mkoutny@suse.com>
Subject: [PATCH 3/4] mm: memcg: Add swap cgroup priority inheritance mechanism
Date: Thu, 17 Jul 2025 05:20:05 +0900
Message-Id: <20250716202006.3640584-4-youngjun.park@lge.com>
In-Reply-To: <20250716202006.3640584-1-youngjun.park@lge.com>
References: <20250716202006.3640584-1-youngjun.park@lge.com>

This patch introduces inheritance semantics for swap cgroup priorities.

Each cgroup can configure its own swap priority via the
memory.swap.priority interface. However, the effective priority is
determined by walking up the cgroup hierarchy and applying the highest
ancestor's configured value. If no ancestor has a configured value, the
cgroup's own setting is used. If neither is present, it falls back to
the global swap configuration.

To make inheritance visible to userspace, this patch introduces the
memory.swap.priority.effective interface.
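A rough sketch of the intended behavior (cgroup names, device IDs, and
priority values below are hypothetical):

  # echo "1 100" > /sys/fs/cgroup/parent/memory.swap.priority
  # cat /sys/fs/cgroup/parent/child/memory.swap.priority.effective
  1 100
  2 -2
  # device 2 is unconfigured, so its global priority still applies;
  # the child inherits the topmost configured ancestor's settings.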
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  14 ++
 mm/memcontrol.c                         |  14 ++
 mm/swap_cgroup_priority.c               | 203 ++++++++++++++++++++----
 mm/swap_cgroup_priority.h               |   3 +
 4 files changed, 207 insertions(+), 27 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 35fb9677f0d6..ae6a0c809db4 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1901,6 +1901,20 @@ The following nested keys are defined.
	    other negative priorities to restore the same ordering as the
	    global swap configuration.

+  memory.swap.priority.effective
+	A read-only file showing the effective swap priority ordering
+	actually applied to this cgroup, after resolving inheritance
+	from ancestors. The effective swap priority for a cgroup is
+	also influenced by its position within the cgroup hierarchy. If
+	any ancestor cgroup has set a swap priority configuration, it is
+	propagated and inherited by all descendants. In that case, the
+	child's own configuration is ignored and the topmost configured
+	ancestor determines the effective priority ordering.
+
+	If there is no configuration in the current cgroup and its
+	ancestors, this file shows the global swap device priority from
+	`swapon`, in the form of id and priority pairs.
+
   memory.zswap.current
	A read-only single value file which exists on non-root cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ea207d498ad6..4a0762060f99 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3806,6 +3806,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
	if (parent) {
		WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+		memcg->swap_priority = inherit_swap_cgroup_priority(parent);
+#endif
		page_counter_init(&memcg->memory, &parent->memory, memcg_on_dfl);
		page_counter_init(&memcg->swap, &parent->swap, false);
 #ifdef CONFIG_MEMCG_V1
@@ -5480,6 +5483,12 @@ static int swap_cgroup_priority_show(struct seq_file *m, void *v)
	show_swap_cgroup_priority(m);
	return 0;
 }
+
+static int swap_cgroup_priority_effective_show(struct seq_file *m, void *v)
+{
+	show_swap_cgroup_priority_effective(m);
+	return 0;
+}
 #endif

 static struct cftype swap_files[] = {
@@ -5521,6 +5530,11 @@ static struct cftype swap_files[] = {
		.seq_show = swap_cgroup_priority_show,
		.write = swap_cgroup_priority_write,
	},
+	{
+		.name = "swap.priority.effective",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_cgroup_priority_effective_show,
+	},
 #endif
	{ }	/* terminate */
 };
diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
index 979bc18d2eed..84e876b77f01 100644
--- a/mm/swap_cgroup_priority.c
+++ b/mm/swap_cgroup_priority.c
@@ -21,6 +21,7 @@
 #include "swap_cgroup_priority.h"
 #include "memcontrol-v1.h"

+static DEFINE_MUTEX(swap_cgroup_priority_inherit_lck);
 static LIST_HEAD(swap_cgroup_priority_list);

 /*
@@ -31,6 +32,16 @@ static LIST_HEAD(swap_cgroup_priority_list);
  * tracks priority differences from global swap. If zero, and its default_prio
  * follows global swap priority(SWAP_PRIORITY_GLOBAL), the object is destroyed.
  *
+ * Child cgroups hold direct pointers to this object for fast access.
+ * No reference counting is needed, as the owner's teardown or zero
+ * distance directly implies this object's destruction.
+ *
+ * A child cgroup that has its own effective swap_cgroup_priority uses
+ * the 'effective' field to point to the top-most cgroup's relevant
+ * swap_cgroup_priority object that it should inherit. Changes in the
+ * parent cgroup's swap priority are appropriately propagated downwards.
+ *
+ * effective - Actual effective swap cgroup priority.
  * pnode - Array of pointers to swap device priority nodes.
  * owner - The owning memory cgroup.
  * rcu - RCU free callback.
@@ -41,6 +52,7 @@ static LIST_HEAD(swap_cgroup_priority_list);
  * plist - Priority list head.
  */
 struct swap_cgroup_priority {
+	struct swap_cgroup_priority *effective;
	struct swap_cgroup_priority_pnode *pnode[MAX_SWAPFILES];
	struct mem_cgroup *owner;

@@ -106,13 +118,38 @@ void get_swapdev_id(struct swap_info_struct *si)
	si->id = atomic64_inc_return(&swapdev_id_counter);
 }

-static struct swap_cgroup_priority *get_swap_cgroup_priority(
+static struct swap_cgroup_priority *get_effective_swap_cgroup_priority(
	struct mem_cgroup *memcg)
 {
+	struct swap_cgroup_priority *swap_priority;
+
	if (!memcg)
		return NULL;

-	return rcu_dereference(memcg->swap_priority);
+	swap_priority = memcg->swap_priority;
+	if (!swap_priority)
+		return NULL;
+
+	return swap_priority->effective;
+}
+
+static bool validate_effective_swap_cgroup_priority(
+	struct mem_cgroup *memcg,
+	struct swap_cgroup_priority **swap_priority)
+{
+	struct swap_cgroup_priority *target = memcg->swap_priority;
+
+	if (!target) {
+		*swap_priority = NULL;
+		return false;
+	}
+
+	target = target->effective;
+	if (target != *swap_priority) {
+		*swap_priority = target;
+		return false;
+	}
+
+	return true;
 }

 static struct swap_cgroup_priority_pnode *alloc_swap_cgroup_priority_pnode(
@@ -182,10 +219,13 @@ void show_swap_cgroup_priority(struct seq_file *m)
	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
	struct swap_cgroup_priority *swap_priority;

+	mutex_lock(&swap_cgroup_priority_inherit_lck);
	spin_lock(&swap_lock);
+
	swap_priority = memcg->swap_priority;
	if (!swap_priority || swap_priority->owner != memcg) {
		spin_unlock(&swap_lock);
+		mutex_unlock(&swap_cgroup_priority_inherit_lck);
		return;
	}

@@ -217,6 +257,47 @@ void show_swap_cgroup_priority(struct seq_file *m)
	}

	spin_unlock(&swap_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
+}
+
+void show_swap_cgroup_priority_effective(struct seq_file *m)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+	struct swap_cgroup_priority *swap_priority;
+
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
+	spin_lock(&swap_lock);
+
+	swap_priority = get_effective_swap_cgroup_priority(memcg);
+	if (swap_priority && swap_priority->default_prio != SWAP_PRIORITY_GLOBAL)
+		seq_printf(m, "default disabled\n");
+
+	for (int i = 0; i < nr_swapfiles; i++) {
+		struct swap_info_struct *si = swap_info[i];
+		struct swap_cgroup_priority_pnode *pnode;
+		signed short prio;
+
+		if (!(si->flags & SWP_USED) || !(si->flags & SWP_WRITEOK))
+			continue;
+
+		seq_printf(m, "%lld", si->id);
+		if (!swap_priority) {
+			seq_printf(m, " %d\n", si->prio);
+			continue;
+		}
+
+		pnode = swap_priority->pnode[i];
+		if (WARN_ON(!pnode))
+			continue;
+
+		prio = pnode->prio;
+		if (prio != SWAP_PRIORITY_DISABLE)
+			seq_printf(m, " %d\n", prio);
+		else
+			seq_printf(m, " disabled\n");
+	}
+	spin_unlock(&swap_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 }

 static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg);
@@ -224,6 +305,7 @@ void purge_swap_cgroup_priority(void)
 {
	struct swap_cgroup_priority *swap_priority, *tmp;

+	mutex_lock(&swap_cgroup_priority_inherit_lck);
	spin_lock(&swap_avail_lock);
	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list,
				 link) {
@@ -232,6 +314,7 @@ void purge_swap_cgroup_priority(void)
			__delete_swap_cgroup_priority(swap_priority->owner);
	}
	spin_unlock(&swap_avail_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 }

 bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
@@ -250,7 +333,7 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
	spin_lock(&swap_avail_lock);
	node = numa_node_id();

-	swap_priority = get_swap_cgroup_priority(memcg);
+	swap_priority = get_effective_swap_cgroup_priority(memcg);
 swap_priority_check:
	if (!swap_priority) {
		spin_unlock(&swap_avail_lock);
@@ -282,7 +365,8 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
	 * If 'swap_cgroup_priority' changes while we're holding a lock,
	 * we must verify its state to ensure memory validness.
	 */
-	if (memcg->swap_priority != swap_priority)
+	if (!validate_effective_swap_cgroup_priority(memcg,
+						     &swap_priority))
		goto swap_priority_check;

	if (plist_node_empty(&next->avail_lists[node]))
@@ -350,7 +434,7 @@ void deactivate_swap_cgroup_priority(struct swap_info_struct *swp,
	struct swap_cgroup_priority *swap_priority, *tmp;
	int nid, i;

-	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list,
+	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list,
				 link) {
		struct swap_cgroup_priority_pnode *pnode =
			swap_priority->pnode[swp->type];
@@ -603,17 +687,57 @@ static int __apply_swap_cgroup_priority(
	return 0;
 }

+/*
+ * If this is the top-level swap_cgroup_priority, propagation is needed.
+ * We traverse the 'mem_cgroup_tree' using 'for_each_mem_cgroup_tree'.
+ * Due to its pre-order traversal, after propagating changes in the parent,
+ * subsequent child nodes can correctly retrieve the parent's effective
+ * swap_cgroup_priority, ensuring proper propagation.
+ */
+static void propagate_swap_cgroup_priority(
+	struct mem_cgroup *memcg,
+	struct swap_cgroup_priority *swap_priority)
+{
+	struct mem_cgroup *iter;
+
+	iter = parent_mem_cgroup(memcg);
+	while (iter) {
+		if (iter->swap_priority)
+			return;
+		iter = parent_mem_cgroup(iter);
+	}
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		if (iter == memcg)
+			continue;
+
+		if (iter->swap_priority &&
+		    iter->swap_priority->owner == iter) {
+			rcu_assign_pointer(iter->swap_priority->effective,
+					   swap_priority ?
+					   swap_priority : iter->swap_priority);
+		} else {
+			struct swap_cgroup_priority *effective =
+				get_effective_swap_cgroup_priority(
+					parent_mem_cgroup(iter));
+
+			iter->swap_priority = effective;
+		}
+	}
+}
+
 int prepare_swap_cgroup_priority(int type)
 {
 	struct swap_cgroup_priority *swap_priority;
 	int err = 0;
 
-	spin_lock(&swap_avail_lock);
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
 	list_for_each_entry_rcu(swap_priority, &swap_cgroup_priority_list,
 				link) {
 		if (!swap_priority->pnode[type]) {
 			swap_priority->pnode[type] =
-				alloc_swap_cgroup_priority_pnode(GFP_ATOMIC);
+				alloc_swap_cgroup_priority_pnode(GFP_KERNEL);
 
 			if (!swap_priority->pnode[type]) {
 				err = -ENOMEM;
@@ -622,11 +746,23 @@ int prepare_swap_cgroup_priority(int type)
 		}
 
 	}
-	spin_unlock(&swap_avail_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 
 	return err;
 }
 
+struct swap_cgroup_priority *inherit_swap_cgroup_priority(
+	struct mem_cgroup *parent)
+{
+	struct swap_cgroup_priority *swap_priority;
+
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
+	swap_priority = get_effective_swap_cgroup_priority(parent);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
+
+	return swap_priority;
+}
+
 int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 {
 	struct swap_cgroup_priority *swap_priority;
@@ -634,22 +770,24 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 	bool new = false;
 	int err = 0;
 
-	rcu_read_lock();
-	swap_priority = rcu_dereference(memcg->swap_priority);
-	if (swap_priority && swap_priority->owner == memcg) {
-		rcu_read_unlock();
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
+	swap_priority = memcg->swap_priority;
+	if (swap_priority && swap_priority->owner == memcg)
 		goto prio_set;
-	}
-	rcu_read_unlock();
+	new = true;
 
 	/* No need to define "global swap priority" for a new cgroup. */
-	if (new && prio == SWAP_PRIORITY_GLOBAL)
+	if (new && prio == SWAP_PRIORITY_GLOBAL) {
+		mutex_unlock(&swap_cgroup_priority_inherit_lck);
 		return 0;
+	}
 
 	swap_priority = alloc_swap_cgroup_priority();
-	if (!swap_priority)
+	if (!swap_priority) {
+		mutex_unlock(&swap_cgroup_priority_inherit_lck);
 		return -ENOMEM;
+	}
 
 	/* Just initialize; may be changed by __apply_swap_cgroup_priority(). */
 	swap_priority->default_prio = SWAP_PRIORITY_GLOBAL;
@@ -661,23 +799,17 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 	spin_lock(&swap_lock);
 	spin_lock(&swap_avail_lock);
 
-	/* Simultaneous calls to the same interface. */
-	if (new && memcg->swap_priority &&
-	    memcg->swap_priority->owner == memcg) {
-		new = false;
-		free_swap_cgroup_priority(swap_priority);
-		swap_priority = memcg->swap_priority;
-	}
-
 	err = __apply_swap_cgroup_priority(swap_priority, id, prio, new);
 	if (err) {
 		/*
 		 * The difference with the global swap priority is now zero.
-		 * Remove the swap priority.
+		 * Remove the swap priority, and propagate if needed.
 		 */
 		if (err == 1) {
 			err = 0;
 			__delete_swap_cgroup_priority(memcg);
+			if (swap_priority != swap_priority->effective)
+				memcg->swap_priority = swap_priority->effective;
 		}
 
 		goto error_locked;
@@ -686,7 +818,19 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 	if (new) {
 		swap_priority->owner = memcg;
 		list_add_rcu(&swap_priority->link, &swap_cgroup_priority_list);
-		memcg->swap_priority = swap_priority;
+		/* If an inherited swap priority exists, update 'effective'. */
+		if (memcg->swap_priority) {
+			swap_priority->effective = memcg->swap_priority;
+			memcg->swap_priority = swap_priority;
+		} else {
+			swap_priority->effective = swap_priority;
+			memcg->swap_priority = swap_priority;
+			/*
+			 * Might be a top-level parent memcg, so propagate
+			 * the effective priority.
+			 */
+			propagate_swap_cgroup_priority(memcg, swap_priority);
+		}
 
 		for (int i = 0; i < nr_swapfiles; i++) {
 			if (!swap_priority->pnode[i]->swap) {
@@ -699,12 +843,13 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 
 	spin_unlock(&swap_avail_lock);
 	spin_unlock(&swap_lock);
-
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 	return 0;
 
 error_locked:
 	spin_unlock(&swap_avail_lock);
 	spin_unlock(&swap_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 
 	if (!new)
 		return err;
@@ -717,6 +862,7 @@ static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 {
 	struct swap_cgroup_priority *swap_priority = memcg->swap_priority;
 
+	lockdep_assert_held(&swap_cgroup_priority_inherit_lck);
 	lockdep_assert_held(&swap_avail_lock);
 
 	if (!swap_priority)
@@ -727,13 +873,16 @@ static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 		return;
 
 	rcu_assign_pointer(memcg->swap_priority, NULL);
+	propagate_swap_cgroup_priority(memcg, NULL);
 	list_del_rcu(&swap_priority->link);
 	call_rcu(&swap_priority->rcu, rcu_free_swap_cgroup_priority);
 }
 
 void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 {
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
 	spin_lock(&swap_avail_lock);
 	__delete_swap_cgroup_priority(memcg);
 	spin_unlock(&swap_avail_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 }
diff --git a/mm/swap_cgroup_priority.h b/mm/swap_cgroup_priority.h
index 253e95623270..5d16b63d12e0 100644
--- a/mm/swap_cgroup_priority.h
+++ b/mm/swap_cgroup_priority.h
@@ -39,8 +39,11 @@ void deactivate_swap_cgroup_priority(struct swap_info_struct *swp,
 			bool swapoff);
 int prepare_swap_cgroup_priority(int type);
 void show_swap_cgroup_priority(struct seq_file *m);
+void show_swap_cgroup_priority_effective(struct seq_file *m);
 void get_swapdev_id(struct swap_info_struct *si);
 void purge_swap_cgroup_priority(void);
+struct swap_cgroup_priority *inherit_swap_cgroup_priority(
+	struct mem_cgroup *parent);
 bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg, swp_entry_t *entry,
 				int order);
 void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
-- 
2.34.1
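For illustration, here is a minimal userspace model of the 'effective' pointer
scheme used by the inheritance code above (every name in this sketch, such as
prio_table and propagate(), is hypothetical; this is not the kernel
implementation):

	#include <stdio.h>

	/* Simplified model: a cgroup either owns a priority table or
	 * reuses its nearest configured ancestor's table through
	 * 'effective', so the allocator needs one pointer load per
	 * swap-out instead of an ancestor walk. */
	struct prio_table { int default_prio; };

	struct cgroup {
		struct cgroup *parent;
		struct prio_table *own;		/* set when configured directly */
		struct prio_table *effective;	/* table allocations actually use */
	};

	/* Pre-order update: a parent is fixed up before its children,
	 * mirroring the comment on propagate_swap_cgroup_priority(). */
	static void propagate(struct cgroup *g)
	{
		g->effective = g->own ? g->own :
			       (g->parent ? g->parent->effective : NULL);
	}

	int main(void)
	{
		struct prio_table t = { .default_prio = -1 };
		struct cgroup root = { .own = &t };
		struct cgroup child = { .parent = &root };

		propagate(&root);
		propagate(&child);	/* child inherits root's table */
		printf("%d\n", child.effective->default_prio);	/* prints -1 */
		return 0;
	}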
From nobody Mon Oct 6 21:02:11 2025
From: Youngjun Park <youngjun.park@lge.com>
To: akpm@linux-foundation.org, hannes@cmpxchg.org
Cc: mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev, shikemeng@huaweicloud.com, kasong@tencent.com,
	nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	gunho.lee@lge.com, iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
	Youngjun Park <youngjun.park@lge.com>
Subject: [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters
Date: Thu, 17 Jul 2025 05:20:06 +0900
Message-Id: <20250716202006.3640584-5-youngjun.park@lge.com>
In-Reply-To: <20250716202006.3640584-1-youngjun.park@lge.com>
References: <20250716202006.3640584-1-youngjun.park@lge.com>

This patch introduces a new swap allocation mechanism that supports
per-cgroup per-CPU swap device caches, combined with per-device
per-CPU cluster management.

The existing global swap allocator uses a per-CPU device cache and
cluster shared by all cgroups. Under this model, per-cgroup swap
priorities cannot be honored on the fast path, because allocations do
not distinguish between cgroups.

To address this, we introduce per-cgroup per-CPU swap device caches,
which let fast-path swap allocations respect each cgroup's individual
priority settings.

To avoid an explosion of cluster structures proportional to the number
of cgroups, clusters remain per-device and are shared across cgroups.
This strikes a balance between performance and memory overhead.

Suggested-by: Nhat Pham
Suggested-by: Kairui Song
Signed-off-by: Youngjun Park
---
 include/linux/swap.h      |   7 ++
 mm/swap_cgroup_priority.c | 156 +++++++++++++++++++++++++++++++++++++-
 mm/swap_cgroup_priority.h |  39 ++++++++++
 mm/swapfile.c             |  47 +++++++-----
 4 files changed, 228 insertions(+), 21 deletions(-)
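The split between the two caches can be seen in the two structures the patch
adds, copied here in simplified form from the hunks below (SWAP_NR_ORDERS
collapses to 1 when THP swap is disabled):

	#define SWAP_NR_ORDERS 1	/* no THP swap in this sketch */

	struct swap_info_struct;	/* opaque here */

	/* One per (cgroup, CPU): the device the last order-N allocation
	 * used, retried first on the next allocation (the fast path). */
	struct percpu_swap_device {
		struct swap_info_struct *si[SWAP_NR_ORDERS];
	};

	/* One per (device, CPU), shared by every cgroup: the likely next
	 * cluster offset. Keeping this per device bounds the footprint
	 * to O(devices x CPUs) rather than O(cgroups x devices x CPUs). */
	struct percpu_cluster {
		unsigned int next[SWAP_NR_ORDERS];
	};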
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bfddbec2ee28..ab15f4c103a1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -283,6 +283,12 @@ enum swap_cluster_flags {
 #define SWAP_NR_ORDERS 1
 #endif
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+struct percpu_cluster {
+	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
+};
+#endif
+
 /*
  * We keep using the same cluster for rotational devices so IO will be
  * sequential. The purpose is to optimize SWAP throughput on these devices.
@@ -341,6 +347,7 @@ struct swap_info_struct {
 	struct list_head discard_clusters; /* discard clusters list */
 #ifdef CONFIG_SWAP_CGROUP_PRIORITY
 	u64 id;
+	struct percpu_cluster __percpu *percpu_cluster; /* per-CPU swap location */
 #endif
 	struct plist_node avail_lists[]; /*
					   * entries in swap_avail_heads, one
diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
index 84e876b77f01..f960c3dcab48 100644
--- a/mm/swap_cgroup_priority.c
+++ b/mm/swap_cgroup_priority.c
@@ -21,6 +21,17 @@
 #include "swap_cgroup_priority.h"
 #include "memcontrol-v1.h"
 
+/*
+ * We maintain a device cache per cgroup per CPU, but the underlying
+ * cluster cache itself is managed per swap device. This keeps each
+ * swap_cgroup_priority entry from carrying its own cluster state as
+ * the number of such entries grows.
+ */
+struct percpu_swap_device {
+	struct swap_info_struct *si[SWAP_NR_ORDERS];
+};
+
 static DEFINE_MUTEX(swap_cgroup_priority_inherit_lck);
 static LIST_HEAD(swap_cgroup_priority_list);
 
@@ -49,6 +60,7 @@ static LIST_HEAD(swap_cgroup_priority_list);
  * least_priority - Current lowest priority.
  * distance - Priority differences from the global swap priority.
  * default_prio - Default priority for this cgroup.
+ * pcpu_swapdev - Per-CPU swap device cache.
  * plist - Priority list head.
  */
 struct swap_cgroup_priority {
@@ -64,6 +76,7 @@ struct swap_cgroup_priority {
 	int least_priority;
 	s8 distance;
 	int default_prio;
+	struct percpu_swap_device __percpu *pcpu_swapdev;
 	struct plist_head plist[];
 };
 
@@ -132,6 +145,21 @@ static struct swap_cgroup_priority *get_effective_swap_cgroup_priority(
 	return swap_priority->effective;
 }
 
+static struct swap_cgroup_priority *get_effective_swap_cgroup_priority_rcu(
+	struct mem_cgroup *memcg)
+{
+	struct swap_cgroup_priority *swap_priority;
+
+	if (!memcg)
+		return NULL;
+
+	swap_priority = rcu_dereference(memcg->swap_priority);
+	if (!swap_priority)
+		return NULL;
+
+	return rcu_dereference(swap_priority->effective);
+}
+
 static bool validate_effective_swap_cgroup_priority(
 	struct mem_cgroup *memcg,
 	struct swap_cgroup_priority **swap_priority)
@@ -172,6 +200,9 @@ static void free_swap_cgroup_priority_pnode(
 static void free_swap_cgroup_priority(
 	struct swap_cgroup_priority *swap_priority)
 {
+	if (swap_priority->pcpu_swapdev)
+		free_percpu(swap_priority->pcpu_swapdev);
+
 	for (int i = 0; i < MAX_SWAPFILES; i++)
 		free_swap_cgroup_priority_pnode(swap_priority->pnode[i]);
 
@@ -187,6 +218,12 @@ static struct swap_cgroup_priority *alloc_swap_cgroup_priority(void)
 	if (!swap_priority)
 		return NULL;
 
+	swap_priority->pcpu_swapdev = alloc_percpu(struct percpu_swap_device);
+	if (!swap_priority->pcpu_swapdev) {
+		kvfree(swap_priority);
+		return NULL;
+	}
+
 	/*
	 * Pre-allocates pnode array up to nr_swapfiles at init.
	 * Individual pnodes are assigned on swapon, but not freed
@@ -326,10 +363,34 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
 	unsigned long offset;
 	int node;
 
-	/*
-	 * TODO: Per-cpu swap cluster cache can't be used directly
-	 * as cgroup-specific priorities may select different devices.
-	 */
+	rcu_read_lock();
+	swap_priority = get_effective_swap_cgroup_priority_rcu(memcg);
+	if (!swap_priority) {
+		rcu_read_unlock();
+		return false;
+	}
+
+	/* Fast path */
+	si = this_cpu_read(swap_priority->pcpu_swapdev->si[order]);
+	if (si && get_swap_device_info(si)) {
+		offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
+		if (offset) {
+			*entry = swp_entry(si->type, offset);
+			/*
+			 * Protected by 'percpu_swap_cluster' local_lock;
+			 * CPU migration is disabled during this operation.
+			 */
+			this_cpu_write(swap_priority->pcpu_swapdev->si[order],
+				       si);
+			put_swap_device(si);
+			rcu_read_unlock();
+
+			return true;
+		}
+		put_swap_device(si);
+	}
+	rcu_read_unlock();
+
+	/* Slow path */
 	spin_lock(&swap_avail_lock);
 	node = numa_node_id();
 
@@ -350,6 +411,14 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
 		if (get_swap_device_info(si)) {
 			offset = cluster_alloc_swap_entry(si, order,
 							  SWAP_HAS_CACHE);
+			/*
+			 * Protected by 'percpu_swap_cluster' local_lock;
+			 * CPU migration is disabled during this operation.
+			 */
+			if (memcg->swap_priority == swap_priority)
+				this_cpu_write(
+					swap_priority->pcpu_swapdev->si[order],
+					si);
 			put_swap_device(si);
 			if (offset) {
 				*entry = swp_entry(si->type, offset);
@@ -687,6 +756,21 @@ static int __apply_swap_cgroup_priority(
 	return 0;
 }
 
+static int init_swap_cgroup_priority_pcpu_swapdev_cache(
+	struct swap_cgroup_priority *swap_priority)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct percpu_swap_device *pcp_swap_dev =
+			per_cpu_ptr(swap_priority->pcpu_swapdev, cpu);
+
+		for (int i = 0; i < SWAP_NR_ORDERS; i++)
+			pcp_swap_dev->si[i] = NULL;
+	}
+
+	return 0;
+}
+
 /*
  * If this is a top-level swap_cgroup_priority, propagation is needed.
  * We walk the hierarchy with for_each_mem_cgroup_tree(); because the
@@ -795,6 +879,8 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 	for_each_node(nid)
 		plist_head_init(&swap_priority->plist[nid]);
 
+	init_swap_cgroup_priority_pcpu_swapdev_cache(swap_priority);
+
 prio_set:
 	spin_lock(&swap_lock);
 	spin_lock(&swap_avail_lock);
@@ -843,6 +929,23 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 
 	spin_unlock(&swap_avail_lock);
 	spin_unlock(&swap_lock);
+	/*
+	 * XXX: We cannot fully synchronize with swap_alloc_cgroup_priority()
+	 * when updating the next si. Still, we flush the devices cached in
+	 * this swap_priority as reliably as we can.
+	 */
+	if (id != DEFAULT_ID &&
+	    swap_priority == swap_priority->effective && !new) {
+		int cpu;
+		struct swap_info_struct **pcp_si;
+
+		for_each_possible_cpu(cpu) {
+			pcp_si = per_cpu_ptr(
+				swap_priority->pcpu_swapdev->si, cpu);
+			for (int i = 0; i < SWAP_NR_ORDERS; i++)
+				pcp_si[i] = NULL;
+		}
+	}
 	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 	return 0;
 
@@ -886,3 +989,48 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 	spin_unlock(&swap_avail_lock);
 	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 }
+
+void flush_swap_cgroup_priority_percpu_swapdev(struct swap_info_struct *si)
+{
+	int cpu, i;
+	struct swap_info_struct **pcp_si;
+	struct swap_cgroup_priority *swap_priority;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(swap_priority,
+				&swap_cgroup_priority_list, link) {
+		for_each_possible_cpu(cpu) {
+			pcp_si = per_cpu_ptr(
+				swap_priority->pcpu_swapdev->si, cpu);
+
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cmpxchg(&pcp_si[i], si, NULL);
+		}
+	}
+	rcu_read_unlock();
+}
+
+bool alloc_percpu_swap_cluster(struct swap_info_struct *si)
+{
+	int cpu;
+
+	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+	if (!si->percpu_cluster)
+		return false;
+
+	for_each_possible_cpu(cpu) {
+		struct percpu_cluster *cluster;
+
+		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+		for (int i = 0; i < SWAP_NR_ORDERS; i++)
+			cluster->next[i] = SWAP_ENTRY_INVALID;
+	}
+
+	return true;
+}
+
+void free_percpu_swap_cluster(struct swap_info_struct *si)
+{
+	free_percpu(si->percpu_cluster);
+	si->percpu_cluster = NULL;
+}
diff --git a/mm/swap_cgroup_priority.h b/mm/swap_cgroup_priority.h
index 5d16b63d12e0..815822ebd0d1 100644
--- a/mm/swap_cgroup_priority.h
+++ b/mm/swap_cgroup_priority.h
@@ -47,6 +47,22 @@ struct swap_cgroup_priority *inherit_swap_cgroup_priority(
 bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg, swp_entry_t *entry,
 				int order);
 void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
+void flush_swap_cgroup_priority_percpu_swapdev(struct swap_info_struct *si);
+
+bool alloc_percpu_swap_cluster(struct swap_info_struct *si);
+void free_percpu_swap_cluster(struct swap_info_struct *si);
+static inline void write_percpu_swap_cluster_next(struct swap_info_struct *si,
+						  int order,
+						  unsigned int next)
+{
+	this_cpu_write(si->percpu_cluster->next[order], next);
+}
+
+static inline unsigned int read_percpu_swap_cluster_next(
+	struct swap_info_struct *si, int order)
+{
+	return __this_cpu_read(si->percpu_cluster->next[order]);
+}
 #else
 int swap_node(struct swap_info_struct *si);
 unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
@@ -85,5 +101,28 @@ static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
 static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 {
 }
+static inline void flush_swap_cgroup_priority_percpu_swapdev(
+	struct swap_info_struct *si)
+{
+}
+static inline bool alloc_percpu_swap_cluster(struct swap_info_struct *si)
+{
+	return true;
+}
+static inline void free_percpu_swap_cluster(struct swap_info_struct *si)
+{
+}
+static inline void write_percpu_swap_cluster_next(struct swap_info_struct *si,
+						  int order,
+						  unsigned int next)
+{
+}
+
+static inline unsigned int read_percpu_swap_cluster_next(
+	struct swap_info_struct *si, int order)
+{
+	return SWAP_ENTRY_INVALID;
+}
 #endif
 #endif
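In outline, the allocator uses these helpers only on the SSD path, reading
the per-(device, CPU) hint before scanning and recording the resume point
afterwards. A toy userspace model of that protocol (read_next(), write_next()
and struct device are stand-ins invented for this sketch, not the kernel
helpers):

	#include <stdio.h>

	#define SWAP_NR_ORDERS 1
	#define SWAP_ENTRY_INVALID 0

	struct device {
		int solidstate;
		unsigned int next[SWAP_NR_ORDERS];	/* cf. percpu_cluster.next */
	};

	/* No hint (or rotational device): caller falls back to a scan. */
	static unsigned int read_next(struct device *d, int order)
	{
		return d->solidstate ? d->next[order] : SWAP_ENTRY_INVALID;
	}

	static void write_next(struct device *d, int order, unsigned int next)
	{
		if (d->solidstate)
			d->next[order] = next;
	}

	int main(void)
	{
		struct device ssd = { .solidstate = 1 };

		printf("%u\n", read_next(&ssd, 0));	/* 0: no hint yet, scan */
		write_next(&ssd, 0, 4096);		/* remember resume point */
		printf("%u\n", read_next(&ssd, 0));	/* 4096: resume here */
		return 0;
	}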
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bfd0532ad250..6a5ac9962e9f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -817,12 +817,15 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
+
 	if (si->flags & SWP_SOLIDSTATE) {
 		this_cpu_write(percpu_swap_cluster.offset[order], next);
 		this_cpu_write(percpu_swap_cluster.si[order], si);
+		write_percpu_swap_cluster_next(si, order, next);
 	} else {
 		si->global_cluster->next[order] = next;
 	}
+
 	return found;
 }
 
@@ -892,26 +895,29 @@ unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 	if (order && !(si->flags & SWP_BLKDEV))
 		return 0;
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+		offset = read_percpu_swap_cluster_next(si, order);
+	} else {
 		/* Serialize HDD SWAP allocation for each device. */
 		spin_lock(&si->global_cluster_lock);
 		offset = si->global_cluster->next[order];
-		if (offset == SWAP_ENTRY_INVALID)
-			goto new_cluster;
+	}
 
-		ci = lock_cluster(si, offset);
-		/* Cluster could have been used by another order */
-		if (cluster_is_usable(ci, order)) {
-			if (cluster_is_empty(ci))
-				offset = cluster_offset(si, ci);
-			found = alloc_swap_scan_cluster(si, ci, offset,
-							order, usage);
-		} else {
-			unlock_cluster(ci);
-		}
-		if (found)
-			goto done;
+	if (offset == SWAP_ENTRY_INVALID)
+		goto new_cluster;
+
+	ci = lock_cluster(si, offset);
+	/* Cluster could have been used by another order */
+	if (cluster_is_usable(ci, order)) {
+		if (cluster_is_empty(ci))
+			offset = cluster_offset(si, ci);
+		found = alloc_swap_scan_cluster(si, ci, offset,
+						order, usage);
+	} else {
+		unlock_cluster(ci);
 	}
+	if (found)
+		goto done;
 
 new_cluster:
 	ci = isolate_lock_cluster(si, &si->free_clusters);
@@ -991,6 +997,7 @@ unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 done:
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
+
 	return found;
 }
 
@@ -2674,6 +2681,8 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
 			cmpxchg(&pcp_si[i], si, NULL);
 	}
+
+	flush_swap_cgroup_priority_percpu_swapdev(si);
 }
 
 
@@ -2802,6 +2811,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
+	free_percpu_swap_cluster(p);
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
 	vfree(swap_map);
@@ -2900,7 +2910,6 @@ static void swap_stop(struct seq_file *swap, void *v)
 	mutex_unlock(&swapon_mutex);
 }
 
-
 #ifndef CONFIG_SWAP_CGROUP_PRIORITY
 static int swap_show(struct seq_file *swap, void *v)
 {
@@ -3239,7 +3248,10 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+		if (!alloc_percpu_swap_cluster(si))
+			goto err_free;
+	} else {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
 					     GFP_KERNEL);
 		if (!si->global_cluster)
@@ -3532,6 +3544,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
bad_swap_unlock_inode:
 	inode_unlock(inode);
bad_swap:
+	free_percpu_swap_cluster(si);
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-- 
2.34.1
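For reference, the swapoff-time flush in
flush_swap_cgroup_priority_percpu_swapdev() relies on compare-and-swap so
that only slots still pointing at the departing device are cleared. A minimal
userspace model of that pattern (all names here are invented, and
__sync_bool_compare_and_swap stands in for the kernel's cmpxchg()):

	#include <stdio.h>

	#define NCACHE 4

	static int dev_a, dev_b;	/* stand-ins for swap devices */
	static int *cache[NCACHE] = { &dev_a, &dev_b, &dev_a, NULL };

	/* Clear every slot that still caches 'si', leaving slots that
	 * already moved on untouched, just as the patch's
	 * cmpxchg(&pcp_si[i], si, NULL) loop does. */
	static void flush_device(int *si)
	{
		for (int i = 0; i < NCACHE; i++)
			__sync_bool_compare_and_swap(&cache[i], si, NULL);
	}

	int main(void)
	{
		flush_device(&dev_a);
		for (int i = 0; i < NCACHE; i++)
			printf("%p\n", (void *)cache[i]);	/* dev_a slots are NULL */
		return 0;
	}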