From: youngjun.park@lge.com
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org,
    roman.gushchin@linux.dev, shakeel.butt@linux.dev, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, shikemeng@huaweicloud.com, kasong@tencent.com,
    nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
    muchun.song@linux.dev, iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
    gunho.lee@lge.com, "youngjun.park"
Subject: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per
 cgroup swap priority control
Date: Thu, 12 Jun 2025 19:37:43 +0900
Message-Id: <20250612103743.3385842-2-youngjun.park@lge.com>
In-Reply-To: <20250612103743.3385842-1-youngjun.park@lge.com>
References: <20250612103743.3385842-1-youngjun.park@lge.com>

From: "youngjun.park"

We are working in a constrained environment where devices often operate
under limited resources. To improve overall system responsiveness,
especially under memory pressure, we aim to utilize idle devices as swap
targets over the network.

In this context, we propose a mechanism to control swap priorities on a
per-cgroup basis. By assigning different swap priorities to each cgroup,
we can ensure that critical applications maintain higher responsiveness
and stability, while less important workloads experience deferred swap
activity.

A detailed explanation of the implementation follows.

1. Object Description
- swap_cgroup_priority
  This object manages an array of swap_cgroup_priority_pnode entries,
  each of which points to a swap device and its associated priority.
- swap_cgroup_priority_pnode
  This object points to a swap device and holds the priority assigned
  through the interface.

2. Object Lifecycle
- A swap_cgroup_priority and its swap_cgroup_priority_pnode entries
  share the same lifetime.
- Objects are managed through the memory.swap.priority interface.
  Each swap device is assigned a unique ID at swapon time, which can be
  queried via the memory.swap.priority interface.
  Example:
    $ cat memory.swap.priority
    Inactive
    /dev/sdb    unique:1    prio:10
    /dev/sdc    unique:2    prio:5
- Creation
    echo "<unique id of swapdev 1>:<priority>,<unique id of swapdev 2>:<priority>,..." > memory.swap.priority
  e.g., for the two devices shown above:
    echo "1:10,2:5" > memory.swap.priority
- Destruction
  Reset through the memory.swap.priority interface:
    echo "" > memory.swap.priority
  The object is also destroyed when its mem_cgroup is removed.

3. Priority Mechanism
- Follows the original concept of swap priority (this includes the
  automatic binding of swap devices to NUMA nodes).
- Swap on/off propagation: when swapon is executed, the settings are
  propagated; when swapoff is executed, they are removed. The
  implementation of swap on/off propagation, and the mechanism for
  iterating through the configured swap cgroup priorities, are in the
  next patch.

Signed-off-by: Youngjun Park
Suggested-by: Joonsoo Kim
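To make the NUMA-aware mapping in section 3 concrete, here is a minimal
editorial sketch (not part of the patch; init_pnode_weight() is a
hypothetical helper) of the per-node plist weight computation that
create_swap_cgroup_priority() below performs inline. plists sort in
ascending order, so a higher priority is stored as a more negative
weight; a negative configured priority falls back to the device's own
priority and keeps the NUMA-local device preferred:

        /*
         * Hypothetical helper, for illustration only: derive one pnode's
         * plist weight on node @nid. A non-negative configured priority
         * overrides the device priority on every node; a negative one
         * mimics the default NUMA binding (weight 1 on the local node).
         */
        static void init_pnode_weight(struct swap_cgroup_priority_pnode *pnode,
                                      struct swap_info_struct *si,
                                      int configured_prio, int nid)
        {
                if (configured_prio >= 0) {
                        pnode->prio = configured_prio;
                        plist_node_init(&pnode->avail_lists[nid], -pnode->prio);
                } else {
                        pnode->prio = si->prio;
                        plist_node_init(&pnode->avail_lists[nid],
                                        swap_node(si) == nid ? 1 : -pnode->prio);
                }
        }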
---
 include/linux/memcontrol.h |   3 +
 include/linux/swap.h       |   3 +
 mm/Kconfig                 |   7 ++
 mm/memcontrol.c            |  58 ++++++++++++
 mm/swap.h                  |  10 ++
 mm/swap_cgroup_priority.c  | 202 +++++++++++++++++++++++++++++++++++++
 mm/swapfile.c              |   6 ++
 7 files changed, 289 insertions(+)
 create mode 100644 mm/swap_cgroup_priority.c

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..625e59f9ecd2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -218,6 +218,9 @@ struct mem_cgroup {
 	bool zswap_writeback;
 #endif
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	struct swap_cgroup_priority *swap_priority;
+#endif
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..49b73911c1bd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -339,6 +339,9 @@ struct swap_info_struct {
 	struct work_struct discard_work; /* discard worker */
 	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	int unique_id;
+#endif
 	struct plist_node avail_lists[]; /*
					   * entries in swap_avail_heads, one
					   * entry per node.
diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..ff4b0ef867f4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -190,6 +190,13 @@ config ZSMALLOC_CHAIN_SIZE
 
	  For more information, see zsmalloc documentation.
 
+config SWAP_CGROUP_PRIORITY
+	bool "Per-cgroup swap device priority"
+	default n
+	depends on SWAP && CGROUPS
+	help
+	  This option allows swap device priorities to be set per cgroup.
+
 menu "Slab allocator options"
 
 config SLUB
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..628ffb048489 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -69,6 +69,7 @@
 #include
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "swap.h"
 
 #include
 
@@ -3702,6 +3703,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	lru_gen_exit_memcg(memcg);
 	memcg_wb_domain_exit(memcg);
+	delete_swap_cgroup_priority(memcg);
 	__mem_cgroup_free(memcg);
 }
 
@@ -5403,6 +5405,54 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+static ssize_t swap_cgroup_priority_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int ret;
+	int unique[MAX_SWAPFILES] = {0, };
+	int prios[MAX_SWAPFILES] = {0, };
+	int idx = 0;
+	char *token;
+
+	buf = strstrip(buf);
+	if (strlen(buf) == 0) {
+		delete_swap_cgroup_priority(memcg);
+		return nbytes;
+	}
+
+	while ((token = strsep(&buf, ",")) != NULL) {
+		char *token2 = token;
+		char *token3;
+
+		/* Guard the fixed-size arrays against oversized input. */
+		if (idx >= MAX_SWAPFILES)
+			return -EINVAL;
+
+		token3 = strsep(&token2, ":");
+		if (!token2 || !token3)
+			return -EINVAL;
+
+		if (kstrtoint(token3, 10, &unique[idx]) ||
+		    kstrtoint(token2, 10, &prios[idx]))
+			return -EINVAL;
+
+		idx++;
+	}
+
+	if ((ret = create_swap_cgroup_priority(memcg, unique, prios, idx)))
+		return ret;
+
+	return nbytes;
+}
+
+static int swap_cgroup_priority_show(struct seq_file *m, void *v)
+{
+	show_swap_device_unique_id(m);
+	return 0;
+}
+#endif
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -5435,6 +5485,14 @@ static struct cftype swap_files[] = {
 		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
 		.seq_show = swap_events_show,
 	},
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	{
+		.name = "swap.priority",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_cgroup_priority_show,
+		.write = swap_cgroup_priority_write,
+	},
+#endif
 	{ }	/* terminate */
 };
 
diff --git a/mm/swap.h b/mm/swap.h
index 2269eb9df0af..cd2649c632ed 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -106,6 +106,16 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
 	return find_next_bit(sis->zeromap, end, start) - start;
 }
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+int create_swap_cgroup_priority(struct mem_cgroup *memcg,
+				int unique[], int prio[], int nr);
+void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
+void show_swap_device_unique_id(struct seq_file *m);
+#else
+static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg) {}
+static inline void get_swap_unique_id(struct swap_info_struct *si) {}
+#endif
+
 #else /* CONFIG_SWAP */
 struct swap_iocb;
 static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
new file mode 100644
index 000000000000..b3e20b676680
--- /dev/null
+++ b/mm/swap_cgroup_priority.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* per mem_cgroup */
+struct swap_cgroup_priority {
+	struct list_head link;
+	/* XXX: flattening this into one allocation is hard;
+	 * the variable-length array is our enemy. */
+	struct swap_cgroup_priority_pnode *pnode[MAX_SWAPFILES];
+	struct plist_head plist[];
+};
+
+/* per mem_cgroup & per swap device node */
+struct swap_cgroup_priority_pnode {
+	struct swap_info_struct *swap;
+	int prio;
+	struct plist_node avail_lists[];
+};
+
+/* per swap device unique id counter */
+static atomic_t swap_unique_id_counter;
+
+/* active swap_cgroup_priority list */
+static LIST_HEAD(swap_cgroup_priority_list);
+
+/* XXX: memcontrol should not have to know swap_cgroup_priority internals. */
+void show_swap_device_unique_id(struct seq_file *m)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	spin_lock(&swap_lock);
+	/* XXX: what output layout is most readable? */
+	seq_printf(m, "%s\n", memcg->swap_priority ? "Active" : "Inactive");
+	for (int i = 0; i < nr_swapfiles; i++) {
+		struct swap_info_struct *si = swap_info[i];
+
+		if (!(si->flags & SWP_USED))
+			continue;
+
+		seq_file_path(m, si->swap_file, "\t\n\\");
+		seq_printf(m, "\tunique:%d\t", si->unique_id);
+
+		if (!memcg->swap_priority) {
+			seq_printf(m, " prio:%d\n", si->prio);
+			continue;
+		}
+
+		seq_printf(m, "prio:%d\n",
+			   memcg->swap_priority->pnode[i]->prio);
+	}
+	spin_unlock(&swap_lock);
+}
+
+static void get_swap_unique_id(struct swap_info_struct *si)
+{
+	si->unique_id = atomic_add_return(1, &swap_unique_id_counter);
+}
+
+int create_swap_cgroup_priority(struct mem_cgroup *memcg,
+				int unique[], int prio[], int nr)
+{
+	bool b_found = false;
+	struct swap_cgroup_priority *swap_priority, *old_swap_priority = NULL;
+	int nid;
+
+	/* Fast check */
+	if (nr != nr_swapfiles)
+		return -EINVAL;
+
+	/*
+	 * XXX: always allocate a new object and exchange it.
+	 * Reusing the old object may be simpler and better.
+	 */
+	swap_priority = kvmalloc(struct_size(swap_priority, plist, nr_node_ids),
+				 GFP_KERNEL);
+
+	if (!swap_priority)
+		return -ENOMEM;
+
+	/* XXX: preallocate? Allocating at swapon time may be better. */
+	for (int i = 0; i < MAX_SWAPFILES; i++) {
+		swap_priority->pnode[i] =
+			kvmalloc(struct_size(swap_priority->pnode[0],
+					     avail_lists, nr_node_ids),
+				 GFP_KERNEL);
+
+		if (!swap_priority->pnode[i]) {
+			for (int j = 0; j < i; j++)
+				kvfree(swap_priority->pnode[j]);
+
+			kvfree(swap_priority);
+			return -ENOMEM;
+		}
+	}
+
+	INIT_LIST_HEAD(&swap_priority->link);
+	for_each_node(nid)
+		plist_head_init(&swap_priority->plist[nid]);
+
+	spin_lock(&swap_lock);
+	spin_lock(&swap_avail_lock);
+
+	/* swap on/off ran under us. */
+	if (nr != nr_swapfiles)
+		goto error;
+
+	/* TODO: naive search;
+	 * make it fast. */
+	for (int i = 0; i < nr; i++) {
+		b_found = false;
+		for (int j = 0; j < nr_swapfiles; j++) {
+			struct swap_info_struct *si = swap_info[j];
+			struct swap_cgroup_priority_pnode *pnode
+				= swap_priority->pnode[j];
+
+			if (si->unique_id != unique[i])
+				continue;
+
+			/* swap off under us */
+			if (!(si->flags & SWP_USED))
+				goto error;
+
+			int k;
+			for_each_node(k) {
+				if (prio[i] >= 0) {
+					pnode->prio = prio[i];
+					plist_node_init(&pnode->avail_lists[k],
+							-pnode->prio);
+				} else {
+					pnode->prio = si->prio;
+					if (swap_node(si) == k)
+						plist_node_init(
+							&pnode->avail_lists[k],
+							1);
+					else
+						plist_node_init(
+							&pnode->avail_lists[k],
+							-pnode->prio);
+				}
+
+				plist_add(&pnode->avail_lists[k],
+					  &swap_priority->plist[k]);
+			}
+
+			pnode->swap = si;
+			b_found = true;
+			break;
+		}
+
+		/* no matching unique id */
+		if (!b_found)
+			goto error;
+	}
+
+	if (memcg->swap_priority) {
+		old_swap_priority = memcg->swap_priority;
+		list_del(&old_swap_priority->link);
+	}
+
+	list_add(&swap_priority->link, &swap_cgroup_priority_list);
+
+	memcg->swap_priority = swap_priority;
+	spin_unlock(&swap_avail_lock);
+	spin_unlock(&swap_lock);
+
+	if (old_swap_priority) {
+		for (int i = 0; i < MAX_SWAPFILES; i++)
+			kvfree(old_swap_priority->pnode[i]);
+		kvfree(old_swap_priority);
+	}
+
+	return 0;
+
+error:
+	spin_unlock(&swap_avail_lock);
+	spin_unlock(&swap_lock);
+
+	for (int i = 0; i < MAX_SWAPFILES; i++)
+		kvfree(swap_priority->pnode[i]);
+	kvfree(swap_priority);
+
+	return -EINVAL;
+}
+
+void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
+{
+	struct swap_cgroup_priority *swap_priority;
+
+	spin_lock(&swap_avail_lock);
+	swap_priority = memcg->swap_priority;
+	if (!swap_priority) {
+		spin_unlock(&swap_avail_lock);
+		return;
+	}
+	memcg->swap_priority = NULL;
+	list_del(&swap_priority->link);
+	spin_unlock(&swap_avail_lock);
+
+	/* wait for show_swap_device_unique_id() readers */
+	synchronize_rcu();
+
+	for (int i = 0; i < MAX_SWAPFILES; i++)
+		kvfree(swap_priority->pnode[i]);
+	kvfree(swap_priority);
+}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 68ce283e84be..f8e48dd2381e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -126,6 +126,10 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
 	.offset = { SWAP_ENTRY_INVALID },
 	.lock = INIT_LOCAL_LOCK(),
 };
+/* TODO: better choice? */
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+#include "swap_cgroup_priority.c"
+#endif
 
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
@@ -3462,6 +3466,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto free_swap_zswap;
 	}
 
+	get_swap_unique_id(si);
+
 	mutex_lock(&swapon_mutex);
 	prio = -1;
 	if (swap_flags & SWAP_FLAG_PREFER)
-- 
2.34.1

From: youngjun.park@lge.com
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org,
    roman.gushchin@linux.dev, shakeel.butt@linux.dev, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, shikemeng@huaweicloud.com, kasong@tencent.com,
    nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
    muchun.song@linux.dev, iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
    gunho.lee@lge.com, "youngjun.park"
Subject: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechanism
 to the swap layer
Date: Thu, 12 Jun 2025 19:37:44 +0900
Message-Id: <20250612103743.3385842-3-youngjun.park@lge.com>
In-Reply-To: <20250612103743.3385842-1-youngjun.park@lge.com>
References: <20250612103743.3385842-1-youngjun.park@lge.com>

From: "youngjun.park"

This patch implements swap device selection and swap on/off propagation
when a cgroup-specific swap priority is set.

There is one workaround in this implementation, as follows: the current
per-cpu swap cluster cache selects a swap device based solely on CPU
locality, overriding the swap cgroup's configured priorities. Therefore,
when a swap cgroup priority is assigned, we fall back to using per-cpu
clusters per swap device, similar to the previous behavior. A proper fix
for this workaround will be evaluated in a follow-up patch.

Signed-off-by: Youngjun Park
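In outline, the allocation entry point after this patch behaves as
sketched below. This is an editorial condensation of the
folio_alloc_swap() change in the diff, not literal patch code;
folio_alloc_swap_outline() is a hypothetical name and the memcg swap
charging/accounting steps are omitted:

        /* Condensed from the diff below, for orientation only. */
        static int folio_alloc_swap_outline(struct folio *folio, int order,
                                            swp_entry_t *entry)
        {
                /* Cgroup-priority path: walk this memcg's own per-node plist. */
                if (swap_alloc_cgroup_priority(folio_memcg(folio), entry, order))
                        return 0;

                /* Default path: CPU-local device cache, then the global plist. */
                local_lock(&percpu_swap_cluster.lock);
                if (!swap_alloc_fast(entry, order))
                        swap_alloc_slow(entry, order);
                local_unlock(&percpu_swap_cluster.lock);

                return entry->val ? 0 : -ENOMEM;
        }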
---
 include/linux/swap.h      |   8 +++
 mm/swap.h                 |   8 +++
 mm/swap_cgroup_priority.c | 133 ++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c             | 125 ++++++++++++++++++++++++-----------
 4 files changed, 238 insertions(+), 36 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 49b73911c1bd..d158b0d5c997 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -283,6 +283,13 @@ enum swap_cluster_flags {
 #define SWAP_NR_ORDERS 1
 #endif
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+struct percpu_cluster {
+	local_lock_t lock; /* Protects next[] below */
+	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
+};
+#endif
+
 /*
  * We keep using same cluster for rotational device so IO will be sequential.
  * The purpose is to optimize SWAP throughput on these device.
@@ -341,6 +348,7 @@ struct swap_info_struct {
 	struct list_head discard_clusters; /* discard clusters list */
 #ifdef CONFIG_SWAP_CGROUP_PRIORITY
 	int unique_id;
+	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 #endif
 	struct plist_node avail_lists[]; /*
					   * entries in swap_avail_heads, one
diff --git a/mm/swap.h b/mm/swap.h
index cd2649c632ed..cb6d653fe3f1 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -113,7 +113,15 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
 void show_swap_device_unique_id(struct seq_file *m);
 #else
 static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg) {}
+static inline void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapon) {}
+static inline void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapoff) {}
 static inline void get_swap_unique_id(struct swap_info_struct *si) {}
+static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
+					      swp_entry_t *entry, int order)
+{
+	return false;
+}
+
 #endif
 
 #else /* CONFIG_SWAP */
diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
index b3e20b676680..bb18cb251f60 100644
--- a/mm/swap_cgroup_priority.c
+++ b/mm/swap_cgroup_priority.c
@@ -54,6 +54,132 @@ static void get_swap_unique_id(struct swap_info_struct *si)
 	si->unique_id = atomic_add_return(1, &swap_unique_id_counter);
 }
 
+static bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
+				       swp_entry_t *entry, int order)
+{
+	struct swap_cgroup_priority *swap_priority;
+	struct swap_cgroup_priority_pnode *pnode, *next;
+	unsigned long offset;
+	int node;
+
+	if (!memcg)
+		return false;
+
+	spin_lock(&swap_avail_lock);
+priority_check:
+	swap_priority = memcg->swap_priority;
+	if (!swap_priority) {
+		spin_unlock(&swap_avail_lock);
+		return false;
+	}
+
+	node = numa_node_id();
+start_over:
+	plist_for_each_entry_safe(pnode, next, &swap_priority->plist[node],
+				  avail_lists[node]) {
+		struct swap_info_struct *si = pnode->swap;
+		plist_requeue(&pnode->avail_lists[node],
+			      &swap_priority->plist[node]);
+		spin_unlock(&swap_avail_lock);
+
+		if (get_swap_device_info(si)) {
+			offset = cluster_alloc_swap_entry(si,
+					order, SWAP_HAS_CACHE, true);
+			put_swap_device(si);
+			if (offset) {
+				*entry = swp_entry(si->type, offset);
+				return true;
+			}
+			if (order)
+				return false;
+		}
+
+		spin_lock(&swap_avail_lock);
+
+		/* swap_priority was removed or changed under us. */
+		if (swap_priority != memcg->swap_priority)
+			goto priority_check;
+
+		if (plist_node_empty(&next->avail_lists[node]))
+			goto start_over;
+	}
+	spin_unlock(&swap_avail_lock);
+
+	return false;
+}
+
+/* add_to_avail_list (swapon / swap usage > 0) */
+static void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
+						bool swapon)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int i;
+
+	list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
+		struct swap_cgroup_priority_pnode *pnode
+			= swap_priority->pnode[swp->type];
+
+		if (swapon) {
+			pnode->swap = swp;
+			pnode->prio = swp->prio;
+		}
+
+		/* NUMA priority handling */
+		for_each_node(i) {
+			if (swapon) {
+				if (swap_node(swp) == i) {
+					plist_node_init(
+						&pnode->avail_lists[i],
+						1);
+				} else {
+					plist_node_init(
+						&pnode->avail_lists[i],
+						-pnode->prio);
+				}
+			}
+
+			plist_add(&pnode->avail_lists[i],
+				  &swap_priority->plist[i]);
+		}
+	}
+}
+
+/* del_from_avail_list (swapoff / swap usage <= 0) */
+static void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
+						  bool swapoff)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int nid, i;
+
+	list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
+		struct swap_cgroup_priority_pnode *pnode;
+
+		if (swapoff && swp->prio < 0) {
+			/*
+			 * NUMA priority handling:
+			 * mimic swapoff prio adjustment without plist
+			 */
+			for (int i = 0; i < MAX_SWAPFILES; i++) {
+				pnode = swap_priority->pnode[i];
+				if (pnode->prio > swp->prio ||
+				    pnode->swap == swp)
+					continue;
+
+				pnode->prio++;
+				for_each_node(nid) {
+					if (pnode->avail_lists[nid].prio != 1)
+						pnode->avail_lists[nid].prio--;
+				}
+			}
+		}
+
+		pnode = swap_priority->pnode[swp->type];
+		for_each_node(i)
+			plist_del(&pnode->avail_lists[i],
+				  &swap_priority->plist[i]);
+	}
+}
+
 int create_swap_cgroup_priority(struct mem_cgroup *memcg,
 				int unique[], int prio[], int nr)
 {
@@ -183,6 +309,12 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 {
 	struct swap_cgroup_priority *swap_priority;
 
+	/*
+	 * XXX: Would an RCU wait be enough here? No - it cannot protect
+	 * additions to the priority list; swap_avail_lock provides
+	 * that protection. A different object protection scheme
+	 * (e.g. object reference counting) might solve this better.
+	 */
 	spin_lock(&swap_avail_lock);
 	swap_priority = memcg->swap_priority;
 	if (!swap_priority) {
@@ -198,5 +330,6 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 
 	for (int i = 0; i < MAX_SWAPFILES; i++)
 		kvfree(swap_priority->pnode[i]);
+
 	kvfree(swap_priority);
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f8e48dd2381e..28afe4ec0504 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -126,8 +126,12 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
 	.offset = { SWAP_ENTRY_INVALID },
 	.lock = INIT_LOCAL_LOCK(),
 };
-/* TODO: better choice? */
+/* TODO: better arrangement */
 #ifdef CONFIG_SWAP_CGROUP_PRIORITY
+static bool get_swap_device_info(struct swap_info_struct *si);
+static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+					      unsigned char usage, bool is_cgroup_priority);
+static int swap_node(struct swap_info_struct *si);
 #include "swap_cgroup_priority.c"
 #endif
 
@@ -776,7 +780,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
					    struct swap_cluster_info *ci,
					    unsigned long offset,
					    unsigned int order,
-					    unsigned char usage)
+					    unsigned char usage,
+					    bool is_cgroup_priority)
 {
 	unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
 	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
@@ -820,12 +825,19 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
+
 	if (si->flags & SWP_SOLIDSTATE) {
-		this_cpu_write(percpu_swap_cluster.offset[order], next);
-		this_cpu_write(percpu_swap_cluster.si[order], si);
-	} else {
+		if (!is_cgroup_priority) {
+			this_cpu_write(percpu_swap_cluster.offset[order], next);
+			this_cpu_write(percpu_swap_cluster.si[order], si);
+		} else {
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+			__this_cpu_write(si->percpu_cluster->next[order], next);
+#endif
+		}
+	} else
 		si->global_cluster->next[order] = next;
-	}
+
 	return found;
 }
 
@@ -883,7 +895,7 @@ static void swap_reclaim_work(struct work_struct *work)
  * cluster for current CPU too.
  */
 static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
-					      unsigned char usage)
+					      unsigned char usage, bool is_cgroup_priority)
 {
 	struct swap_cluster_info *ci;
 	unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
@@ -895,32 +907,38 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 	if (order && !(si->flags & SWP_BLKDEV))
 		return 0;
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+		local_lock(&si->percpu_cluster->lock);
+		offset = __this_cpu_read(si->percpu_cluster->next[order]);
+#endif
+	} else {
 		/* Serialize HDD SWAP allocation for each device. */
 		spin_lock(&si->global_cluster_lock);
 		offset = si->global_cluster->next[order];
-		if (offset == SWAP_ENTRY_INVALID)
-			goto new_cluster;
+	}
 
-		ci = lock_cluster(si, offset);
-		/* Cluster could have been used by another order */
-		if (cluster_is_usable(ci, order)) {
-			if (cluster_is_empty(ci))
-				offset = cluster_offset(si, ci);
-			found = alloc_swap_scan_cluster(si, ci, offset,
-							order, usage);
-		} else {
-			unlock_cluster(ci);
-		}
-		if (found)
-			goto done;
+	if (offset == SWAP_ENTRY_INVALID)
+		goto new_cluster;
+
+	ci = lock_cluster(si, offset);
+	/* Cluster could have been used by another order */
+	if (cluster_is_usable(ci, order)) {
+		if (cluster_is_empty(ci))
+			offset = cluster_offset(si, ci);
+		found = alloc_swap_scan_cluster(si, ci, offset,
+						order, usage, is_cgroup_priority);
+	} else {
+		unlock_cluster(ci);
 	}
+	if (found)
+		goto done;
 
 new_cluster:
 	ci = isolate_lock_cluster(si, &si->free_clusters);
 	if (ci) {
 		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
+						order, usage, is_cgroup_priority);
 		if (found)
 			goto done;
 	}
@@ -934,7 +952,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 
 	while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
+						order, usage, is_cgroup_priority);
 		if (found)
 			goto done;
 		/* Clusters failed to allocate are moved to frag_clusters */
@@ -952,7 +970,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 		 * reclaimable (eg. lazy-freed swap cache) slots.
 		 */
 		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
+						order, usage, is_cgroup_priority);
 		if (found)
 			goto done;
 		frags++;
@@ -979,21 +997,27 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 	while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
 		atomic_long_dec(&si->frag_cluster_nr[o]);
 		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						0, usage);
+						0, usage, is_cgroup_priority);
 		if (found)
 			goto done;
 	}
 
 	while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[o]))) {
 		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						0, usage);
+						0, usage, is_cgroup_priority);
 		if (found)
 			goto done;
 	}
 }
done:
-	if (!(si->flags & SWP_SOLIDSTATE))
+	if (si->flags & SWP_SOLIDSTATE) {
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+		local_unlock(&si->percpu_cluster->lock);
+#endif
+	} else {
 		spin_unlock(&si->global_cluster_lock);
+	}
+
 	return found;
 }
 
@@ -1032,6 +1056,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 	for_each_node(nid)
 		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
 
+	deactivate_swap_cgroup_priority_pnode(si, swapoff);
 skip:
 	spin_unlock(&swap_avail_lock);
 }
@@ -1075,6 +1100,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 	for_each_node(nid)
 		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
 
+	activate_swap_cgroup_priority_pnode(si, swapon);
 skip:
 	spin_unlock(&swap_avail_lock);
 }
@@ -1200,7 +1226,8 @@ static bool swap_alloc_fast(swp_entry_t *entry,
 	if (cluster_is_usable(ci, order)) {
 		if (cluster_is_empty(ci))
 			offset = cluster_offset(si, ci);
-		found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
+		found = alloc_swap_scan_cluster(si, ci, offset, order,
+						SWAP_HAS_CACHE, false);
 		if (found)
 			*entry = swp_entry(si->type, found);
 	} else {
@@ -1227,7 +1254,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
-			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
+			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE, false);
 			put_swap_device(si);
 			if (offset) {
 				*entry = swp_entry(si->type, offset);
@@ -1294,10 +1321,12 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 		}
 	}
 
-	local_lock(&percpu_swap_cluster.lock);
-	if (!swap_alloc_fast(&entry, order))
-		swap_alloc_slow(&entry, order);
-	local_unlock(&percpu_swap_cluster.lock);
+	if (!swap_alloc_cgroup_priority(folio_memcg(folio), &entry, order)) {
+		local_lock(&percpu_swap_cluster.lock);
+		if (!swap_alloc_fast(&entry, order))
+			swap_alloc_slow(&entry, order);
+		local_unlock(&percpu_swap_cluster.lock);
+	}
 
 	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
 	if (mem_cgroup_try_charge_swap(folio, entry))
@@ -1870,7 +1899,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	/* This is called for allocating swap entry, not cache */
 	if (get_swap_device_info(si)) {
 		if (si->flags & SWP_WRITEOK) {
-			offset = cluster_alloc_swap_entry(si, 0, 1);
+			offset = cluster_alloc_swap_entry(si, 0, 1, false);
 			if (offset) {
 				entry = swp_entry(si->type, offset);
 				atomic_long_dec(&nr_swap_pages);
@@ -2800,6 +2829,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	free_percpu(p->percpu_cluster);
+	p->percpu_cluster = NULL;
+#endif
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
 	vfree(swap_map);
@@ -3207,7 +3240,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		if (!si->percpu_cluster)
+			goto err_free;
+
+		int cpu;
+		for_each_possible_cpu(cpu) {
+			struct percpu_cluster *cluster;
+
+			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_ENTRY_INVALID;
+			local_lock_init(&cluster->lock);
+		}
+#endif
+	} else {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
					     GFP_KERNEL);
 		if (!si->global_cluster)
@@ -3495,6 +3544,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	free_percpu(si->percpu_cluster);
+	si->percpu_cluster = NULL;
+#endif
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-- 
2.34.1
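As a closing illustration of the intended effect, the toy userspace model
below (illustrative only: the device names and priority values are
invented, and qsort on negated priority stands in for the kernel's plist
ordering) shows how two cgroups with different priority assignments pick
different devices first:

        /* Toy userspace model of per-cgroup swap device ordering. */
        #include <stdio.h>
        #include <stdlib.h>

        struct dev { const char *name; int prio; };

        /* plists sort ascending on -prio, so higher priority comes first. */
        static int by_weight(const void *a, const void *b)
        {
                const struct dev *da = a, *db = b;
                return (-da->prio) - (-db->prio);
        }

        int main(void)
        {
                /* e.g. cgroup A prefers the network target, cgroup B the local disk */
                struct dev cgroup_a[] = { { "/dev/sdb", 10 }, { "net-swap", 20 } };
                struct dev cgroup_b[] = { { "/dev/sdb", 20 }, { "net-swap", 5 } };

                qsort(cgroup_a, 2, sizeof(struct dev), by_weight);
                qsort(cgroup_b, 2, sizeof(struct dev), by_weight);
                printf("cgroup A swaps to %s first\n", cgroup_a[0].name);
                printf("cgroup B swaps to %s first\n", cgroup_b[0].name);
                return 0;
        }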