From nobody Mon Feb  9 05:58:33 2026
From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Chris Li <chrisl@kernel.org>,
	Barry Song <v-songbaohua@oppo.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Hugh Dickins <hughd@google.com>,
	Yosry Ahmed <yosryahmed@google.com>,
	"Huang, Ying" <ying.huang@linux.alibaba.com>,
	Nhat Pham <nphamcs@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Kalesh Singh <kaleshsingh@google.com>,
	linux-kernel@vger.kernel.org,
	Kairui Song <kasong@tencent.com>
Subject: [PATCH v2 07/13] mm,
 swap: hold a reference during scan and cleanup flag usage
Date: Tue, 24 Dec 2024 22:38:05 +0800
Message-ID: <20241224143811.33462-8-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20241224143811.33462-1-ryncsn@gmail.com>
References: <20241224143811.33462-1-ryncsn@gmail.com>
Reply-To: Kairui Song <kasong@tencent.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Kairui Song <kasong@tencent.com>

The SWP_SCANNING flag is used to indicate that a device is being
scanned for allocation, and it blocks swapoff while set. Together with
SWP_WRITEOK, it works as a set of barriers for a clean swapoff:

1. Swapoff clears SWP_WRITEOK; allocation requests will see
   ~SWP_WRITEOK and abort, as this is serialized by si->lock.
2. Swapoff unuses all allocated entries.
3. Swapoff waits for the SWP_SCANNING flag to be cleared, so ongoing
   allocations have stopped, preventing use-after-free (UAF).
4. Now swapoff can free everything safely.

This gives the allocation path a hard dependency on si->lock:
allocation always has to acquire si->lock first to set SWP_SCANNING
and check SWP_WRITEOK.

This commit removes the flag and uses the existing per-CPU refcount
instead to prevent UAF in step 3. The refcount serves this purpose
well without depending on si->lock, and it scales well too. A
reference is simply held during the whole scan and allocation process;
swapoff kills the counter and waits for it to drain.
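
As a rough userspace sketch of this scheme (not the kernel code): the
allocator takes a reference before scanning and drops it afterwards,
while teardown marks the counter dead and waits for it to drain. The
struct dev_ref type and its helpers are hypothetical stand-ins for the
per-CPU refcount, built on C11 atomics for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU refcount; not the kernel API. */
struct dev_ref {
	atomic_long count;	/* starts at 1: the owner's base reference */
	atomic_bool dead;	/* set once teardown has begun */
};

static void dev_ref_init(struct dev_ref *r)
{
	atomic_init(&r->count, 1);
	atomic_init(&r->dead, false);
}

/* Allocation side: take a reference only while the device is live. */
static bool dev_ref_tryget_live(struct dev_ref *r)
{
	if (atomic_load(&r->dead))
		return false;
	atomic_fetch_add(&r->count, 1);
	if (atomic_load(&r->dead)) {
		/* Lost the race against the kill: undo and fail. */
		atomic_fetch_sub(&r->count, 1);
		return false;
	}
	return true;
}

/* Allocation side: drop the reference taken above. */
static void dev_ref_put(struct dev_ref *r)
{
	atomic_fetch_sub(&r->count, 1);
}

/* Teardown side: refuse new users and drop the base reference. */
static void dev_ref_kill(struct dev_ref *r)
{
	atomic_store(&r->dead, true);
	atomic_fetch_sub(&r->count, 1);
}

/* True once every user that got in before the kill has left. */
static bool dev_ref_drained(struct dev_ref *r)
{
	return atomic_load(&r->count) == 0;
}
```

In this sketch an allocator would bracket its scan with
dev_ref_tryget_live() and dev_ref_put(), and teardown would call
dev_ref_kill() and then wait until dev_ref_drained() returns true.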

To prevent any allocation from happening after step 1, so that the
unuse in step 2 can ensure all slots are free, swapoff acquires the
ci->lock of each cluster one by one, ensuring that all allocations
see ~SWP_WRITEOK and abort.
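
The idea can be illustrated with a small userspace sketch (again, not
the kernel code). It assumes the allocator checks the flag only while
holding the cluster lock, so once teardown has cleared the flag and
passed through every lock, no allocator can still be acting on the old
value. struct device, struct cluster, NR_CLUSTERS and the helpers are
hypothetical names, and the spinlock is a toy built on atomic_flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CLUSTERS 4	/* hypothetical size, for illustration only */

struct cluster {
	atomic_flag lock;	/* toy spinlock standing in for ci->lock */
	int used;		/* slots handed out from this cluster */
};

struct device {
	atomic_bool writeok;	/* stands in for SWP_WRITEOK */
	struct cluster clusters[NR_CLUSTERS];
};

static void cluster_lock(struct cluster *c)
{
	while (atomic_flag_test_and_set(&c->lock))
		;	/* spin until the holder releases the lock */
}

static void cluster_unlock(struct cluster *c)
{
	atomic_flag_clear(&c->lock);
}

static void device_init(struct device *d)
{
	atomic_init(&d->writeok, true);
	for (int i = 0; i < NR_CLUSTERS; i++) {
		atomic_flag_clear(&d->clusters[i].lock);
		d->clusters[i].used = 0;
	}
}

/* Allocation: the flag is checked only under the cluster lock. */
static bool cluster_alloc(struct device *d, int idx)
{
	struct cluster *c = &d->clusters[idx];
	bool ok = false;

	cluster_lock(c);
	if (atomic_load(&d->writeok)) {
		c->used++;
		ok = true;
	}
	cluster_unlock(c);
	return ok;
}

/* Teardown: clear the flag, then pass through every lock once. */
static void stop_allocation(struct device *d)
{
	atomic_store(&d->writeok, false);
	for (int i = 0; i < NR_CLUSTERS; i++) {
		cluster_lock(&d->clusters[i]);
		cluster_unlock(&d->clusters[i]);
	}
}
```

After stop_allocation() returns in this sketch, every later
cluster_alloc() call fails, and any call that was already inside a
cluster lock has finished.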

This way these dependencies on si->lock are gone. Note that swapoff
cannot kill the refcount as its first step, because the unuse process
has to acquire the refcount.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 90 ++++++++++++++++++++++++++++----------------
 2 files changed, 57 insertions(+), 34 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e1eeea6307cd..02120f1005d5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -219,7 +219,6 @@ enum {
 	SWP_STABLE_WRITES =3D (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO =3D (1 << 12),	/* synchronous IO is efficient */
 					/* add others here before... */
-	SWP_SCANNING	=3D (1 << 14),	/* refcount in scan_swap_map */
 };
=20
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ae0f7df06474..0abff343f5f0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -658,6 +658,8 @@ static bool cluster_alloc_range(struct swap_info_struct=
 *si, struct swap_cluster
 {
 	unsigned int nr_pages =3D 1 << order;
=20
+	lockdep_assert_held(&ci->lock);
+
 	if (!(si->flags & SWP_WRITEOK))
 		return false;
=20
@@ -1055,8 +1057,6 @@ static int cluster_alloc_swap(struct swap_info_struct=
 *si,
 {
 	int n_ret =3D 0;
=20
-	si->flags +=3D SWP_SCANNING;
-
 	while (n_ret < nr) {
 		unsigned long offset =3D cluster_alloc_swap_entry(si, order, usage);
=20
@@ -1065,8 +1065,6 @@ static int cluster_alloc_swap(struct swap_info_struct=
 *si,
 		slots[n_ret++] =3D swp_entry(si->type, offset);
 	}
=20
-	si->flags -=3D SWP_SCANNING;
-
 	return n_ret;
 }
=20
@@ -1108,6 +1106,22 @@ static int scan_swap_map_slots(struct swap_info_stru=
ct *si,
 	return cluster_alloc_swap(si, usage, nr, slots, order);
 }
=20
+static bool get_swap_device_info(struct swap_info_struct *si)
+{
+	if (!percpu_ref_tryget_live(&si->users))
+		return false;
+	/*
+	 * Guarantee that si->users is checked before accessing other
+	 * fields of swap_info_struct, and that si->flags (SWP_WRITEOK)
+	 * is up to date.
+	 *
+	 * Paired with the spin_unlock() after setup_swap_info() in
+	 * enable_swap_info(), and smp_wmb() in swapoff.
+	 */
+	smp_rmb();
+	return true;
+}
+
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
 	int order =3D swap_entry_order(entry_order);
@@ -1135,13 +1149,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entr=
ies[], int entry_order)
 		/* requeue si to after same-priority siblings */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
-		spin_lock(&si->lock);
-		n_ret =3D scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					    n_goal, swp_entries, order);
-		spin_unlock(&si->lock);
-		if (n_ret || size > 1)
-			goto check_out;
-		cond_resched();
+		if (get_swap_device_info(si)) {
+			spin_lock(&si->lock);
+			n_ret =3D scan_swap_map_slots(si, SWAP_HAS_CACHE,
+					n_goal, swp_entries, order);
+			spin_unlock(&si->lock);
+			put_swap_device(si);
+			if (n_ret || size > 1)
+				goto check_out;
+			cond_resched();
+		}
=20
 		spin_lock(&swap_avail_lock);
 		/*
@@ -1292,16 +1309,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t=
 entry)
 	si =3D swp_swap_info(entry);
 	if (!si)
 		goto bad_nofile;
-	if (!percpu_ref_tryget_live(&si->users))
+	if (!get_swap_device_info(si))
 		goto out;
-	/*
-	 * Guarantee the si->users are checked before accessing other
-	 * fields of swap_info_struct.
-	 *
-	 * Paired with the spin_unlock() after setup_swap_info() in
-	 * enable_swap_info().
-	 */
-	smp_rmb();
 	offset =3D swp_offset(entry);
 	if (offset >=3D si->max)
 		goto put_out;
@@ -1781,10 +1790,13 @@ swp_entry_t get_swap_page_of_type(int type)
 		goto fail;
=20
 	/* This is called for allocating swap entry, not cache */
-	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-		atomic_long_dec(&nr_swap_pages);
-	spin_unlock(&si->lock);
+	if (get_swap_device_info(si)) {
+		spin_lock(&si->lock);
+		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0=
))
+			atomic_long_dec(&nr_swap_pages);
+		spin_unlock(&si->lock);
+		put_swap_device(si);
+	}
 fail:
 	return entry;
 }
@@ -2558,6 +2570,25 @@ bool has_usable_swap(void)
 	return ret;
 }
=20
+/*
+ * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
+ * sees the updated flags, so there will be no more allocations.
+ */
+static void wait_for_allocation(struct swap_info_struct *si)
+{
+	unsigned long offset;
+	unsigned long end =3D ALIGN(si->max, SWAPFILE_CLUSTER);
+	struct swap_cluster_info *ci;
+
+	BUG_ON(si->flags & SWP_WRITEOK);
+
+	/* Take each ci->lock once so racing allocators see the flag. */
+	for (offset =3D 0; offset < end; offset +=3D SWAPFILE_CLUSTER) {
+		ci =3D lock_cluster(si, offset);
+		unlock_cluster(ci);
+	}
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p =3D NULL;
@@ -2628,6 +2659,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special=
file)
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
=20
+	wait_for_allocation(p);
+
 	disable_swap_slots_cache_lock();
=20
 	set_current_oom_origin();
@@ -2670,15 +2703,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specia=
lfile)
 	spin_lock(&p->lock);
 	drain_mmlist();
=20
-	/* wait for anyone still in scan_swap_map_slots */
-	while (p->flags >=3D SWP_SCANNING) {
-		spin_unlock(&p->lock);
-		spin_unlock(&swap_lock);
-		schedule_timeout_uninterruptible(1);
-		spin_lock(&swap_lock);
-		spin_lock(&p->lock);
-	}
-
 	swap_file =3D p->swap_file;
 	p->swap_file =3D NULL;
 	p->max =3D 0;
--=20
2.47.1