From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
    Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner,
    Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 07/13] mm, swap: hold a reference during scan and cleanup flag usage
Date: Tue, 14 Jan 2025 01:57:26 +0800
Message-ID: <20250113175732.48099-8-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

The flag SWP_SCANNING was used as an indicator of whether a device is
being scanned for allocation, and it prevents swapoff. Combined with
SWP_WRITEOK, the two flags work as a set of barriers for a clean
swapoff:

1. Swapoff clears SWP_WRITEOK; allocation requests will see
   ~SWP_WRITEOK and abort, as this check is serialized by si->lock.
2. Swapoff unuses all allocated entries.
3. Swapoff waits for the SWP_SCANNING flag to be cleared, so ongoing
   allocations will stop, preventing use-after-free.
4. Now swapoff can free everything safely.

This gives the allocation path a hard dependency on si->lock:
allocations always have to acquire si->lock first, to set SWP_SCANNING
and check SWP_WRITEOK.

This commit removes the flag and instead uses the existing per-CPU
refcount to prevent use-after-free in step 3. The refcount serves this
purpose well without any dependency on si->lock, and it scales very
well too: just hold a reference during the whole scan and allocation
process, and let swapoff kill and then wait for the counter.

To prevent any allocation from happening after step 1, so that the
unuse in step 2 can ensure all slots are freed, swapoff acquires the
ci->lock of each cluster one by one, which guarantees that all
allocations see ~SWP_WRITEOK and abort. With this, the dependencies on
si->lock are gone.

Worth noting, we can't kill the refcount as the first step of swapoff,
because the unuse process itself has to acquire the refcount.

Signed-off-by: Kairui Song
---
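Note for reviewers: the si->users lifecycle relied on here is the
standard percpu_ref pattern. Below is a minimal sketch of that pattern,
not part of the patch; the dev_like type and dev_* helpers are
hypothetical stand-ins for swap_info_struct and the swap code:

#include <linux/percpu-refcount.h>
#include <linux/completion.h>
#include <linux/container_of.h>

/* Hypothetical stand-in for swap_info_struct. */
struct dev_like {
	struct percpu_ref users;	/* like si->users */
	struct completion comp;		/* like si->comp */
};

/* Release callback: runs once the last reference is dropped. */
static void dev_ref_release(struct percpu_ref *ref)
{
	struct dev_like *d = container_of(ref, struct dev_like, users);

	complete(&d->comp);
}

/* Reader/allocator side: pin the device for the whole scan. */
static bool dev_tryget(struct dev_like *d)
{
	/* Fails once the ref has been killed by teardown. */
	return percpu_ref_tryget_live(&d->users);
}

static void dev_put(struct dev_like *d)
{
	percpu_ref_put(&d->users);
}

/* Teardown (swapoff) side: stop new users, wait for existing ones. */
static void dev_drain_users(struct dev_like *d)
{
	percpu_ref_kill(&d->users);	/* new tryget_live() calls fail */
	wait_for_completion(&d->comp);	/* released by dev_ref_release() */
}

Swapoff does the equivalent of dev_drain_users(): percpu_ref_kill() on
si->users followed by wait_for_completion(). That is also why step 2
must run before the kill; the unuse path takes references of its own,
and killing the counter first would make those lookups fail.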
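Similarly, the reason a bare lock/unlock sweep over every ci->lock acts
as a barrier: cluster_alloc_range() checks SWP_WRITEOK while holding
ci->lock. A simplified sketch of the allocator-side check this relies
on (condensed from cluster_alloc_range() below, not verbatim):

/* Allocator side, simplified; runs with ci->lock held. */
static bool cluster_alloc_range_sketch(struct swap_info_struct *si,
				       struct swap_cluster_info *ci)
{
	lockdep_assert_held(&ci->lock);

	/*
	 * Swapoff clears SWP_WRITEOK before sweeping the cluster
	 * locks, so any allocation reaching this point after the
	 * sweep has passed this cluster observes the cleared flag
	 * and aborts.
	 */
	if (!(si->flags & SWP_WRITEOK))
		return false;

	/* ... commit the allocated range under the same lock ... */
	return true;
}

Once the sweep has cycled a cluster's lock, any allocation that entered
before the flag was cleared has already completed under that lock, so
after the full sweep no allocation can still be in flight.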
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 89 +++++++++++++++++++++++++++++----------------
 2 files changed, 56 insertions(+), 34 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e1eeea6307cd..02120f1005d5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -219,7 +219,6 @@ enum {
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 91faf2073006..3898576f947a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -658,6 +658,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 {
 	unsigned int nr_pages = 1 << order;
 
+	lockdep_assert_held(&ci->lock);
+
 	if (!(si->flags & SWP_WRITEOK))
 		return false;
 
@@ -1059,8 +1061,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 {
 	int n_ret = 0;
 
-	si->flags += SWP_SCANNING;
-
 	while (n_ret < nr) {
 		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
 
@@ -1069,8 +1069,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 			slots[n_ret++] = swp_entry(si->type, offset);
 	}
 
-	si->flags -= SWP_SCANNING;
-
 	return n_ret;
 }
 
@@ -1112,6 +1110,22 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return cluster_alloc_swap(si, usage, nr, slots, order);
 }
 
+static bool get_swap_device_info(struct swap_info_struct *si)
+{
+	if (!percpu_ref_tryget_live(&si->users))
+		return false;
+	/*
+	 * Guarantee the si->users are checked before accessing other
+	 * fields of swap_info_struct, and si->flags (SWP_WRITEOK) is
+	 * up to date.
+	 *
+	 * Paired with the spin_unlock() after setup_swap_info() in
+	 * enable_swap_info(), and smp_wmb() in swapoff.
+	 */
+	smp_rmb();
+	return true;
+}
+
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
 	int order = swap_entry_order(entry_order);
@@ -1139,13 +1153,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		/* requeue si to after same-priority siblings */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
-		spin_lock(&si->lock);
-		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-				n_goal, swp_entries, order);
-		spin_unlock(&si->lock);
-		if (n_ret || size > 1)
-			goto check_out;
-		cond_resched();
+		if (get_swap_device_info(si)) {
+			spin_lock(&si->lock);
+			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+					n_goal, swp_entries, order);
+			spin_unlock(&si->lock);
+			put_swap_device(si);
+			if (n_ret || size > 1)
+				goto check_out;
+			cond_resched();
+		}
 
 		spin_lock(&swap_avail_lock);
 		/*
@@ -1296,16 +1313,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	si = swp_swap_info(entry);
 	if (!si)
 		goto bad_nofile;
-	if (!percpu_ref_tryget_live(&si->users))
+	if (!get_swap_device_info(si))
 		goto out;
-	/*
-	 * Guarantee the si->users are checked before accessing other
-	 * fields of swap_info_struct.
-	 *
-	 * Paired with the spin_unlock() after setup_swap_info() in
-	 * enable_swap_info().
-	 */
-	smp_rmb();
 	offset = swp_offset(entry);
 	if (offset >= si->max)
 		goto put_out;
@@ -1785,10 +1794,13 @@ swp_entry_t get_swap_page_of_type(int type)
 		goto fail;
 
 	/* This is called for allocating swap entry, not cache */
-	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-		atomic_long_dec(&nr_swap_pages);
-	spin_unlock(&si->lock);
+	if (get_swap_device_info(si)) {
+		spin_lock(&si->lock);
+		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
+			atomic_long_dec(&nr_swap_pages);
+		spin_unlock(&si->lock);
+		put_swap_device(si);
+	}
 fail:
 	return entry;
 }
@@ -2562,6 +2574,24 @@ bool has_usable_swap(void)
 	return ret;
 }
 
+/*
+ * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
+ * sees the updated flags, so there will be no more allocations.
+ */
+static void wait_for_allocation(struct swap_info_struct *si)
+{
+	unsigned long offset;
+	unsigned long end = ALIGN(si->max, SWAPFILE_CLUSTER);
+	struct swap_cluster_info *ci;
+
+	BUG_ON(si->flags & SWP_WRITEOK);
+
+	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
+		ci = lock_cluster(si, offset);
+		unlock_cluster(ci);
+	}
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
@@ -2632,6 +2663,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 
+	wait_for_allocation(p);
+
 	disable_swap_slots_cache_lock();
 
 	set_current_oom_origin();
@@ -2674,15 +2707,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&p->lock);
 	drain_mmlist();
 
-	/* wait for anyone still in scan_swap_map_slots */
-	while (p->flags >= SWP_SCANNING) {
-		spin_unlock(&p->lock);
-		spin_unlock(&swap_lock);
-		schedule_timeout_uninterruptible(1);
-		spin_lock(&swap_lock);
-		spin_lock(&p->lock);
-	}
-
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
 	p->max = 0;
-- 
2.47.1