From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
 Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
 Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 01/15] docs/mm: add document for swap table
Date: Wed, 17 Sep 2025 00:00:46 +0800
Message-ID: <20250916160100.31545-2-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Chris Li

The swap table is the new swap cache.

Signed-off-by: Chris Li
Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 Documentation/mm/index.rst      |  1 +
 Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
 MAINTAINERS                     |  1 +
 3 files changed, 74 insertions(+)
 create mode 100644 Documentation/mm/swap-table.rst

diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index fb45acba16ac..828ad9b019b3 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -57,6 +57,7 @@ documentation, or deleted if it has served its purpose.
    page_table_check
    remap_file_pages
    split_page_table_lock
+   swap-table
    transhuge
    unevictable-lru
    vmalloced-kernel-stacks
diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
new file mode 100644
index 000000000000..acae6ceb4f7b
--- /dev/null
+++ b/Documentation/mm/swap-table.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li, Kairui Song
+
+==========
+Swap Table
+==========
+
+The swap table implements the swap cache as a per-cluster swap cache
+value array.
+
+Swap Entry
+----------
+
+A swap entry contains the information required to serve the anonymous
+page fault.
+
+A swap entry is encoded as two parts: the swap type and the swap offset.
+
+The swap type indicates which swap device to use.
+The swap offset is the offset of the swap file to read the page data
+from.
+
+Swap Cache
+----------
+
+The swap cache is a map for looking up folios using a swap entry as the
+key. The resulting value can have three possible types, depending on
+which stage the swap entry is in:
+
+1. NULL: This swap entry is not used.
+
+2. folio: A folio has been allocated and bound to this swap entry. This
+   is the transient state of swap out or swap in.
+   The folio data can be in the folio or the swap file, or both.
+
+3. shadow: The shadow contains the working set information of the
+   swapped out folio. This is the normal state for a swapped out page.
+
+Swap Table Internals
+--------------------
+
+The previous swap cache was implemented with an XArray. The XArray is a
+tree structure, so each lookup has to walk through multiple nodes. Can
+we do better?
+
+Notice that most of the time when we look up the swap cache, we are
+either in the swap in or the swap out path, so we should already have
+the swap cluster, which contains the swap entry.
+
+If we have a per-cluster array to store the swap cache values of that
+cluster, a swap cache lookup within the cluster becomes a very simple
+array lookup.
+
+We give such a per-cluster swap cache value array a name: the swap
+table.
+
+Each swap cluster contains 512 entries, so a swap table stores one
+cluster's worth of swap cache values, which is exactly one page. This is
+no coincidence, because the cluster size is determined by the huge page
+size. The swap table holds an array of pointers, and each pointer has
+the same size as a PTE, so the size of the swap table matches a page
+table page at the second-to-last level: exactly one page.
+
+With the swap table, swap cache lookups achieve great locality and
+become simpler and faster.
+
+Locking
+-------
+
+Swap table modification requires taking the cluster lock. If a folio
+is being added to or removed from the swap table, the folio must be
+locked prior to the cluster lock. After adding or removing is done, the
+folio shall be unlocked.
+
+Swap table lookup is protected by RCU and atomic reads. If the lookup
+returns a folio, the user must lock the folio before use.
diff --git a/MAINTAINERS b/MAINTAINERS
index 68d29f0220fc..3d113bfc3c82 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16225,6 +16225,7 @@ R: Barry Song
 R: Chris Li
 L: linux-mm@kvack.org
 S: Maintained
+F: Documentation/mm/swap-table.rst
F: include/linux/swap.h
F: include/linux/swapfile.h
F: include/linux/swapops.h
--
2.51.0
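The document's core idea -- split a swap entry into (type, offset), then index a per-cluster pointer array instead of walking a tree -- fits in a few lines of C. The toy below is only a sketch of that idea; every name in it is invented for illustration and none of it is kernel code:

    #include <stddef.h>

    #define DEMO_CLUSTER_SIZE 512            /* entries per cluster */

    struct demo_swp_entry {
            unsigned int type;               /* which swap device */
            unsigned long offset;            /* slot in that device */
    };

    /* One cluster owns one page worth of swap cache values. */
    struct demo_cluster {
            /* Each slot is NULL, a folio pointer, or a shadow value. */
            void *table[DEMO_CLUSTER_SIZE];
    };

    /*
     * On the swap in/out paths the cluster is already known, so the
     * swap cache lookup degenerates into a single array index -- no
     * tree walk at all.
     */
    static void *demo_swap_cache_lookup(struct demo_cluster *cluster,
                                        struct demo_swp_entry entry)
    {
            return cluster->table[entry.offset % DEMO_CLUSTER_SIZE];
    }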
From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
 Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
 Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 02/15] mm, swap: use unified helper for swap cache look up
Date: Wed, 17 Sep 2025 00:00:47 +0800
Message-ID: <20250916160100.31545-3-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>
From: Kairui Song

The swap cache lookup helper swap_cache_get_folio currently does
readahead updates as well, so callers that are not doing swapin from any
VMA or mapping are forced to fall back to the filemap helpers instead,
and have to access the swap cache space directly.

So decouple the readahead update from the swap cache lookup: move the
readahead update part into a standalone helper, let callers invoke it
when they actually do readahead, and convert all swap cache lookups to
use swap_cache_get_folio.

After this commit, there are only three special cases for accessing the
swap cache space: huge memory splitting, migration, and shmem replacing,
because they need to lock the XArray. The following commits will wrap
their accesses to the swap cache too, with special helpers.

Worth noting: dropbehind is currently not supported for anon folios, so
we will never see a dropbehind folio in the swap cache. The unified
helper can be updated later to handle that.

While at it, add proper kerneldoc for the touched helpers.

No functional change.

Signed-off-by: Kairui Song
Reviewed-by: Baolin Wang
Reviewed-by: Barry Song
Acked-by: David Hildenbrand
Acked-by: Chris Li
Acked-by: Nhat Pham
Suggested-by: Chris Li
---
 mm/memory.c      |   6 ++-
 mm/mincore.c     |   3 +-
 mm/shmem.c       |   4 +-
 mm/swap.h        |  13 ++++--
 mm/swap_state.c  | 109 +++++++++++++++++++++++++----------------------
 mm/swapfile.c    |  11 +++--
 mm/userfaultfd.c |   5 +--
 7 files changed, 81 insertions(+), 70 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d9de6c056179..10ef528a5f44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4660,9 +4660,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(!si))
 		goto out;
 
-	folio = swap_cache_get_folio(entry, vma, vmf->address);
-	if (folio)
+	folio = swap_cache_get_folio(entry);
+	if (folio) {
+		swap_update_readahead(folio, vma, vmf->address);
 		page = folio_file_page(folio, swp_offset(entry));
+	}
 	swapcache = folio;
 
 	if (!folio) {
diff --git a/mm/mincore.c b/mm/mincore.c
index 2f3e1816a30d..8ec4719370e1 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -76,8 +76,7 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
 		if (!si)
 			return 0;
 	}
-	folio = filemap_get_entry(swap_address_space(entry),
-				  swap_cache_index(entry));
+	folio = swap_cache_get_folio(entry);
 	if (shmem)
 		put_swap_device(si);
 	/* The swap cache space contains either folio, shadow or NULL */
diff --git a/mm/shmem.c b/mm/shmem.c
index 29e1eb690125..410f27bc4752 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2317,7 +2317,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL, 0);
+	folio = swap_cache_get_folio(swap);
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			/* Direct swapin skipping swap cache & readahead */
@@ -2342,6 +2342,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
+	} else {
+		swap_update_readahead(folio, NULL, 0);
 	}
 
 	if (order > folio_order(folio)) {
diff --git a/mm/swap.h b/mm/swap.h
index 1ae44d4193b1..efb6d7ff9f30 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -62,8 +62,7 @@ void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
-struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr);
+struct folio *swap_cache_get_folio(swp_entry_t entry);
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug);
@@ -74,6 +73,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
 		struct vm_fault *vmf);
+void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
+			   unsigned long addr);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -159,6 +160,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 	return NULL;
 }
 
+static inline void swap_update_readahead(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long addr)
+{
+}
+
 static inline int swap_writeout(struct folio *folio,
 		struct swap_iocb **swap_plug)
 {
@@ -169,8 +175,7 @@ static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
 {
 }
 
-static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
+static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 99513b74b5d8..68ec531d0f2b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -69,6 +69,27 @@ void show_swap_cache_info(void)
 	printk("Total swap = %lukB\n", K(total_swap_pages));
 }
 
+/**
+ * swap_cache_get_folio - Looks up a folio in the swap cache.
+ * @entry: swap entry used for the lookup.
+ *
+ * A found folio will be returned unlocked and with its refcount increased.
+ *
+ * Context: Caller must ensure @entry is valid and protect the swap device
+ * with reference count or locks.
+ * Return: Returns the found folio on success, NULL otherwise. The caller
+ * must lock and check if the folio still matches the swap entry before
+ * use.
+ */
+struct folio *swap_cache_get_folio(swp_entry_t entry)
+{
+	struct folio *folio = filemap_get_folio(swap_address_space(entry),
+						swap_cache_index(entry));
+	if (IS_ERR(folio))
+		return NULL;
+	return folio;
+}
+
 void *get_shadow_from_swap_cache(swp_entry_t entry)
 {
 	struct address_space *address_space = swap_address_space(entry);
@@ -272,55 +293,43 @@ static inline bool swap_use_vma_readahead(void)
 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
 }
 
-/*
- * Lookup a swap entry in the swap cache. A found folio will be returned
- * unlocked and with its refcount incremented - we rely on the kernel
- * lock getting page table operations atomic even if we drop the folio
- * lock before returning.
- *
- * Caller must lock the swap device or hold a reference to keep it valid.
+/**
+ * swap_update_readahead - Update the readahead statistics of VMA or globally.
+ * @folio: the swap cache folio that just got hit.
+ * @vma: the VMA that should be updated, could be NULL for global update.
+ * @addr: the addr that triggered the swapin, ignored if @vma is NULL.
  */
-struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
+void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
+			   unsigned long addr)
 {
-	struct folio *folio;
-
-	folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
-	if (!IS_ERR(folio)) {
-		bool vma_ra = swap_use_vma_readahead();
-		bool readahead;
+	bool readahead, vma_ra = swap_use_vma_readahead();
 
-		/*
-		 * At the moment, we don't support PG_readahead for anon THP
-		 * so let's bail out rather than confusing the readahead stat.
-		 */
-		if (unlikely(folio_test_large(folio)))
-			return folio;
-
-		readahead = folio_test_clear_readahead(folio);
-		if (vma && vma_ra) {
-			unsigned long ra_val;
-			int win, hits;
-
-			ra_val = GET_SWAP_RA_VAL(vma);
-			win = SWAP_RA_WIN(ra_val);
-			hits = SWAP_RA_HITS(ra_val);
-			if (readahead)
-				hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
-			atomic_long_set(&vma->swap_readahead_info,
-					SWAP_RA_VAL(addr, win, hits));
-		}
-
-		if (readahead) {
-			count_vm_event(SWAP_RA_HIT);
-			if (!vma || !vma_ra)
-				atomic_inc(&swapin_readahead_hits);
-		}
-	} else {
-		folio = NULL;
+	/*
+	 * At the moment, we don't support PG_readahead for anon THP
+	 * so let's bail out rather than confusing the readahead stat.
+	 */
+	if (unlikely(folio_test_large(folio)))
+		return;
+
+	readahead = folio_test_clear_readahead(folio);
+	if (vma && vma_ra) {
+		unsigned long ra_val;
+		int win, hits;
+
+		ra_val = GET_SWAP_RA_VAL(vma);
+		win = SWAP_RA_WIN(ra_val);
+		hits = SWAP_RA_HITS(ra_val);
+		if (readahead)
+			hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
+		atomic_long_set(&vma->swap_readahead_info,
+				SWAP_RA_VAL(addr, win, hits));
 	}
 
-	return folio;
+	if (readahead) {
+		count_vm_event(SWAP_RA_HIT);
+		if (!vma || !vma_ra)
+			atomic_inc(&swapin_readahead_hits);
+	}
 }
 
 struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
@@ -336,14 +345,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	*new_page_allocated = false;
 	for (;;) {
 		int err;
-		/*
-		 * First check the swap cache. Since this is normally
-		 * called after swap_cache_get_folio() failed, re-calling
-		 * that would confuse statistics.
-		 */
-		folio = filemap_get_folio(swap_address_space(entry),
-					  swap_cache_index(entry));
-		if (!IS_ERR(folio))
+
+		/* Check the swap cache in case the folio is already there */
+		folio = swap_cache_get_folio(entry);
+		if (folio)
 			goto got_folio;
 
 		/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7ffabbe65ef..4b8ab2cb49ca 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -213,15 +213,14 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 				 unsigned long offset, unsigned long flags)
 {
 	swp_entry_t entry = swp_entry(si->type, offset);
-	struct address_space *address_space = swap_address_space(entry);
 	struct swap_cluster_info *ci;
 	struct folio *folio;
 	int ret, nr_pages;
 	bool need_reclaim;
 
 again:
-	folio = filemap_get_folio(address_space, swap_cache_index(entry));
-	if (IS_ERR(folio))
+	folio = swap_cache_get_folio(entry);
+	if (!folio)
 		return 0;
 
 	nr_pages = folio_nr_pages(folio);
@@ -2131,7 +2130,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	pte_unmap(pte);
 	pte = NULL;
 
-	folio = swap_cache_get_folio(entry, vma, addr);
+	folio = swap_cache_get_folio(entry);
 	if (!folio) {
 		struct vm_fault vmf = {
 			.vma = vma,
@@ -2357,8 +2356,8 @@ static int try_to_unuse(unsigned int type)
 	       (i = find_next_to_unuse(si, i)) != 0) {
 
 		entry = swp_entry(type, i);
-		folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
-		if (IS_ERR(folio))
+		folio = swap_cache_get_folio(entry);
+		if (!folio)
 			continue;
 
 		/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 50aaa8dcd24c..af61b95c89e4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1489,9 +1489,8 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
 	 * separately to allow proper handling.
 	 */
 	if (!src_folio)
-		folio = filemap_get_folio(swap_address_space(entry),
-					  swap_cache_index(entry));
-	if (!IS_ERR_OR_NULL(folio)) {
+		folio = swap_cache_get_folio(entry);
+	if (folio) {
 		if (folio_test_large(folio)) {
 			ret = -EBUSY;
 			folio_put(folio);
--
2.51.0
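The new calling convention above condenses into a short sketch. The wrapper below is hypothetical (the name swapin_lookup_sketch and the simplified flow are not from the patch); it only strings together the two real helpers this patch separates, the way the do_swap_page() hunk does:

    /* Sketch only: locking, refcounting and error paths are elided. */
    static struct folio *swapin_lookup_sketch(swp_entry_t entry,
                                              struct vm_area_struct *vma,
                                              unsigned long addr)
    {
            /* Pure lookup: no readahead side effects anymore. */
            struct folio *folio = swap_cache_get_folio(entry);

            /*
             * Only swapin paths that do readahead report the hit.
             * Callers like mincore() or swapoff simply skip this call.
             */
            if (folio)
                    swap_update_readahead(folio, vma, addr);
            return folio;
    }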
From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
 Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
 Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 03/15] mm, swap: fix swap cache index error when retrying reclaim
Date: Wed, 17 Sep 2025 00:00:48 +0800
Message-ID: <20250916160100.31545-4-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

The allocator will reclaim cached slots while scanning. Currently, it
tries again if reclaim finds a folio that has already been removed from
the swap cache due to a race. But the following lookup then uses the
wrong index. It won't cause any OOB issue, since the swap cache index is
truncated upon lookup, but it may lead to reclaiming an irrelevant
folio. This should not cause a measurable issue, but we should fix it.

Fixes: fae8595505313 ("mm, swap: avoid reclaiming irrelevant swap cache")
Signed-off-by: Kairui Song
Reviewed-by: Baolin Wang
Acked-by: Nhat Pham
Acked-by: Chris Li
Acked-by: David Hildenbrand
Suggested-by: Chris Li
---
 mm/swapfile.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4b8ab2cb49ca..4baebd8b48f4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -212,7 +212,7 @@ static bool swap_is_last_map(struct swap_info_struct *si,
 static int __try_to_reclaim_swap(struct swap_info_struct *si,
 				 unsigned long offset, unsigned long flags)
 {
-	swp_entry_t entry = swp_entry(si->type, offset);
+	const swp_entry_t entry = swp_entry(si->type, offset);
 	struct swap_cluster_info *ci;
 	struct folio *folio;
 	int ret, nr_pages;
@@ -240,13 +240,13 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * Offset could point to the middle of a large folio, or folio
 	 * may no longer point to the expected offset before it's locked.
 	 */
-	entry = folio->swap;
-	if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
+	if (offset < swp_offset(folio->swap) ||
+	    offset >= swp_offset(folio->swap) + nr_pages) {
 		folio_unlock(folio);
 		folio_put(folio);
 		goto again;
 	}
-	offset = swp_offset(entry);
+	offset = swp_offset(folio->swap);
 
 	need_reclaim = ((flags & TTRS_ANYWAY) ||
 			((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
--
2.51.0
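Why a stale key is merely wrong rather than unsafe: lookups truncate the offset with a mask, so any value stays in range -- it just may name an unrelated slot. A runnable toy of that truncation follows; the names and the shift value 14 are assumptions for illustration only, not the kernel's definitions:

    #include <assert.h>

    #define DEMO_SPACE_SHIFT 14     /* assumed value, illustration only */
    #define DEMO_SPACE_MASK ((1UL << DEMO_SPACE_SHIFT) - 1)

    /* Mirrors the truncation a swap cache lookup performs on its index. */
    static unsigned long demo_cache_index(unsigned long offset)
    {
            return offset & DEMO_SPACE_MASK;    /* never out of bounds... */
    }

    int main(void)
    {
            unsigned long stale = (3UL << DEMO_SPACE_SHIFT) | 42;

            /* ...but a clobbered offset aliases an unrelated slot: */
            assert(demo_cache_index(stale) == demo_cache_index(42));
            return 0;
    }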
From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
 Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
 Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 04/15] mm, swap: check page poison flag after locking it
Date: Wed, 17 Sep 2025 00:00:49 +0800
Message-ID: <20250916160100.31545-5-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

Instead of checking the poison flag only in the fast swap cache lookup
path, always check the poison flag after locking a swap cache folio.

There are two reasons to do so. First, the folio is unstable and could
be removed from the swap cache at any time, so it's entirely possible
that the folio is no longer the backing folio of the swap entry, and is
instead an irrelevant poisoned folio; we might mistakenly kill a
faulting process. Second, it's entirely possible, even common, for the
slow swapin path (swapin_readahead) to bring in a cached folio, and that
cached folio could be poisoned, too. Checking the poison flag only in
the fast path would miss such folios.

The race window is tiny, so it's very unlikely to happen, though.

While at it, also add an unlikely() annotation.
Signed-off-by: Kairui Song
Acked-by: Chris Li
Acked-by: David Hildenbrand
Acked-by: Nhat Pham
Suggested-by: Chris Li
---
 mm/memory.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 10ef528a5f44..94a5928e8ace 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4661,10 +4661,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out;
 
 	folio = swap_cache_get_folio(entry);
-	if (folio) {
+	if (folio)
 		swap_update_readahead(folio, vma, vmf->address);
-		page = folio_file_page(folio, swp_offset(entry));
-	}
 	swapcache = folio;
 
 	if (!folio) {
@@ -4735,20 +4733,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
-		page = folio_file_page(folio, swp_offset(entry));
-	} else if (PageHWPoison(page)) {
-		/*
-		 * hwpoisoned dirty swapcache pages are kept for killing
-		 * owner processes (which may be unknown at hwpoison time)
-		 */
-		ret = VM_FAULT_HWPOISON;
-		goto out_release;
 	}
 
 	ret |= folio_lock_or_retry(folio, vmf);
 	if (ret & VM_FAULT_RETRY)
 		goto out_release;
 
+	page = folio_file_page(folio, swp_offset(entry));
 	if (swapcache) {
 		/*
 		 * Make sure folio_free_swap() or swapoff did not release the
@@ -4761,6 +4752,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		    page_swap_entry(page).val != entry.val))
 			goto out_page;
 
+		if (unlikely(PageHWPoison(page))) {
+			/*
+			 * hwpoisoned dirty swapcache pages are kept for killing
+			 * owner processes (which may be unknown at hwpoison time)
+			 */
+			ret = VM_FAULT_HWPOISON;
+			goto out_page;
+		}
+
 		/*
 		 * KSM sometimes has to copy on read faults, for example, if
 		 * folio->index of non-ksm folios would be nonlinear inside the
--
2.51.0
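The ordering this patch enforces -- stabilize the folio, verify its identity, and only then trust per-page flags -- can be modeled in userspace. The toy below is a sketch under that framing; every type and name in it is invented, and the mutex merely stands in for the folio lock:

    #include <assert.h>
    #include <pthread.h>
    #include <stdbool.h>

    struct toy_folio {
            pthread_mutex_t lock;
            unsigned long entry;    /* which swap entry it currently backs */
            bool hwpoison;
    };

    /* True only if the locked folio still backs @entry and is clean. */
    static bool safe_to_map(struct toy_folio *folio, unsigned long entry)
    {
            bool ok;

            pthread_mutex_lock(&folio->lock);
            /*
             * Identity first: an unlocked folio can be reused for another
             * entry, so a poison flag read before this point may belong
             * to a stranger.
             */
            ok = folio->entry == entry && !folio->hwpoison;
            pthread_mutex_unlock(&folio->lock);
            return ok;
    }

    int main(void)
    {
            struct toy_folio f = { PTHREAD_MUTEX_INITIALIZER, .entry = 42 };

            assert(safe_to_map(&f, 42));
            f.hwpoison = true;      /* poison is only honored under the lock */
            assert(!safe_to_map(&f, 42));
            return 0;
    }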
From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
 Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
 Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 05/15] mm, swap: always lock and check the swap cache folio before use
Date: Wed, 17 Sep 2025 00:00:50 +0800
Message-ID: <20250916160100.31545-6-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

A swap cache lookup only increases the reference count of the returned
folio. That's not enough to ensure a folio is stable in the swap cache,
so the folio could be removed from the swap cache at any time. The
caller should always lock and check the folio before using it.

We have just documented this in kerneldoc; now introduce a helper for
swap cache folio verification with proper sanity checks.

Also, sanitize a few current users to use this convention and the new
helper for easier debugging. They had no observable problems yet, only
trivial issues like wasted CPU cycles on swapoff or reclaiming. They
would fail in some other way, but it is still better to always follow
this convention to make things robust and to make later commits easier.

Signed-off-by: Kairui Song
Acked-by: David Hildenbrand
Acked-by: Chris Li
Acked-by: Nhat Pham
Reviewed-by: Barry Song
Suggested-by: Chris Li
---
 mm/memory.c     |  3 +--
 mm/swap.h       | 27 +++++++++++++++++++++++++++
 mm/swap_state.c |  7 +++++--
 mm/swapfile.c   | 10 ++++++++--
 4 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 94a5928e8ace..5808c4ef21b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4748,8 +4748,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		 * swapcache, we need to check that the page's swap has not
 		 * changed.
 		 */
-		if (unlikely(!folio_test_swapcache(folio) ||
-			     page_swap_entry(page).val != entry.val))
+		if (unlikely(!folio_matches_swap_entry(folio, entry)))
 			goto out_page;
 
 		if (unlikely(PageHWPoison(page))) {
diff --git a/mm/swap.h b/mm/swap.h
index efb6d7ff9f30..7d868f8de696 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -52,6 +52,28 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
 	return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
 }
 
+/**
+ * folio_matches_swap_entry - Check if a folio matches a given swap entry.
+ * @folio: The folio.
+ * @entry: The swap entry to check against.
+ *
+ * Context: The caller should have the folio locked to ensure it's stable
+ * and nothing will move it in or out of the swap cache.
+ * Return: true or false.
+ */
+static inline bool folio_matches_swap_entry(const struct folio *folio,
+					    swp_entry_t entry)
+{
+	swp_entry_t folio_entry = folio->swap;
+	long nr_pages = folio_nr_pages(folio);
+
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	if (!folio_test_swapcache(folio))
+		return false;
+	VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio_entry.val, nr_pages), folio);
+	return folio_entry.val == round_down(entry.val, nr_pages);
+}
+
 void show_swap_cache_info(void);
 void *get_shadow_from_swap_cache(swp_entry_t entry);
 int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
@@ -144,6 +166,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
 	return 0;
 }
 
+static inline bool folio_matches_swap_entry(const struct folio *folio, swp_entry_t entry)
+{
+	return false;
+}
+
 static inline void show_swap_cache_info(void)
 {
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 68ec531d0f2b..9225d6b695ad 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -79,7 +79,7 @@ void show_swap_cache_info(void)
  * with reference count or locks.
  * Return: Returns the found folio on success, NULL otherwise. The caller
  * must lock and check if the folio still matches the swap entry before
- * use.
+ * use (e.g. with folio_matches_swap_entry).
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
@@ -346,7 +346,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	for (;;) {
 		int err;
 
-		/* Check the swap cache in case the folio is already there */
+		/*
+		 * Check the swap cache first, if a cached folio is found,
+		 * return it unlocked. The caller will lock and check it.
+		 */
 		folio = swap_cache_get_folio(entry);
 		if (folio)
 			goto got_folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4baebd8b48f4..c3c3364cb42e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -240,8 +240,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * Offset could point to the middle of a large folio, or folio
 	 * may no longer point to the expected offset before it's locked.
 	 */
-	if (offset < swp_offset(folio->swap) ||
-	    offset >= swp_offset(folio->swap) + nr_pages) {
+	if (!folio_matches_swap_entry(folio, entry)) {
 		folio_unlock(folio);
 		folio_put(folio);
 		goto again;
@@ -2004,6 +2003,13 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	bool hwpoisoned = false;
 	int ret = 1;
 
+	/*
+	 * If the folio is removed from swap cache by others, continue to
+	 * unuse other PTEs. try_to_unuse may try again if we missed this one.
+	 */
+	if (!folio_matches_swap_entry(folio, entry))
+		return 0;
+
 	swapcache = folio;
 	folio = ksm_might_need_to_copy(folio, vma, addr);
 	if (unlikely(!folio))
--
2.51.0
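Put together, the convention this patch documents reads like the following kernel-style sketch. The wrapper name swap_cache_get_locked is hypothetical; the body only uses helpers that appear in the hunks above, and it is a sketch rather than a standalone compilable unit:

    static struct folio *swap_cache_get_locked(swp_entry_t entry)
    {
            struct folio *folio = swap_cache_get_folio(entry);

            if (!folio)
                    return NULL;            /* cache miss */
            folio_lock(folio);
            if (!folio_matches_swap_entry(folio, entry)) {
                    /* Raced with removal or reuse: treat as a miss. */
                    folio_unlock(folio);
                    folio_put(folio);
                    return NULL;
            }
            /* The folio is stable and backs @entry until unlocked. */
            return folio;
    }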
From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
 Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
 Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 06/15] mm, swap: rename and move some swap cluster definition and helpers
Date: Wed, 17 Sep 2025 00:00:51 +0800
Message-ID: <20250916160100.31545-7-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

No feature change. Move the cluster-related definitions and helpers to
mm/swap.h, tidy them up, and add a "swap_" prefix to the cluster
lock/unlock helpers so they can be used outside of the swap files.
While at it, add kerneldoc.
Signed-off-by: Kairui Song
Reviewed-by: Baolin Wang
Reviewed-by: Barry Song
Acked-by: Chris Li
Acked-by: David Hildenbrand
Acked-by: Nhat Pham
Suggested-by: Chris Li
---
 include/linux/swap.h | 34 ----------
 mm/swap.h            | 70 ++++++++++++++++
 mm/swapfile.c        | 97 +++++++++++++------------------------
 3 files changed, 99 insertions(+), 102 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2bb20841616..78cc48a65512 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -235,40 +235,6 @@ enum {
 /* Special value in each swap_map continuation */
 #define SWAP_CONT_MAX	0x7f	/* Max count */
 
-/*
- * We use this to track usage of a cluster. A cluster is a block of swap disk
- * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
- * free clusters are organized into a list. We fetch an entry from the list to
- * get a free cluster.
- *
- * The flags field determines if a cluster is free. This is
- * protected by cluster lock.
- */
-struct swap_cluster_info {
-	spinlock_t lock;	/*
-				 * Protect swap_cluster_info fields
-				 * other than list, and swap_info_struct->swap_map
-				 * elements corresponding to the swap cluster.
-				 */
-	u16 count;
-	u8 flags;
-	u8 order;
-	struct list_head list;
-};
-
-/* All on-list cluster must have a non-zero flag. */
-enum swap_cluster_flags {
-	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
-	CLUSTER_FLAG_FREE,
-	CLUSTER_FLAG_NONFULL,
-	CLUSTER_FLAG_FRAG,
-	/* Clusters with flags above are allocatable */
-	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
-	CLUSTER_FLAG_FULL,
-	CLUSTER_FLAG_DISCARD,
-	CLUSTER_FLAG_MAX,
-};
-
 /*
  * The first page in the swap file is the swap header, which is always marked
  * bad to prevent it from being allocated as an entry. This also prevents the
diff --git a/mm/swap.h b/mm/swap.h
index 7d868f8de696..138b5197c35e 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -7,10 +7,80 @@ struct swap_iocb;
 
 extern int page_cluster;
 
+#ifdef CONFIG_THP_SWAP
+#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
+#define swap_entry_order(order)	(order)
+#else
+#define SWAPFILE_CLUSTER	256
+#define swap_entry_order(order)	0
+#endif
+
+/*
+ * We use this to track usage of a cluster. A cluster is a block of swap disk
+ * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
+ * free clusters are organized into a list. We fetch an entry from the list to
+ * get a free cluster.
+ *
+ * The flags field determines if a cluster is free. This is
+ * protected by cluster lock.
+ */
+struct swap_cluster_info {
+	spinlock_t lock;	/*
+				 * Protect swap_cluster_info fields
+				 * other than list, and swap_info_struct->swap_map
+				 * elements corresponding to the swap cluster.
+				 */
+	u16 count;
+	u8 flags;
+	u8 order;
+	struct list_head list;
+};
+
+/* All on-list cluster must have a non-zero flag. */
+enum swap_cluster_flags {
+	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
+	CLUSTER_FLAG_FREE,
+	CLUSTER_FLAG_NONFULL,
+	CLUSTER_FLAG_FRAG,
+	/* Clusters with flags above are allocatable */
+	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
+	CLUSTER_FLAG_FULL,
+	CLUSTER_FLAG_DISCARD,
+	CLUSTER_FLAG_MAX,
+};
+
 #ifdef CONFIG_SWAP
 #include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
+static inline struct swap_cluster_info *swp_offset_cluster(
+		struct swap_info_struct *si, pgoff_t offset)
+{
+	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+}
+
+/**
+ * swap_cluster_lock - Lock and return the swap cluster of given offset.
+ * @si: swap device the cluster belongs to.
+ * @offset: the swap entry offset, pointing to a valid slot.
+ *
+ * Context: The caller must ensure the offset is in the valid range and
+ * protect the swap device with reference count or locks.
+ */
+static inline struct swap_cluster_info *swap_cluster_lock(
+		struct swap_info_struct *si, unsigned long offset)
+{
+	struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
+
+	spin_lock(&ci->lock);
+	return ci;
+}
+
+static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
+{
+	spin_unlock(&ci->lock);
+}
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;

diff --git a/mm/swapfile.c b/mm/swapfile.c
index c3c3364cb42e..700e07cb1cbd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,9 +58,6 @@ static void swap_entries_free(struct swap_info_struct *si,
 static void swap_range_alloc(struct swap_info_struct *si,
			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
-static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
-		unsigned long offset);
-static inline void unlock_cluster(struct swap_cluster_info *ci);

 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -258,9 +255,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
	 * swap_map is HAS_CACHE only, which means the slots have no page table
	 * reference or pending writeback, and can't be allocated to others.
	 */
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
	need_reclaim = swap_only_has_cache(si, offset, nr_pages);
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
	if (!need_reclaim)
		goto out_unlock;

@@ -385,19 +382,6 @@ static void discard_swap_cluster(struct swap_info_struct *si,
	}
 }

-#ifdef CONFIG_THP_SWAP
-#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
-
-#define swap_entry_order(order)	(order)
-#else
-#define SWAPFILE_CLUSTER	256
-
-/*
- * Define swap_entry_order() as constant to let compiler to optimize
- * out some code if !CONFIG_THP_SWAP
- */
-#define swap_entry_order(order)	0
-#endif
 #define LATENCY_LIMIT		256

 static inline bool cluster_is_empty(struct swap_cluster_info *info)
@@ -425,34 +409,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
	return ci - si->cluster_info;
 }

-static inline struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
-		unsigned long offset)
-{
-	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
-}
-
 static inline unsigned int cluster_offset(struct swap_info_struct *si,
					  struct swap_cluster_info *ci)
 {
	return cluster_index(si, ci) * SWAPFILE_CLUSTER;
 }

-static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
-		unsigned long offset)
-{
-	struct swap_cluster_info *ci;
-
-	ci = offset_to_cluster(si, offset);
-	spin_lock(&ci->lock);
-
-	return ci;
-}
-
-static inline void unlock_cluster(struct swap_cluster_info *ci)
-{
-	spin_unlock(&ci->lock);
-}
-
 static void move_cluster(struct swap_info_struct *si,
			 struct swap_cluster_info *ci, struct list_head *list,
			 enum swap_cluster_flags new_flags)
@@ -808,7 +770,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
	}
 out:
	relocate_cluster(si, ci);
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
	if (si->flags & SWP_SOLIDSTATE) {
		this_cpu_write(percpu_swap_cluster.offset[order], next);
		this_cpu_write(percpu_swap_cluster.si[order], si);
@@ -875,7 +837,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
		if (ci->flags == CLUSTER_FLAG_NONE)
			relocate_cluster(si, ci);

-		unlock_cluster(ci);
+		swap_cluster_unlock(ci);
		if (to_scan <= 0)
			break;
	}
@@ -914,7 +876,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
	if (offset == SWAP_ENTRY_INVALID)
		goto new_cluster;

-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
	/* Cluster could have been used by another order */
	if (cluster_is_usable(ci, order)) {
		if (cluster_is_empty(ci))
@@ -922,7 +884,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
		found = alloc_swap_scan_cluster(si, ci, offset,
						order, usage);
	} else {
-		unlock_cluster(ci);
+		swap_cluster_unlock(ci);
	}
	if (found)
		goto done;
@@ -1203,7 +1165,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
	if (!si || !offset || !get_swap_device_info(si))
		return false;

-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
	if (cluster_is_usable(ci, order)) {
		if (cluster_is_empty(ci))
			offset = cluster_offset(si, ci);
@@ -1211,7 +1173,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
		if (found)
			*entry = swp_entry(si->type, found);
	} else {
-		unlock_cluster(ci);
+		swap_cluster_unlock(ci);
	}

	put_swap_device(si);
@@ -1479,14 +1441,14 @@ static void swap_entries_put_cache(struct swap_info_struct *si,
	unsigned long offset = swp_offset(entry);
	struct swap_cluster_info *ci;

-	ci = lock_cluster(si, offset);
-	if (swap_only_has_cache(si, offset, nr))
+	ci = swap_cluster_lock(si, offset);
+	if (swap_only_has_cache(si, offset, nr)) {
		swap_entries_free(si, ci, entry, nr);
-	else {
+	} else {
		for (int i = 0; i < nr; i++, entry.val++)
			swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
	}
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 }

 static bool swap_entries_put_map(struct swap_info_struct *si,
@@ -1504,7 +1466,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
	if (count != 1 && count != SWAP_MAP_SHMEM)
		goto fallback;

-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
	if (!swap_is_last_map(si, offset, nr, &has_cache)) {
		goto locked_fallback;
	}
@@ -1513,21 +1475,20 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
	else
		for (i = 0; i < nr; i++)
			WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);

	return has_cache;

 fallback:
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 locked_fallback:
	for (i = 0; i < nr; i++, entry.val++) {
		count = swap_entry_put_locked(si, ci, entry, 1);
		if (count == SWAP_HAS_CACHE)
			has_cache = true;
	}
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
	return has_cache;
-
 }

 /*
@@ -1577,7 +1538,7 @@ static void swap_entries_free(struct swap_info_struct *si,
	unsigned char *map_end = map + nr_pages;

	/* It should never free entries across different clusters */
-	VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
+	VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
	VM_BUG_ON(cluster_is_empty(ci));
	VM_BUG_ON(ci->count < nr_pages);

@@ -1652,9 +1613,9 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
	struct swap_cluster_info *ci;
	int count;

-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
	count = swap_count(si->swap_map[offset]);
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
	return !!count;
 }

@@ -1677,7 +1638,7 @@ int swp_swapcount(swp_entry_t entry)

	offset = swp_offset(entry);

-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);

	count = swap_count(si->swap_map[offset]);
	if (!(count & COUNT_CONTINUED))
@@ -1700,7 +1661,7 @@ int swp_swapcount(swp_entry_t entry)
		n *= (SWAP_CONT_MAX + 1);
	} while (tmp_count & COUNT_CONTINUED);
 out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
	return count;
 }

@@ -1715,7 +1676,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
	int i;
	bool ret = false;

-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
	if (nr_pages == 1) {
		if (swap_count(map[roffset]))
			ret = true;
@@ -1728,7 +1689,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
		}
	}
 unlock_out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
	return ret;
 }

@@ -2662,8 +2623,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
	BUG_ON(si->flags & SWP_WRITEOK);

	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
-		ci = lock_cluster(si, offset);
-		unlock_cluster(ci);
+		ci = swap_cluster_lock(si, offset);
+		swap_cluster_unlock(ci);
	}
 }

@@ -3579,7 +3540,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
	offset = swp_offset(entry);
	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
	VM_WARN_ON(usage == 1 && nr > 1);
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);

	err = 0;
	for (i = 0; i < nr; i++) {
@@ -3634,7 +3595,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
	}

 unlock_out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
	return err;
 }

@@ -3733,7 +3694,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)

	offset = swp_offset(entry);

-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);

	count = swap_count(si->swap_map[offset]);

@@ -3793,7 +3754,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 out_unlock_cont:
	spin_unlock(&si->cont_lock);
 out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
	put_swap_device(si);
 outer:
	if (page)
--
2.51.0

From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan, linux-kernel@vger.kernel.org
Subject: [PATCH v4 07/15] mm, swap: tidy up swap device and cluster info helpers
Date: Wed, 17 Sep 2025 00:00:52 +0800
Message-ID: <20250916160100.31545-8-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

swp_swap_info() is the most commonly used helper for retrieving the swap
info. It has an internal check that may return NULL, yet almost none of
its callers check the return value, making the internal check pointless.
In fact, most of these callers have already ensured the entry is valid
and never expect a NULL value.

Tidy this up and improve the function names. If the caller can make sure
the swap entry/type is valid and the device is pinned, use the newly
introduced __swap_entry_to_info()/__swap_type_to_info() instead. They
have more debug sanity checks and lower overhead as they are inlined.
Callers that may expect a NULL value should use
swap_entry_to_info()/swap_type_to_info() instead.

No feature change. The rearranged code should have no functional effect;
otherwise, the callers would already be hitting NULL-dereference bugs.
Only some new sanity checks are added, so potential issues may show up
in debug builds.

The new helpers will be used frequently with the swap table later when
working with swap cache folios. A locked swap cache folio ensures the
entries are valid and stable, so these helpers are very helpful there.
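As a rough illustration of the intended split between the two sets of
helpers, a hypothetical sketch (the checked variants are static to
mm/swapfile.c, so code like this would live there; the function name is
made up for this example):

/*
 * Hypothetical example only: the plain variants are for callers that
 * must handle an invalid entry, while the double-underscore variants
 * are for callers that have already validated the entry (e.g. via a
 * locked swap cache folio or a held PTL) and never expect NULL.
 */
static bool swap_entry_backed_by_device(swp_entry_t entry)
{
	struct swap_info_struct *si;

	si = swap_entry_to_info(entry);	/* checked: may return NULL */
	if (!si)
		return false;
	/*
	 * From here on the entry is known valid, so further lookups can
	 * use __swap_entry_to_info(entry), which skips the NULL check and
	 * relies on debug-only sanity checks instead.
	 */
	return true;
}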
Signed-off-by: Kairui Song
Acked-by: Chris Li
Reviewed-by: Barry Song
Acked-by: David Hildenbrand
Suggested-by: Chris Li
---
 include/linux/swap.h |  6 ------
 mm/page_io.c         | 12 ++++++------
 mm/swap.h            | 38 +++++++++++++++++++++++++++++++++-----
 mm/swap_state.c      |  4 ++--
 mm/swapfile.c        | 37 +++++++++++++++++++------------------
 5 files changed, 60 insertions(+), 37 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 78cc48a65512..762f8db0e811 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -479,7 +479,6 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern int __swap_count(swp_entry_t entry);
 extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
-struct swap_info_struct *swp_swap_info(swp_entry_t entry);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
@@ -492,11 +491,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
 }

 #else /* CONFIG_SWAP */
-static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-{
-	return NULL;
-}
-
 static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
 {
	return NULL;

diff --git a/mm/page_io.c b/mm/page_io.c
index a2056a5ecb13..3c342db77ce3 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -204,7 +204,7 @@ static bool is_folio_zero_filled(struct folio *folio)
 static void swap_zeromap_folio_set(struct folio *folio)
 {
	struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
	int nr_pages = folio_nr_pages(folio);
	swp_entry_t entry;
	unsigned int i;
@@ -223,7 +223,7 @@ static void swap_zeromap_folio_set(struct folio *folio)

 static void swap_zeromap_folio_clear(struct folio *folio)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
	swp_entry_t entry;
	unsigned int i;

@@ -374,7 +374,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
 static void swap_writepage_fs(struct folio *folio, struct swap_iocb **swap_plug)
 {
	struct swap_iocb *sio = swap_plug ? *swap_plug : NULL;
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
	struct file *swap_file = sis->swap_file;
	loff_t pos = swap_dev_pos(folio->swap);

@@ -446,7 +446,7 @@ static void swap_writepage_bdev_async(struct folio *folio,

 void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);

	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
	/*
@@ -537,7 +537,7 @@ static bool swap_read_folio_zeromap(struct folio *folio)

 static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
	struct swap_iocb *sio = NULL;
	loff_t pos = swap_dev_pos(folio->swap);

@@ -608,7 +608,7 @@ static void swap_read_folio_bdev_async(struct folio *folio,

 void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
	bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
	bool workingset = folio_test_workingset(folio);
	unsigned long pflags;

diff --git a/mm/swap.h b/mm/swap.h
index 138b5197c35e..30b1039c27fe 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -15,6 +15,8 @@ extern int page_cluster;
 #define swap_entry_order(order)	0
 #endif

+extern struct swap_info_struct *swap_info[];
+
 /*
  * We use this to track usage of a cluster. A cluster is a block of swap disk
  * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
@@ -53,9 +55,29 @@ enum swap_cluster_flags {
 #ifdef CONFIG_SWAP
 #include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */

-static inline struct swap_cluster_info *swp_offset_cluster(
+/*
+ * Callers of all helpers below must ensure the entry, type, or offset is
+ * valid, and protect the swap device with reference count or locks.
+ */
+static inline struct swap_info_struct *__swap_type_to_info(int type)
+{
+	struct swap_info_struct *si;
+
+	si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
+	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
+	return si;
+}
+
+static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
+{
+	return __swap_type_to_info(swp_type(entry));
+}
+
+static inline struct swap_cluster_info *__swap_offset_to_cluster(
		struct swap_info_struct *si, pgoff_t offset)
 {
+	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
+	VM_WARN_ON_ONCE(offset >= si->max);
	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
 }

@@ -70,8 +92,9 @@ static inline struct swap_cluster_info *swp_offset_cluster(
 static inline struct swap_cluster_info *swap_cluster_lock(
		struct swap_info_struct *si, unsigned long offset)
 {
-	struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
+	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);

+	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
	spin_lock(&ci->lock);
	return ci;
 }
@@ -170,7 +193,7 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,

 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
-	return swp_swap_info(folio->swap)->flags;
+	return __swap_entry_to_info(folio->swap)->flags;
 }

 /*
@@ -181,7 +204,7 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
 static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
		bool *is_zeromap)
 {
-	struct swap_info_struct *sis = swp_swap_info(entry);
+	struct swap_info_struct *sis = __swap_entry_to_info(entry);
	unsigned long start = swp_offset(entry);
	unsigned long end = start + max_nr;
	bool first_bit;
@@ -200,7 +223,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,

 static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
 {
-	struct swap_info_struct *si = swp_swap_info(entry);
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
	pgoff_t offset = swp_offset(entry);
	int i;

@@ -219,6 +242,11 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)

 #else /* CONFIG_SWAP */
 struct swap_iocb;
+static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
+{
+	return NULL;
+}
+
 static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
 }

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 9225d6b695ad..0ad4f3b41f1b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -336,7 +336,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
		bool skip_if_exists)
 {
-	struct swap_info_struct *si = swp_swap_info(entry);
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
	struct folio *folio;
	struct folio *new_folio = NULL;
	struct folio *result = NULL;
@@ -560,7 +560,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
	unsigned long offset = entry_offset;
	unsigned long start_offset, end_offset;
	unsigned long mask;
-	struct swap_info_struct *si = swp_swap_info(entry);
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
	struct blk_plug plug;
	struct swap_iocb *splug = NULL;
	bool page_allocated;

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 700e07cb1cbd..6f7a8c98d14d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
 static struct plist_head *swap_avail_heads;
 static DEFINE_SPINLOCK(swap_avail_lock);

-static struct swap_info_struct *swap_info[MAX_SWAPFILES];
+struct swap_info_struct *swap_info[MAX_SWAPFILES];

 static DEFINE_MUTEX(swapon_mutex);

@@ -124,14 +124,20 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
	.lock = INIT_LOCAL_LOCK(),
 };

-static struct swap_info_struct *swap_type_to_swap_info(int type)
+/* May return NULL on invalid type, caller must check for NULL return */
+static struct swap_info_struct *swap_type_to_info(int type)
 {
	if (type >= MAX_SWAPFILES)
		return NULL;
-
	return READ_ONCE(swap_info[type]); /* rcu_dereference() */
 }

+/* May return NULL on invalid entry, caller must check for NULL return */
+static struct swap_info_struct *swap_entry_to_info(swp_entry_t entry)
+{
+	return swap_type_to_info(swp_type(entry));
+}
+
 static inline unsigned char swap_count(unsigned char ent)
 {
	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
@@ -342,7 +348,7 @@ offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset)

 sector_t swap_folio_sector(struct folio *folio)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
	struct swap_extent *se;
	sector_t sector;
	pgoff_t offset;
@@ -1300,7 +1306,7 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)

	if (!entry.val)
		goto out;
-	si = swp_swap_info(entry);
+	si = swap_entry_to_info(entry);
	if (!si)
		goto bad_nofile;
	if (data_race(!(si->flags & SWP_USED)))
@@ -1415,7 +1421,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)

	if (!entry.val)
		goto out;
-	si = swp_swap_info(entry);
+	si = swap_entry_to_info(entry);
	if (!si)
		goto bad_nofile;
	if (!get_swap_device_info(si))
@@ -1538,7 +1544,7 @@ static void swap_entries_free(struct swap_info_struct *si,
	unsigned char *map_end = map + nr_pages;

	/* It should never free entries across different clusters */
-	VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
+	VM_BUG_ON(ci != __swap_offset_to_cluster(si, offset + nr_pages - 1));
	VM_BUG_ON(cluster_is_empty(ci));
	VM_BUG_ON(ci->count < nr_pages);

@@ -1596,7 +1602,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)

 int __swap_count(swp_entry_t entry)
 {
-	struct swap_info_struct *si = swp_swap_info(entry);
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
	pgoff_t offset = swp_offset(entry);

	return swap_count(si->swap_map[offset]);
@@ -1827,7 +1833,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)

 swp_entry_t get_swap_page_of_type(int type)
 {
-	struct swap_info_struct *si = swap_type_to_swap_info(type);
+	struct swap_info_struct *si = swap_type_to_info(type);
	unsigned long offset;
	swp_entry_t entry = {0};

@@ -1908,7 +1914,7 @@ int find_first_swap(dev_t *device)
  */
 sector_t swapdev_block(int type, pgoff_t offset)
 {
-	struct swap_info_struct *si = swap_type_to_swap_info(type);
+	struct swap_info_struct *si = swap_type_to_info(type);
	struct swap_extent *se;

	if (!si || !(si->flags & SWP_WRITEOK))
@@ -2837,7 +2843,7 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
	if (!l)
		return SEQ_START_TOKEN;

-	for (type = 0; (si = swap_type_to_swap_info(type)); type++) {
+	for (type = 0; (si = swap_type_to_info(type)); type++) {
		if (!(si->flags & SWP_USED) || !si->swap_map)
			continue;
		if (!--l)
@@ -2858,7 +2864,7 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
	type = si->type + 1;

	++(*pos);
-	for (; (si = swap_type_to_swap_info(type)); type++) {
+	for (; (si = swap_type_to_info(type)); type++) {
		if (!(si->flags & SWP_USED) || !si->swap_map)
			continue;
		return si;
@@ -3531,7 +3537,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
	unsigned char has_cache;
	int err, i;

-	si = swp_swap_info(entry);
+	si = swap_entry_to_info(entry);
	if (WARN_ON_ONCE(!si)) {
		pr_err("%s%08lx\n", Bad_file, entry.val);
		return -EINVAL;
@@ -3646,11 +3652,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
	swap_entries_put_cache(si, entry, nr);
 }

-struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-{
-	return swap_type_to_swap_info(swp_type(entry));
-}
-
 /*
  * add_swap_count_continuation - called when a swap count is duplicated
  * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
--
2.51.0

From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan, linux-kernel@vger.kernel.org
Subject: [PATCH v4 08/15] mm, swap: cleanup swap cache API and add kerneldoc
Date: Wed, 17 Sep 2025 00:00:53 +0800
Message-ID: <20250916160100.31545-9-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

In preparation for replacing the swap cache backend with the swap
table, clean up and add proper kerneldoc for all swap cache APIs. Now
all swap cache APIs are well defined with consistent names. No feature
change; only renaming and documenting.
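The renamed lookup API keeps the old semantics: the lookup itself is
lockless, so the caller must lock the folio and re-check it before use.
A minimal sketch of that pattern with the new names follows; the wrapper
function is hypothetical, and it assumes the lookup returns a referenced
folio (as with filemap-style lookups) and that the caller has stabilized
the device, e.g. with get_swap_device():

/*
 * Hypothetical wrapper showing the documented usage pattern: look up,
 * lock, then verify the folio still matches the entry, since the folio
 * may have been freed or reused for another entry in the meantime.
 */
static struct folio *example_get_locked_swap_folio(swp_entry_t entry)
{
	struct folio *folio;

	folio = swap_cache_get_folio(entry);
	if (!folio)
		return NULL;
	folio_lock(folio);
	if (!folio_matches_swap_entry(folio, entry)) {
		/* Raced with removal or reuse; give up this folio. */
		folio_unlock(folio);
		folio_put(folio);
		return NULL;
	}
	return folio;
}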
Signed-off-by: Kairui Song
Acked-by: Chris Li
Reviewed-by: Barry Song
Reviewed-by: Baolin Wang
Acked-by: David Hildenbrand
Suggested-by: Chris Li
---
 mm/filemap.c        |  2 +-
 mm/memory-failure.c |  2 +-
 mm/memory.c         |  2 +-
 mm/shmem.c          | 10 +++---
 mm/swap.h           | 48 ++++++++++++++-----------
 mm/swap_state.c     | 86 ++++++++++++++++++++++++++++++++-------------
 mm/swapfile.c       |  8 ++---
 mm/vmscan.c         |  2 +-
 mm/zswap.c          |  2 +-
 9 files changed, 103 insertions(+), 59 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8d078aa2738a..2a05b1fdd445 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4525,7 +4525,7 @@ static void filemap_cachestat(struct address_space *mapping,
				 * invalidation, so there might not be
				 * a shadow in the swapcache (yet).
				 */
-				shadow = get_shadow_from_swap_cache(swp);
+				shadow = swap_cache_get_shadow(swp);
				if (!shadow)
					goto resched;
			}

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6d9134e3d115..3edebb0cda30 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1127,7 +1127,7 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
	struct folio *folio = page_folio(p);
	int ret;

-	delete_from_swap_cache(folio);
+	swap_cache_del_folio(folio);

	ret = delete_from_lru_cache(folio) ? MF_FAILED : MF_RECOVERED;
	folio_unlock(folio);

diff --git a/mm/memory.c b/mm/memory.c
index 5808c4ef21b3..41e641823558 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4699,7 +4699,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

		memcg1_swapin(entry, nr_pages);

-		shadow = get_shadow_from_swap_cache(entry);
+		shadow = swap_cache_get_shadow(entry);
		if (shadow)
			workingset_refault(folio, shadow);

diff --git a/mm/shmem.c b/mm/shmem.c
index 410f27bc4752..077744a9e9da 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1661,13 +1661,13 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
	}

	/*
-	 * The delete_from_swap_cache() below could be left for
+	 * The swap_cache_del_folio() below could be left for
	 * shrink_folio_list()'s folio_free_swap() to dispose of;
	 * but I'm a little nervous about letting this folio out of
	 * shmem_writeout() in a hybrid half-tmpfs-half-swap state
	 * e.g. folio_mapping(folio) might give an unexpected answer.
	 */
-	delete_from_swap_cache(folio);
+	swap_cache_del_folio(folio);
	goto redirty;
	}
	if (nr_pages > 1)
@@ -2045,7 +2045,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
	new->swap = entry;

	memcg1_swapin(entry, nr_pages);
-	shadow = get_shadow_from_swap_cache(entry);
+	shadow = swap_cache_get_shadow(entry);
	if (shadow)
		workingset_refault(new, shadow);
	folio_add_lru(new);
@@ -2183,7 +2183,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
	nr_pages = folio_nr_pages(folio);
	folio_wait_writeback(folio);
	if (!skip_swapcache)
-		delete_from_swap_cache(folio);
+		swap_cache_del_folio(folio);
	/*
	 * Don't treat swapin error folio as alloced.
	 * Otherwise inode->i_blocks won't be 0 when inode is released and thus
	 * trigger WARN_ON(i_blocks)
@@ -2422,7 +2422,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
		folio->swap.val = 0;
		swapcache_clear(si, swap, nr_pages);
	} else {
-		delete_from_swap_cache(folio);
+		swap_cache_del_folio(folio);
	}
	folio_mark_dirty(folio);
	swap_free_nr(swap, nr_pages);

diff --git a/mm/swap.h b/mm/swap.h
index 30b1039c27fe..6c4acb549bec 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -167,17 +167,29 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
	return folio_entry.val == round_down(entry.val, nr_pages);
 }

+/*
+ * All swap cache helpers below require the caller to ensure the swap entries
+ * used are valid and stabilize the device by any of the following ways:
+ * - Hold a reference by get_swap_device(): this ensures a single entry is
+ *   valid and increases the swap device's refcount.
+ * - Locking a folio in the swap cache: this ensures the folio's swap entries
+ *   are valid and pinned, also implies reference to the device.
+ * - Locking anything referencing the swap entry: e.g. PTL that protects
+ *   swap entries in the page table, similar to locking swap cache folio.
+ * - See the comment of get_swap_device() for more complex usage.
+ */
+struct folio *swap_cache_get_folio(swp_entry_t entry);
+void *swap_cache_get_shadow(swp_entry_t entry);
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+			 gfp_t gfp, void **shadow);
+void swap_cache_del_folio(struct folio *folio);
+void __swap_cache_del_folio(struct folio *folio,
+			    swp_entry_t entry, void *shadow);
+void swap_cache_clear_shadow(int type, unsigned long begin,
+			     unsigned long end);
+
 void show_swap_cache_info(void);
-void *get_shadow_from_swap_cache(swp_entry_t entry);
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-		      gfp_t gfp, void **shadowp);
-void __delete_from_swap_cache(struct folio *folio,
-			      swp_entry_t entry, void *shadow);
-void delete_from_swap_cache(struct folio *folio);
-void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				  unsigned long end);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
-struct folio *swap_cache_get_folio(swp_entry_t entry);
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
		struct vm_area_struct *vma, unsigned long addr,
		struct swap_iocb **plug);
@@ -305,28 +317,22 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
	return NULL;
 }

-static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+static inline void *swap_cache_get_shadow(swp_entry_t entry)
 {
	return NULL;
 }

-static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-				    gfp_t gfp_mask, void **shadowp)
-{
-	return -1;
-}
-
-static inline void __delete_from_swap_cache(struct folio *folio,
-					swp_entry_t entry, void *shadow)
+static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+				       gfp_t gfp, void **shadow)
 {
+	return -EINVAL;
 }

-static inline void delete_from_swap_cache(struct folio *folio)
+static inline void swap_cache_del_folio(struct folio *folio)
 {
 }

-static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
-						unsigned long end)
+static inline void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
 {
 }

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0ad4f3b41f1b..f3a32a06a950 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -78,8 +78,8 @@ void show_swap_cache_info(void)
  * Context: Caller must ensure @entry is valid and protect the swap device
  * with reference count or locks.
  * Return: Returns the found folio on success, NULL otherwise. The caller
- * must lock and check if the folio still matches the swap entry before
- * use (e.g. with folio_matches_swap_entry).
+ * must lock and check if the folio still matches the swap entry before
+ * use (e.g., folio_matches_swap_entry).
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
@@ -90,7 +90,15 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
	return folio;
 }

-void *get_shadow_from_swap_cache(swp_entry_t entry)
+/**
+ * swap_cache_get_shadow - Looks up a shadow in the swap cache.
+ * @entry: swap entry used for the lookup.
+ *
+ * Context: Caller must ensure @entry is valid and protect the swap device
+ * with reference count or locks.
+ * Return: Returns either NULL or an XA_VALUE (shadow).
+ */
+void *swap_cache_get_shadow(swp_entry_t entry)
 {
	struct address_space *address_space = swap_address_space(entry);
	pgoff_t idx = swap_cache_index(entry);
@@ -102,12 +110,21 @@ void *get_shadow_from_swap_cache(swp_entry_t entry)
	return NULL;
 }

-/*
- * add_to_swap_cache resembles filemap_add_folio on swapper_space,
- * but sets SwapCache flag and 'swap' instead of mapping and index.
+/**
+ * swap_cache_add_folio - Add a folio into the swap cache.
+ * @folio: The folio to be added.
+ * @entry: The swap entry corresponding to the folio.
+ * @gfp: gfp_mask for XArray node allocation.
+ * @shadowp: If a shadow is found, return the shadow.
+ *
+ * Context: Caller must ensure @entry is valid and protect the swap device
+ * with reference count or locks.
+ * The caller also needs to mark the corresponding swap_map slots with
+ * SWAP_HAS_CACHE to avoid race or conflict.
+ * Return: Returns 0 on success, error code otherwise.
  */
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-		      gfp_t gfp, void **shadowp)
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+			 gfp_t gfp, void **shadowp)
 {
	struct address_space *address_space = swap_address_space(entry);
	pgoff_t idx = swap_cache_index(entry);
@@ -155,12 +172,20 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
	return xas_error(&xas);
 }

-/*
- * This must be called only on folios that have
- * been verified to be in the swap cache.
+/**
+ * __swap_cache_del_folio - Removes a folio from the swap cache.
+ * @folio: The folio.
+ * @entry: The first swap entry that the folio corresponds to.
+ * @shadow: shadow value to be filled in the swap cache.
+ *
+ * Removes a folio from the swap cache and fills a shadow in place.
+ * This won't put the folio's refcount. The caller has to do that.
+ *
+ * Context: Caller must hold the xa_lock, ensure the folio is
+ * locked and in the swap cache, using the index of @entry.
  */
-void __delete_from_swap_cache(struct folio *folio,
-		swp_entry_t entry, void *shadow)
+void __swap_cache_del_folio(struct folio *folio,
+			    swp_entry_t entry, void *shadow)
 {
	struct address_space *address_space = swap_address_space(entry);
	int i;
@@ -186,27 +211,40 @@ void __delete_from_swap_cache(struct folio *folio,
	__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
 }

-/*
- * This must be called only on folios that have
- * been verified to be in the swap cache and locked.
- * It will never put the folio into the free list,
- * the caller has a reference on the folio.
+/**
+ * swap_cache_del_folio - Removes a folio from the swap cache.
+ * @folio: The folio.
+ *
+ * Same as __swap_cache_del_folio, but handles lock and refcount. The
+ * caller must ensure the folio is either clean or has a swap count
+ * equal to zero, or it may cause data loss.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
  */
-void delete_from_swap_cache(struct folio *folio)
+void swap_cache_del_folio(struct folio *folio)
 {
	swp_entry_t entry = folio->swap;
	struct address_space *address_space = swap_address_space(entry);

	xa_lock_irq(&address_space->i_pages);
-	__delete_from_swap_cache(folio, entry, NULL);
+	__swap_cache_del_folio(folio, entry, NULL);
	xa_unlock_irq(&address_space->i_pages);

	put_swap_folio(folio, entry);
	folio_ref_sub(folio, folio_nr_pages(folio));
 }

-void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				  unsigned long end)
+/**
+ * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
+ * @type: Indicates the swap device.
+ * @begin: Beginning offset of the range.
+ * @end: Ending offset of the range.
+ *
+ * Context: Caller must ensure the range is valid and hold a reference to
+ * the swap device.
+ */
+void swap_cache_clear_shadow(int type, unsigned long begin,
+			     unsigned long end)
 {
	unsigned long curr = begin;
	void *old;
@@ -393,7 +431,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
		goto put_and_return;

	/*
-	 * We might race against __delete_from_swap_cache(), and
+	 * We might race against __swap_cache_del_folio(), and
	 * stumble across a swap_map entry whose SWAP_HAS_CACHE
	 * has not yet been cleared. Or race against another
	 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
@@ -412,7 +450,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
		goto fail_unlock;

	/* May fail (-ENOMEM) if XArray node allocation failed. */
-	if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
+	if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
		goto fail_unlock;

	memcg1_swapin(entry, 1);

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6f7a8c98d14d..51f781c43537 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -267,7 +267,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
	if (!need_reclaim)
		goto out_unlock;

-	delete_from_swap_cache(folio);
+	swap_cache_del_folio(folio);
	folio_set_dirty(folio);
	ret = nr_pages;
 out_unlock:
@@ -1124,7 +1124,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
		swap_slot_free_notify(si->bdev, offset);
		offset++;
	}
-	clear_shadow_from_swap_cache(si->type, begin, end);
+	swap_cache_clear_shadow(si->type, begin, end);

	/*
	 * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1289,7 +1289,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
	 * TODO: this could cause a theoretical memory reclaim
	 * deadlock in the swap out path.
	 */
-	if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
+	if (swap_cache_add_folio(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
		goto out_free;

	return 0;
@@ -1759,7 +1759,7 @@ bool folio_free_swap(struct folio *folio)
	if (folio_swapped(folio))
		return false;

-	delete_from_swap_cache(folio);
+	swap_cache_del_folio(folio);
	folio_set_dirty(folio);
	return true;
 }

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ca9e1cd3cd68..c79c6806560b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -776,7 +776,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,

		if (reclaimed && !mapping_exiting(mapping))
			shadow = workingset_eviction(folio, target_memcg);
-		__delete_from_swap_cache(folio, swap, shadow);
+		__swap_cache_del_folio(folio, swap, shadow);
		memcg1_swapout(folio, swap);
		xa_unlock_irq(&mapping->i_pages);
		put_swap_folio(folio, swap);

diff --git a/mm/zswap.c b/mm/zswap.c
index 63045e3fb1f5..1b1edecde6a7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1069,7 +1069,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,

 out:
	if (ret && ret != -EEXIST) {
-		delete_from_swap_cache(folio);
+		swap_cache_del_folio(folio);
		folio_unlock(folio);
	}
	folio_put(folio);
--
2.51.0

From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan, linux-kernel@vger.kernel.org
Subject: [PATCH v4 09/15] mm/shmem, swap: remove redundant error handling for replacing folio
Date: Wed, 17 Sep 2025 00:00:54 +0800
Message-ID: <20250916160100.31545-10-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

Shmem may replace a folio in the swap cache if the cached one doesn't
fit the swapin's GFP zone. When doing so, shmem has already
double-checked that the swap cache folio is locked, still has the swap
cache flag set, and contains the wanted swap entry. So it is impossible
to fail due to an XArray mismatch; there is even a comment to that
effect.

Delete the defensive error handling path and add a WARN_ON instead: if
that ever happens, something has broken the basic principle of how the
swap cache works, and we should catch and fix it.
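The shape of the change, condensed from the diff below (fragments for
illustration only, not compilable on their own):

/* Before: a defensive branch that can no longer trigger */
void *item = xas_load(&xas);
if (item != old) {
	error = -ENOENT;
	break;
}
xas_store(&xas, new);

/* After: assert the invariant loudly instead of handling it */
WARN_ON_ONCE(xas_store(&xas, new) != old);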
Signed-off-by: Kairui Song
Reviewed-by: David Hildenbrand
Reviewed-by: Baolin Wang
Suggested-by: Chris Li
---
 mm/shmem.c | 32 +++++++-------------------------
 1 file changed, 7 insertions(+), 25 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 077744a9e9da..dc17717e5631 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2121,35 +2121,17 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
	/* Swap cache still stores N entries instead of a high-order entry */
	xa_lock_irq(&swap_mapping->i_pages);
	for (i = 0; i < nr_pages; i++) {
-		void *item = xas_load(&xas);
-
-		if (item != old) {
-			error = -ENOENT;
-			break;
-		}
-
-		xas_store(&xas, new);
+		WARN_ON_ONCE(xas_store(&xas, new) != old);
		xas_next(&xas);
	}
-	if (!error) {
-		mem_cgroup_replace_folio(old, new);
-		shmem_update_stats(new, nr_pages);
-		shmem_update_stats(old, -nr_pages);
-	}
+
+	mem_cgroup_replace_folio(old, new);
+	shmem_update_stats(new, nr_pages);
+	shmem_update_stats(old, -nr_pages);
	xa_unlock_irq(&swap_mapping->i_pages);

-	if (unlikely(error)) {
-		/*
-		 * Is this possible? I think not, now that our callers
-		 * check both the swapcache flag and folio->private
-		 * after getting the folio lock; but be defensive.
-		 * Reverse old to newpage for clear and free.
-		 */
-		old = new;
-	} else {
-		folio_add_lru(new);
-		*foliop = new;
-	}
+	folio_add_lru(new);
+	*foliop = new;

	folio_clear_swapcache(old);
	old->private = NULL;
--
2.51.0

From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
Subject: [PATCH v4 10/15] mm, swap: wrap swap cache replacement with a helper
Date: Wed, 17 Sep 2025 00:00:55 +0800
Message-ID: <20250916160100.31545-11-ryncsn@gmail.com>

From: Kairui Song

There are currently three swap cache users that are trying to replace an
existing folio with a new one: huge memory splitting, migration, and
shmem replacement. What they are doing is quite similar. Introduce a
common helper for this. In later commits, this can be easily switched to
use the swap table by updating this helper.
The newly added helper also makes the swap cache API better defined and
makes debugging easier by adding a few more debug checks.

Migration and shmem replacement are meant to clone the folio, including
content, swap entry value, and flags. Splitting will adjust each
subfolio's swap entry according to order, which could be non-uniform in
the future. So document clearly that it is the caller's responsibility
to set up the new folio's swap entries and flags before calling the
helper. The helper will just follow the new folio's entry value.

This also prepares for replacing high-order folios in the swap cache.
Currently, only splitting to order 0 is allowed for swap cache folios.
Using the new helper, we can handle high-order folio splitting better.

Signed-off-by: Kairui Song
Reviewed-by: Baolin Wang
Acked-by: David Hildenbrand
Acked-by: Chris Li
Suggested-by: Chris Li
---
 mm/huge_memory.c |  4 +---
 mm/migrate.c     | 11 +++--------
 mm/shmem.c       | 11 ++---------
 mm/swap.h        |  5 +++++
 mm/swap_state.c  | 33 +++++++++++++++++++++++++++++++++
 5 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26cedfcd7418..4c66e358685b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3798,9 +3798,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		 * NOTE: shmem in swap cache is not supported yet.
 		 */
 		if (swap_cache) {
-			__xa_store(&swap_cache->i_pages,
-				   swap_cache_index(new_folio->swap),
-				   new_folio, 0);
+			__swap_cache_replace_folio(folio, new_folio);
 			continue;
 		}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 8e435a078fc3..c69cc13db692 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -566,7 +566,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	struct zone *oldzone, *newzone;
 	int dirty;
 	long nr = folio_nr_pages(folio);
-	long entries, i;
 
 	if (!mapping) {
 		/* Take off deferred split queue while frozen and memcg set */
@@ -615,9 +614,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	if (folio_test_swapcache(folio)) {
 		folio_set_swapcache(newfolio);
 		newfolio->private = folio_get_private(folio);
-		entries = nr;
-	} else {
-		entries = 1;
 	}
 
 	/* Move dirty while folio refs frozen and newfolio not yet exposed */
@@ -627,11 +623,10 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		folio_set_dirty(newfolio);
 	}
 
-	/* Swap cache still stores N entries instead of a high-order entry */
-	for (i = 0; i < entries; i++) {
+	if (folio_test_swapcache(folio))
+		__swap_cache_replace_folio(folio, newfolio);
+	else
 		xas_store(&xas, newfolio);
-		xas_next(&xas);
-	}
 
 	/*
	 * Drop cache reference from old folio by unfreezing
diff --git a/mm/shmem.c b/mm/shmem.c
index dc17717e5631..bbfbbc1bc4d6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2086,10 +2086,8 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	struct folio *new, *old = *foliop;
 	swp_entry_t entry = old->swap;
 	struct address_space *swap_mapping = swap_address_space(entry);
-	pgoff_t swap_index = swap_cache_index(entry);
-	XA_STATE(xas, &swap_mapping->i_pages, swap_index);
 	int nr_pages = folio_nr_pages(old);
-	int error = 0, i;
+	int error = 0;
 
 	/*
	 * We have arrived here because our zones are constrained, so don't
@@ -2118,13 +2116,8 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	new->swap = entry;
 	folio_set_swapcache(new);
 
-	/* Swap cache still stores N entries instead of a high-order entry */
 	xa_lock_irq(&swap_mapping->i_pages);
-	for (i = 0; i < nr_pages; i++) {
-		WARN_ON_ONCE(xas_store(&xas, new) != old);
-		xas_next(&xas);
-	}
-
+	__swap_cache_replace_folio(old, new);
 	mem_cgroup_replace_folio(old, new);
 	shmem_update_stats(new, nr_pages);
 	shmem_update_stats(old, -nr_pages);
diff --git a/mm/swap.h b/mm/swap.h
index 6c4acb549bec..fe579c81c6c4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -185,6 +185,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
 void swap_cache_del_folio(struct folio *folio);
 void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry,
 			    void *shadow);
+void __swap_cache_replace_folio(struct folio *old, struct folio *new);
 void swap_cache_clear_shadow(int type, unsigned long begin,
 			     unsigned long end);
 
@@ -336,6 +337,10 @@ static inline void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry
 {
 }
 
+static inline void __swap_cache_replace_folio(struct folio *old, struct folio *new)
+{
+}
+
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
 	return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f3a32a06a950..d1f5b8fa52fc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -234,6 +234,39 @@ void swap_cache_del_folio(struct folio *folio)
 	folio_ref_sub(folio, folio_nr_pages(folio));
 }
 
+/**
+ * __swap_cache_replace_folio - Replace a folio in the swap cache.
+ * @old: The old folio to be replaced.
+ * @new: The new folio.
+ *
+ * Replace an existing folio in the swap cache with a new folio. The
+ * caller is responsible for setting up the new folio's flag and swap
+ * entries. Replacement will take the new folio's swap entry value as
+ * the starting offset to override all slots covered by the new folio.
+ *
+ * Context: Caller must ensure both folios are locked, also lock the
+ * swap address_space that holds the old folio to avoid races.
+ */
+void __swap_cache_replace_folio(struct folio *old, struct folio *new)
+{
+	swp_entry_t entry = new->swap;
+	unsigned long nr_pages = folio_nr_pages(new);
+	unsigned long offset = swap_cache_index(entry);
+	unsigned long end = offset + nr_pages;
+
+	XA_STATE(xas, &swap_address_space(entry)->i_pages, offset);
+
+	VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
+	VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
+	VM_WARN_ON_ONCE(!entry.val);
+
+	/* Swap cache still stores N entries instead of a high-order entry */
+	do {
+		WARN_ON_ONCE(xas_store(&xas, new) != old);
+		xas_next(&xas);
+	} while (++offset < end);
+}
+
 /**
  * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
  * @type: Indicates the swap device.
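For reference, a minimal caller-side sketch (mirroring the shmem hunk
above, with hypothetical variable names) of how the new helper is meant
to be used:

    /* Both folios locked; 'new' already has ->swap and the swapcache
     * flag set up, per the documented caller contract: */
    xa_lock_irq(&swap_mapping->i_pages);
    __swap_cache_replace_folio(old, new);
    xa_unlock_irq(&swap_mapping->i_pages);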
-- 
2.51.0
From: Kairui Song
Subject: [PATCH v4 11/15] mm, swap: use the swap table for the swap cache and switch API
Date: Wed, 17 Sep 2025 00:00:56 +0800
Message-ID: <20250916160100.31545-12-ryncsn@gmail.com>

From: Kairui Song

Introduce the basic swap table infrastructure, which for now is just a
fixed-size flat array inside each swap cluster, with access wrappers.
Each cluster contains a swap table of 512 entries. Each table entry is
an opaque atomic long. It can hold one of three types of value: a
shadow (XA_VALUE), a folio (pointer), or NULL.

In this first step, it only supports storing a folio or a shadow, and it
is a drop-in replacement for the current swap cache. Convert all swap
cache users to use the new set of APIs.

Chris Li has been suggesting using a new infrastructure for the swap
cache for better performance, and that idea combined well with the swap
table as the new backing structure. Now the lock contention range is
reduced to 2M clusters, which is much smaller than the 64M
address_space, and we can also drop the multiple address_space design.

All the internal work is done with the swap_cache_get_* helpers. Swap
cache lookup is still lock-less as before, and the helpers' contexts are
the same as for the original swap cache helpers: they still require a
pin on the swap device to prevent the backing data from being freed.

Swap cache updates are now protected by the swap cluster lock instead of
the XArray lock. This is mostly handled internally, but the new
__swap_cache_* helpers require the caller to lock the cluster, so a few
new cluster access and locking helpers are also introduced.

A fully cluster-based unified swap table can be implemented on top of
this to take care of all count tracking and synchronization work, with
dynamic allocation. It should reduce memory usage while making the
performance even better.
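A minimal sketch of how one swap table entry decodes into the three
states, using the helpers this patch adds in mm/swap_table.h ('ci' and
'ci_off' are assumed to name a locked cluster and a slot offset):

    unsigned long swp_tb = __swap_table_get(ci, ci_off);

    if (swp_tb_is_null(swp_tb)) {
        /* Slot is unused */
    } else if (swp_tb_is_shadow(swp_tb)) {
        void *shadow = swp_tb_to_shadow(swp_tb);  /* workingset shadow */
    } else {
        struct folio *folio = swp_tb_to_folio(swp_tb);  /* cached folio */
    }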
Co-developed-by: Chris Li
Signed-off-by: Chris Li
Signed-off-by: Kairui Song
Acked-by: Chris Li
Suggested-by: Chris Li
---
 MAINTAINERS          |   1 +
 include/linux/swap.h |   2 -
 mm/huge_memory.c     |  13 +-
 mm/migrate.c         |  19 ++-
 mm/shmem.c           |   8 +-
 mm/swap.h            | 154 +++++++++++++++++------
 mm/swap_state.c      | 293 +++++++++++++++++++------------------------
 mm/swap_table.h      |  97 ++++++++++++++
 mm/swapfile.c        | 100 +++++++++++----
 mm/vmscan.c          |  20 ++-
 10 files changed, 459 insertions(+), 248 deletions(-)
 create mode 100644 mm/swap_table.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 3d113bfc3c82..4c8bbf70a3c7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16232,6 +16232,7 @@ F:	include/linux/swapops.h
 F:	mm/page_io.c
 F:	mm/swap.c
 F:	mm/swap.h
+F:	mm/swap_table.h
 F:	mm/swap_state.c
 F:	mm/swapfile.c
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 762f8db0e811..e818fbade1e2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -480,8 +480,6 @@ extern int __swap_count(swp_entry_t entry);
 extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 struct backing_dev_info;
-extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
-extern void exit_swap_address_space(unsigned int type);
 extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
 sector_t swap_folio_sector(struct folio *folio);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4c66e358685b..a9fc7a09167a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3720,7 +3720,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
 	if (folio_ref_freeze(folio, 1 + extra_pins)) {
-		struct address_space *swap_cache = NULL;
+		struct swap_cluster_info *ci = NULL;
 		struct lruvec *lruvec;
 		int expected_refs;
 
@@ -3764,8 +3764,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			goto fail;
 		}
 
-		swap_cache = swap_address_space(folio->swap);
-		xa_lock(&swap_cache->i_pages);
+		ci = swap_cluster_get_and_lock(folio);
 	}
 
 	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
@@ -3797,8 +3796,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		 * Anonymous folio with swap cache.
 		 * NOTE: shmem in swap cache is not supported yet.
 		 */
-		if (swap_cache) {
-			__swap_cache_replace_folio(folio, new_folio);
+		if (ci) {
+			__swap_cache_replace_folio(ci, folio, new_folio);
 			continue;
 		}
 
@@ -3833,8 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 		unlock_page_lruvec(lruvec);
 
-		if (swap_cache)
-			xa_unlock(&swap_cache->i_pages);
+		if (ci)
+			swap_cluster_unlock(ci);
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 		ret = -EAGAIN;
diff --git a/mm/migrate.c b/mm/migrate.c
index c69cc13db692..aee61a980374 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -563,6 +563,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		struct folio *newfolio, struct folio *folio, int expected_count)
 {
 	XA_STATE(xas, &mapping->i_pages, folio_index(folio));
+	struct swap_cluster_info *ci = NULL;
 	struct zone *oldzone, *newzone;
 	int dirty;
 	long nr = folio_nr_pages(folio);
@@ -591,9 +592,16 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	oldzone = folio_zone(folio);
 	newzone = folio_zone(newfolio);
 
-	xas_lock_irq(&xas);
+	if (folio_test_swapcache(folio))
+		ci = swap_cluster_get_and_lock_irq(folio);
+	else
+		xas_lock_irq(&xas);
+
 	if (!folio_ref_freeze(folio, expected_count)) {
-		xas_unlock_irq(&xas);
+		if (ci)
+			swap_cluster_unlock_irq(ci);
+		else
+			xas_unlock_irq(&xas);
 		return -EAGAIN;
 	}
 
@@ -624,7 +632,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	}
 
 	if (folio_test_swapcache(folio))
-		__swap_cache_replace_folio(folio, newfolio);
+		__swap_cache_replace_folio(ci, folio, newfolio);
 	else
 		xas_store(&xas, newfolio);
 
@@ -635,8 +643,11 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	 */
 	folio_ref_unfreeze(folio, expected_count - nr);
 
-	xas_unlock(&xas);
 	/* Leave irq disabled to prevent preemption while updating stats */
+	if (ci)
+		swap_cluster_unlock(ci);
+	else
+		xas_unlock(&xas);
 
 	/*
	 * If moved to a different zone then also account
diff --git a/mm/shmem.c b/mm/shmem.c
index bbfbbc1bc4d6..cf0171a72e47 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2083,9 +2083,9 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 		struct shmem_inode_info *info, pgoff_t index,
 		struct vm_area_struct *vma)
 {
+	struct swap_cluster_info *ci;
 	struct folio *new, *old = *foliop;
 	swp_entry_t entry = old->swap;
-	struct address_space *swap_mapping = swap_address_space(entry);
 	int nr_pages = folio_nr_pages(old);
 	int error = 0;
 
@@ -2116,12 +2116,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	new->swap = entry;
 	folio_set_swapcache(new);
 
-	xa_lock_irq(&swap_mapping->i_pages);
-	__swap_cache_replace_folio(old, new);
+	ci = swap_cluster_get_and_lock_irq(old);
+	__swap_cache_replace_folio(ci, old, new);
 	mem_cgroup_replace_folio(old, new);
 	shmem_update_stats(new, nr_pages);
 	shmem_update_stats(old, -nr_pages);
-	xa_unlock_irq(&swap_mapping->i_pages);
+	swap_cluster_unlock_irq(ci);
 
 	folio_add_lru(new);
 	*foliop = new;
diff --git a/mm/swap.h b/mm/swap.h
index fe579c81c6c4..742db4d46d23 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -2,6 +2,7 @@
 #ifndef _MM_SWAP_H
 #define _MM_SWAP_H
 
+#include <linux/atomic.h> /* for atomic_long_t */
 struct mempolicy;
 struct swap_iocb;
 
@@ -35,6 +36,7 @@ struct swap_cluster_info {
 	u16 count;
 	u8 flags;
 	u8 order;
+	atomic_long_t *table;	/* Swap table entries, see mm/swap_table.h */
 	struct list_head list;
 };
 
@@ -55,6 +57,11 @@ enum swap_cluster_flags {
 #include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
+static inline unsigned int swp_cluster_offset(swp_entry_t entry)
+{
+	return swp_offset(entry) % SWAPFILE_CLUSTER;
+}
+
 /*
  * Callers of all helpers below must ensure the entry, type, or offset is
  * valid, and protect the swap device with reference count or locks.
@@ -81,6 +88,25 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
 	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
 }
 
+static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
+{
+	return __swap_offset_to_cluster(__swap_entry_to_info(entry),
+					swp_offset(entry));
+}
+
+static __always_inline struct swap_cluster_info *__swap_cluster_lock(
+	struct swap_info_struct *si, unsigned long offset, bool irq)
+{
+	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
+
+	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
+	if (irq)
+		spin_lock_irq(&ci->lock);
+	else
+		spin_lock(&ci->lock);
+	return ci;
+}
+
 /**
  * swap_cluster_lock - Lock and return the swap cluster of given offset.
  * @si: swap device the cluster belongs to.
@@ -92,11 +118,49 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
 static inline struct swap_cluster_info *swap_cluster_lock(
 		struct swap_info_struct *si, unsigned long offset)
 {
-	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
+	return __swap_cluster_lock(si, offset, false);
+}
 
-	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
-	spin_lock(&ci->lock);
-	return ci;
+static inline struct swap_cluster_info *__swap_cluster_get_and_lock(
+		const struct folio *folio, bool irq)
+{
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+	return __swap_cluster_lock(__swap_entry_to_info(folio->swap),
+				   swp_offset(folio->swap), irq);
+}
+
+/*
+ * swap_cluster_get_and_lock - Locks the cluster that holds a folio's entries.
+ * @folio: The folio.
+ *
+ * This locks and returns the swap cluster that contains a folio's swap
+ * entries. The swap entries of a folio are always in one single cluster.
+ * The folio has to be locked so its swap entries won't change and the
+ * cluster won't be freed.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ * Return: Pointer to the swap cluster.
+ */
+static inline struct swap_cluster_info *swap_cluster_get_and_lock(
+		const struct folio *folio)
+{
+	return __swap_cluster_get_and_lock(folio, false);
+}
+
+/*
+ * swap_cluster_get_and_lock_irq - Locks the cluster that holds a folio's entries.
+ * @folio: The folio.
+ *
+ * Same as swap_cluster_get_and_lock but also disable IRQ.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ * Return: Pointer to the swap cluster.
+ */
+static inline struct swap_cluster_info *swap_cluster_get_and_lock_irq(
+		const struct folio *folio)
+{
+	return __swap_cluster_get_and_lock(folio, true);
 }
 
 static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
@@ -104,6 +168,11 @@ static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
 	spin_unlock(&ci->lock);
 }
 
+static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
+{
+	spin_unlock_irq(&ci->lock);
+}
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
@@ -123,10 +192,11 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 #define SWAP_ADDRESS_SPACE_SHIFT	14
 #define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
 #define SWAP_ADDRESS_SPACE_MASK		(SWAP_ADDRESS_SPACE_PAGES - 1)
-extern struct address_space *swapper_spaces[];
-#define swap_address_space(entry)	    \
-	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
-		>> SWAP_ADDRESS_SPACE_SHIFT])
+extern struct address_space swap_space;
+static inline struct address_space *swap_address_space(swp_entry_t entry)
+{
+	return &swap_space;
+}
 
 /*
  * Return the swap device position of the swap entry.
@@ -136,15 +206,6 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
 	return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
 }
 
-/*
- * Return the swap cache index of the swap entry.
- */
-static inline pgoff_t swap_cache_index(swp_entry_t entry)
-{
-	BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
-	return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
-}
-
 /**
  * folio_matches_swap_entry - Check if a folio matches a given swap entry.
  * @folio: The folio.
@@ -180,14 +241,14 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
-			 gfp_t gfp, void **shadow);
+void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
 void swap_cache_del_folio(struct folio *folio);
-void __swap_cache_del_folio(struct folio *folio,
-			    swp_entry_t entry, void *shadow);
-void __swap_cache_replace_folio(struct folio *old, struct folio *new);
-void swap_cache_clear_shadow(int type, unsigned long begin,
-			     unsigned long end);
+/* Below helpers require the caller to lock and pass in the swap cluster. */
+void __swap_cache_del_folio(struct swap_cluster_info *ci,
+			    struct folio *folio, swp_entry_t entry, void *shadow);
+void __swap_cache_replace_folio(struct swap_cluster_info *ci,
+				struct folio *old, struct folio *new);
+void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
 
 void show_swap_cache_info(void);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
@@ -255,6 +316,32 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
 
 #else /* CONFIG_SWAP */
 struct swap_iocb;
+static inline struct swap_cluster_info *swap_cluster_lock(
+	struct swap_info_struct *si, pgoff_t offset, bool irq)
+{
+	return NULL;
+}
+
+static inline struct swap_cluster_info *swap_cluster_get_and_lock(
+		struct folio *folio)
+{
+	return NULL;
+}
+
+static inline struct swap_cluster_info *swap_cluster_get_and_lock_irq(
+		struct folio *folio)
+{
+	return NULL;
+}
+
+static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
+{
+}
+
+static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
+{
+}
+
 static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
 {
 	return NULL;
@@ -272,11 +359,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
 	return NULL;
 }
 
-static inline pgoff_t swap_cache_index(swp_entry_t entry)
-{
-	return 0;
-}
-
 static inline bool folio_matches_swap_entry(const struct folio *folio, swp_entry_t entry)
 {
 	return false;
@@ -323,21 +405,21 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
 	return NULL;
 }
 
-static inline int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
-					gfp_t gfp, void **shadow)
+static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
 {
-	return -EINVAL;
 }
 
 static inline void swap_cache_del_folio(struct folio *folio)
 {
 }
 
-static inline void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
+static inline void __swap_cache_del_folio(struct swap_cluster_info *ci,
+		struct folio *folio, swp_entry_t entry, void *shadow)
 {
 }
 
-static inline void __swap_cache_replace_folio(struct folio *old, struct folio *new)
+static inline void __swap_cache_replace_folio(struct swap_cluster_info *ci,
+		struct folio *old, struct folio *new)
 {
 }
 
@@ -371,8 +453,10 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
  */
 static inline pgoff_t folio_index(struct folio *folio)
 {
+#ifdef CONFIG_SWAP
 	if (unlikely(folio_test_swapcache(folio)))
-		return swap_cache_index(folio->swap);
+		return swp_offset(folio->swap);
+#endif
 	return folio->index;
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d1f5b8fa52fc..2558a648d671 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -23,6 +23,7 @@
 #include <linux/huge_mm.h>
 #include <linux/shmem_fs.h>
 #include "internal.h"
+#include "swap_table.h"
 #include "swap.h"
 
 /*
@@ -36,8 +37,10 @@ static const struct address_space_operations swap_aops = {
 #endif
 };
 
-struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
-static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
+struct address_space swap_space __read_mostly = {
+	.a_ops = &swap_aops,
+};
+
 static bool enable_vma_readahead __read_mostly = true;
 
 #define SWAP_RA_ORDER_CEILING	5
@@ -83,11 +86,20 @@ void show_swap_cache_info(void)
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
-	struct folio *folio = filemap_get_folio(swap_address_space(entry),
-						swap_cache_index(entry));
-	if (IS_ERR(folio))
-		return NULL;
-	return folio;
+	unsigned long swp_tb;
+	struct folio *folio;
+
+	for (;;) {
+		swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
+					  swp_cluster_offset(entry));
+		if (!swp_tb_is_folio(swp_tb))
+			return NULL;
+		folio = swp_tb_to_folio(swp_tb);
+		if (likely(folio_try_get(folio)))
+			return folio;
+	}
+
+	return NULL;
 }
 
 /**
@@ -100,13 +112,13 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
  */
 void *swap_cache_get_shadow(swp_entry_t entry)
 {
-	struct address_space *address_space = swap_address_space(entry);
-	pgoff_t idx = swap_cache_index(entry);
-	void *shadow;
+	unsigned long swp_tb;
+
+	swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
+				  swp_cluster_offset(entry));
+	if (swp_tb_is_shadow(swp_tb))
+		return swp_tb_to_shadow(swp_tb);
 
-	shadow = xa_load(&address_space->i_pages, idx);
-	if (xa_is_value(shadow))
-		return shadow;
 	return NULL;
 }
 
@@ -119,61 +131,48 @@ void *swap_cache_get_shadow(swp_entry_t entry)
  *
  * Context: Caller must ensure @entry is valid and protect the swap device
  * with reference count or locks.
- * The caller also needs to mark the corresponding swap_map slots with
- * SWAP_HAS_CACHE to avoid race or conflict.
- * Return: Returns 0 on success, error code otherwise.
+ * The caller also needs to update the corresponding swap_map slots with
+ * SWAP_HAS_CACHE bit to avoid race or conflict.
  */
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
-			 gfp_t gfp, void **shadowp)
+void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
 {
-	struct address_space *address_space = swap_address_space(entry);
-	pgoff_t idx = swap_cache_index(entry);
-	XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
-	unsigned long i, nr = folio_nr_pages(folio);
-	void *old;
-
-	xas_set_update(&xas, workingset_update_node);
-
-	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-	VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
-	VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
+	void *shadow = NULL;
+	unsigned long old_tb, new_tb;
+	struct swap_cluster_info *ci;
+	unsigned int ci_start, ci_off, ci_end;
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+
+	new_tb = folio_to_swp_tb(folio);
+	ci_start = swp_cluster_offset(entry);
+	ci_end = ci_start + nr_pages;
+	ci_off = ci_start;
+	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+	do {
+		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
+		WARN_ON_ONCE(swp_tb_is_folio(old_tb));
+		if (swp_tb_is_shadow(old_tb))
+			shadow = swp_tb_to_shadow(old_tb);
+	} while (++ci_off < ci_end);
 
-	folio_ref_add(folio, nr);
+	folio_ref_add(folio, nr_pages);
 	folio_set_swapcache(folio);
 	folio->swap = entry;
+	swap_cluster_unlock(ci);
 
-	do {
-		xas_lock_irq(&xas);
-		xas_create_range(&xas);
-		if (xas_error(&xas))
-			goto unlock;
-		for (i = 0; i < nr; i++) {
-			VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
-			if (shadowp) {
-				old = xas_load(&xas);
-				if (xa_is_value(old))
-					*shadowp = old;
-			}
-			xas_store(&xas, folio);
-			xas_next(&xas);
-		}
-		address_space->nrpages += nr;
-		__node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
-		__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
-unlock:
-		xas_unlock_irq(&xas);
-	} while (xas_nomem(&xas, gfp));
-
-	if (!xas_error(&xas))
-		return 0;
+	node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+	lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
 
-	folio_clear_swapcache(folio);
-	folio_ref_sub(folio, nr);
-	return xas_error(&xas);
+	if (shadowp)
+		*shadowp = shadow;
 }
 
 /**
  * __swap_cache_del_folio - Removes a folio from the swap cache.
+ * @ci: The locked swap cluster.
  * @folio: The folio.
  * @entry: The first swap entry that the folio corresponds to.
  * @shadow: shadow value to be filled in the swap cache.
@@ -181,34 +180,36 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
  * Removes a folio from the swap cache and fills a shadow in place.
  * This won't put the folio's refcount. The caller has to do that.
  *
- * Context: Caller must hold the xa_lock, ensure the folio is
- * locked and in the swap cache, using the index of @entry.
+ * Context: Caller must ensure the folio is locked and in the swap cache
+ * using the index of @entry, and lock the cluster that holds the entries.
 */
-void __swap_cache_del_folio(struct folio *folio,
+void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 			    swp_entry_t entry, void *shadow)
 {
-	struct address_space *address_space = swap_address_space(entry);
-	int i;
-	long nr = folio_nr_pages(folio);
-	pgoff_t idx = swap_cache_index(entry);
-	XA_STATE(xas, &address_space->i_pages, idx);
-
-	xas_set_update(&xas, workingset_update_node);
-
-	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
-	VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
-
-	for (i = 0; i < nr; i++) {
-		void *entry = xas_store(&xas, shadow);
-		VM_BUG_ON_PAGE(entry != folio, entry);
-		xas_next(&xas);
-	}
+	unsigned long old_tb, new_tb;
+	unsigned int ci_start, ci_off, ci_end;
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) != ci);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
+
+	new_tb = shadow_swp_to_tb(shadow);
+	ci_start = swp_cluster_offset(entry);
+	ci_end = ci_start + nr_pages;
+	ci_off = ci_start;
+	do {
+		/* If shadow is NULL, we set an empty shadow */
+		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
+		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
			     swp_tb_to_folio(old_tb) != folio);
+	} while (++ci_off < ci_end);
+
 	folio->swap.val = 0;
 	folio_clear_swapcache(folio);
-	address_space->nrpages -= nr;
-	__node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
-	__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
+	node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
+	lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
 }
 
 /**
@@ -223,12 +224,12 @@ void __swap_cache_del_folio(struct folio *folio,
  */
 void swap_cache_del_folio(struct folio *folio)
 {
+	struct swap_cluster_info *ci;
 	swp_entry_t entry = folio->swap;
-	struct address_space *address_space = swap_address_space(entry);
 
-	xa_lock_irq(&address_space->i_pages);
-	__swap_cache_del_folio(folio, entry, NULL);
-	xa_unlock_irq(&address_space->i_pages);
+	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+	__swap_cache_del_folio(ci, folio, entry, NULL);
+	swap_cluster_unlock(ci);
 
 	put_swap_folio(folio, entry);
 	folio_ref_sub(folio, folio_nr_pages(folio));
@@ -236,6 +237,7 @@ void swap_cache_del_folio(struct folio *folio)
 
 /**
  * __swap_cache_replace_folio - Replace a folio in the swap cache.
+ * @ci: The locked swap cluster.
  * @old: The old folio to be replaced.
  * @new: The new folio.
  *
@@ -244,65 +246,62 @@ void swap_cache_del_folio(struct folio *folio)
  * entries. Replacement will take the new folio's swap entry value as
  * the starting offset to override all slots covered by the new folio.
  *
- * Context: Caller must ensure both folios are locked, also lock the
- * swap address_space that holds the old folio to avoid races.
+ * Context: Caller must ensure both folios are locked, and lock the
+ * cluster that holds the old folio to be replaced.
 */
-void __swap_cache_replace_folio(struct folio *old, struct folio *new)
+void __swap_cache_replace_folio(struct swap_cluster_info *ci,
+				struct folio *old, struct folio *new)
 {
 	swp_entry_t entry = new->swap;
 	unsigned long nr_pages = folio_nr_pages(new);
-	unsigned long offset = swap_cache_index(entry);
-	unsigned long end = offset + nr_pages;
-
-	XA_STATE(xas, &swap_address_space(entry)->i_pages, offset);
+	unsigned int ci_off = swp_cluster_offset(entry);
+	unsigned int ci_end = ci_off + nr_pages;
+	unsigned long old_tb, new_tb;
 
 	VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
 	VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
 	VM_WARN_ON_ONCE(!entry.val);
 
 	/* Swap cache still stores N entries instead of a high-order entry */
+	new_tb = folio_to_swp_tb(new);
 	do {
-		WARN_ON_ONCE(xas_store(&xas, new) != old);
-		xas_next(&xas);
-	} while (++offset < end);
+		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
+		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
+	} while (++ci_off < ci_end);
+
+	/*
	 * If the old folio is partially replaced (e.g., splitting a large
	 * folio, the old folio is shrunk, and new split sub folios replace
	 * the shrunk part), ensure the new folio doesn't overlap it.
+	 */
+	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
+	    folio_order(old) != folio_order(new)) {
+		ci_off = swp_cluster_offset(old->swap);
+		ci_end = ci_off + folio_nr_pages(old);
+		while (ci_off++ < ci_end)
+			WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
+	}
 }
 
 /**
  * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
- * @type: Indicates the swap device.
- * @begin: Beginning offset of the range.
- * @end: Ending offset of the range.
+ * @entry: The starting index entry.
+ * @nr_ents: How many slots need to be cleared.
  *
- * Context: Caller must ensure the range is valid and hold a reference to
- * the swap device.
+ * Context: Caller must ensure the range is valid, all in one single cluster,
+ * not occupied by any folio, and lock the cluster.
 */
-void swap_cache_clear_shadow(int type, unsigned long begin,
-			     unsigned long end)
+void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
 {
-	unsigned long curr = begin;
-	void *old;
-
-	for (;;) {
-		swp_entry_t entry = swp_entry(type, curr);
-		unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
-		struct address_space *address_space = swap_address_space(entry);
-		XA_STATE(xas, &address_space->i_pages, index);
-
-		xas_set_update(&xas, workingset_update_node);
-
-		xa_lock_irq(&address_space->i_pages);
-		xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
-			if (!xa_is_value(old))
-				continue;
-			xas_store(&xas, NULL);
-		}
-		xa_unlock_irq(&address_space->i_pages);
+	struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
+	unsigned int ci_off = swp_cluster_offset(entry), ci_end;
+	unsigned long old;
 
-		/* search the next swapcache until we meet end */
-		curr = ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES);
-		if (curr > end)
-			break;
-	}
+	ci_end = ci_off + nr_ents;
+	do {
+		old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
+		WARN_ON_ONCE(swp_tb_is_folio(old));
+	} while (++ci_off < ci_end);
 }
 
 /*
@@ -482,10 +481,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
 		goto fail_unlock;
 
-	/* May fail (-ENOMEM) if XArray node allocation failed. */
-	if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
-		goto fail_unlock;
-
+	swap_cache_add_folio(new_folio, entry, &shadow);
 	memcg1_swapin(entry, 1);
 
 	if (shadow)
@@ -677,41 +673,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	return folio;
 }
 
-int init_swap_address_space(unsigned int type, unsigned long nr_pages)
-{
-	struct address_space *spaces, *space;
-	unsigned int i, nr;
-
-	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
-	spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
-	if (!spaces)
-		return -ENOMEM;
-	for (i = 0; i < nr; i++) {
-		space = spaces + i;
-		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
-		atomic_set(&space->i_mmap_writable, 0);
-		space->a_ops = &swap_aops;
-		/* swap cache doesn't use writeback related tags */
-		mapping_set_no_writeback_tags(space);
-	}
-	nr_swapper_spaces[type] = nr;
-	swapper_spaces[type] = spaces;
-
-	return 0;
-}
-
-void exit_swap_address_space(unsigned int type)
-{
-	int i;
-	struct address_space *spaces = swapper_spaces[type];
-
-	for (i = 0; i < nr_swapper_spaces[type]; i++)
-		VM_WARN_ON_ONCE(!mapping_empty(&spaces[i]));
-	kvfree(spaces);
-	nr_swapper_spaces[type] = 0;
-	swapper_spaces[type] = NULL;
-}
-
 static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
 			   unsigned long *end)
 {
@@ -884,7 +845,7 @@ static const struct attribute_group swap_attr_group = {
 	.attrs = swap_attrs,
 };
 
-static int __init swap_init_sysfs(void)
+static int __init swap_init(void)
 {
 	int err;
 	struct kobject *swap_kobj;
@@ -899,11 +860,13 @@ static int __init swap_init_sysfs(void)
 		pr_err("failed to register swap group\n");
 		goto delete_obj;
 	}
+	/* Swap cache writeback is LRU based, no tags for it */
+	mapping_set_no_writeback_tags(&swap_space);
 	return 0;
 
 delete_obj:
 	kobject_put(swap_kobj);
 	return err;
 }
-subsys_initcall(swap_init_sysfs);
+subsys_initcall(swap_init);
 #endif
diff --git a/mm/swap_table.h b/mm/swap_table.h
new file mode 100644
index 000000000000..e1f7cc009701
--- /dev/null
+++ b/mm/swap_table.h
@@ -0,0 +1,97 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _MM_SWAP_TABLE_H
+#define _MM_SWAP_TABLE_H
+
+#include "swap.h"
+
+/*
+ * A swap table entry represents the status of a swap slot on a swap
+ * (physical or virtual) device. The swap table in each cluster is a
+ * 1:1 map of the swap slots in this cluster.
+ *
+ * Each swap table entry could be a pointer (folio), a XA_VALUE
+ * (shadow), or NULL.
+ */
+
+/*
+ * Helpers for casting one type of info into a swap table entry.
+ */
+static inline unsigned long null_to_swp_tb(void)
+{
+	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
+	return 0;
+}
+
+static inline unsigned long folio_to_swp_tb(struct folio *folio)
+{
+	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
+	return (unsigned long)folio;
+}
+
+static inline unsigned long shadow_swp_to_tb(void *shadow)
+{
+	BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
+		     BITS_PER_BYTE * sizeof(unsigned long));
+	VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
+	return (unsigned long)shadow;
+}
+
+/*
+ * Helpers for swap table entry type checking.
+ */
+static inline bool swp_tb_is_null(unsigned long swp_tb)
+{
+	return !swp_tb;
+}
+
+static inline bool swp_tb_is_folio(unsigned long swp_tb)
+{
+	return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
+}
+
+static inline bool swp_tb_is_shadow(unsigned long swp_tb)
+{
+	return xa_is_value((void *)swp_tb);
+}
+
+/*
+ * Helpers for retrieving info from swap table.
+ */
+static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
+{
+	VM_WARN_ON(!swp_tb_is_folio(swp_tb));
+	return (void *)swp_tb;
+}
+
+static inline void *swp_tb_to_shadow(unsigned long swp_tb)
+{
+	VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
+	return (void *)swp_tb;
+}
+
+/*
+ * Helpers for accessing or modifying the swap table of a cluster,
+ * the swap cluster must be locked.
+ */
+static inline void __swap_table_set(struct swap_cluster_info *ci,
+				    unsigned int off, unsigned long swp_tb)
+{
+	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+	atomic_long_set(&ci->table[off], swp_tb);
+}
+
+static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci,
+					      unsigned int off, unsigned long swp_tb)
+{
+	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+	/* Ordering is guaranteed by cluster lock, relax */
+	return atomic_long_xchg_relaxed(&ci->table[off], swp_tb);
+}
+
+static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
+					     unsigned int off)
+{
+	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+	return atomic_long_read(&ci->table[off]);
+}
+#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 51f781c43537..b183e96be289 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -46,6 +46,7 @@
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
 #include <linux/swap_cgroup.h>
+#include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
 
@@ -421,6 +422,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
 	return cluster_index(si, ci) * SWAPFILE_CLUSTER;
 }
 
+static int swap_cluster_alloc_table(struct swap_cluster_info *ci)
+{
+	WARN_ON(ci->table);
+	ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
+	if (!ci->table)
+		return -ENOMEM;
+	return 0;
+}
+
+static void swap_cluster_free_table(struct swap_cluster_info *ci)
+{
+	unsigned int ci_off;
+	unsigned long swp_tb;
+
+	if (!ci->table)
+		return;
+
+	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
+		swp_tb = __swap_table_get(ci, ci_off);
+		if (!swp_tb_is_null(swp_tb))
+			pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
+				    swp_tb);
+	}
+
+	kfree(ci->table);
+	ci->table = NULL;
+}
+
 static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
 			 enum swap_cluster_flags new_flags)
@@ -703,6 +732,26 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 	return true;
 }
 
+/*
+ * Currently, the swap table is not used for count tracking, just
+ * do a sanity check here to ensure nothing leaked, so the swap
+ * table should be empty upon freeing.
+ */
+static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
+					    unsigned int start, unsigned int nr)
+{
+	unsigned int ci_off = start % SWAPFILE_CLUSTER;
+	unsigned int ci_end = ci_off + nr;
+	unsigned long swp_tb;
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		do {
+			swp_tb = __swap_table_get(ci, ci_off);
+			VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
+		} while (++ci_off < ci_end);
+	}
+}
+
 static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
 				unsigned int start, unsigned char usage,
 				unsigned int order)
@@ -722,6 +771,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	ci->order = order;
 
 	memset(si->swap_map + start, usage, nr_pages);
+	swap_cluster_assert_table_empty(ci, start, nr_pages);
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
@@ -1124,7 +1174,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 		swap_slot_free_notify(si->bdev, offset);
 		offset++;
 	}
-	swap_cache_clear_shadow(si->type, begin, end);
+	__swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
 
 	/*
	 * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1281,16 +1331,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 	if (!entry.val)
 		return -ENOMEM;
 
-	/*
-	 * XArray node allocations from PF_MEMALLOC contexts could
-	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
-	 * stops emergency reserves from being allocated.
-	 *
-	 * TODO: this could cause a theoretical memory reclaim
-	 * deadlock in the swap out path.
-	 */
-	if (swap_cache_add_folio(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
-		goto out_free;
+	swap_cache_add_folio(folio, entry, NULL);
 
 	return 0;
 
@@ -1556,6 +1597,7 @@ static void swap_entries_free(struct swap_info_struct *si,
 
 	mem_cgroup_uncharge_swap(entry, nr_pages);
 	swap_range_free(si, offset, nr_pages);
+	swap_cluster_assert_table_empty(ci, offset, nr_pages);
 
 	if (!ci->count)
 		free_cluster(si, ci);
@@ -2634,6 +2676,18 @@ static void wait_for_allocation(struct swap_info_struct *si)
 	}
 }
 
+static void free_cluster_info(struct swap_cluster_info *cluster_info,
+			      unsigned long maxpages)
+{
+	int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
+
+	if (!cluster_info)
+		return;
+	for (i = 0; i < nr_clusters; i++)
+		swap_cluster_free_table(&cluster_info[i]);
+	kvfree(cluster_info);
+}
+
 /*
 * Called after swap device's reference count is dead, so
 * neither scan nor allocation will use it.
@@ -2768,12 +2822,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
-	p->max = 0;
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
 	zeromap = p->zeromap;
 	p->zeromap = NULL;
 	cluster_info = p->cluster_info;
+	free_cluster_info(cluster_info, p->max);
+	p->max = 0;
 	p->cluster_info = NULL;
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
@@ -2784,10 +2839,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	p->global_cluster = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
-	kvfree(cluster_info);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
-	exit_swap_address_space(p->type);
 
 	inode = mapping->host;
 
@@ -3171,8 +3224,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	if (!cluster_info)
 		goto err;
 
-	for (i = 0; i < nr_clusters; i++)
+	for (i = 0; i < nr_clusters; i++) {
 		spin_lock_init(&cluster_info[i].lock);
+		if (swap_cluster_alloc_table(&cluster_info[i]))
+			goto err_free;
+	}
 
 	if (!(si->flags & SWP_SOLIDSTATE)) {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
@@ -3233,9 +3289,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	}
 
 	return cluster_info;
-
 err_free:
-	kvfree(cluster_info);
+	free_cluster_info(cluster_info, maxpages);
 err:
 	return ERR_PTR(err);
 }
@@ -3429,13 +3484,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		}
 	}
 
-	error = init_swap_address_space(si->type, maxpages);
-	if (error)
-		goto bad_swap_unlock_inode;
-
 	error = zswap_swapon(si->type, maxpages);
 	if (error)
-		goto free_swap_address_space;
+		goto bad_swap_unlock_inode;
 
 	/*
	 * Flush any pending IO and dirty mappings before we start using this
@@ -3470,8 +3521,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	goto out;
 free_swap_zswap:
 	zswap_swapoff(si->type);
-free_swap_address_space:
-	exit_swap_address_space(si->type);
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
@@ -3486,7 +3535,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	spin_unlock(&swap_lock);
 	vfree(swap_map);
 	kvfree(zeromap);
-	kvfree(cluster_info);
+	if (cluster_info)
+		free_cluster_info(cluster_info, maxpages);
 	if (inced_nr_rotate_swap)
 		atomic_dec(&nr_rotate_swap);
 	if (swap_file)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c79c6806560b..e170c12e2065 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -730,13 +730,18 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 {
 	int refcount;
 	void *shadow = NULL;
+	struct swap_cluster_info *ci;
 
 	BUG_ON(!folio_test_locked(folio));
 	BUG_ON(mapping != folio_mapping(folio));
 
-	if (!folio_test_swapcache(folio))
+	if (folio_test_swapcache(folio)) {
+		ci = swap_cluster_get_and_lock_irq(folio);
+	} else {
 		spin_lock(&mapping->host->i_lock);
-	xa_lock_irq(&mapping->i_pages);
+		xa_lock_irq(&mapping->i_pages);
+	}
+
 	/*
	 * The non racy check for a busy folio.
	 *
@@ -776,9 +781,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(folio, target_memcg);
-		__swap_cache_del_folio(folio, swap, shadow);
+		__swap_cache_del_folio(ci, folio, swap, shadow);
 		memcg1_swapout(folio, swap);
-		xa_unlock_irq(&mapping->i_pages);
+		swap_cluster_unlock_irq(ci);
 		put_swap_folio(folio, swap);
 	} else {
 		void (*free_folio)(struct folio *);
@@ -816,9 +821,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 	return 1;
 
 cannot_free:
-	xa_unlock_irq(&mapping->i_pages);
-	if (!folio_test_swapcache(folio))
+	if (folio_test_swapcache(folio)) {
+		swap_cluster_unlock_irq(ci);
+	} else {
+		xa_unlock_irq(&mapping->i_pages);
 		spin_unlock(&mapping->host->i_lock);
+	}
 	return 0;
 }
 
-- 
2.51.0
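To recap the caller-side pattern this patch establishes (a sketch only,
mirroring the shmem and vmscan hunks above): the cluster lock, obtained
through the folio, replaces the old XArray lock around every swap cache
update:

    struct swap_cluster_info *ci;

    /* folio locked and in swap cache */
    ci = swap_cluster_get_and_lock_irq(old);
    __swap_cache_replace_folio(ci, old, new);
    swap_cluster_unlock_irq(ci);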
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
    Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
    Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
    linux-kernel@vger.kernel.org
Subject: [PATCH v4 12/15] mm, swap: mark swap address space ro and add
 context debug check
Date: Wed, 17 Sep 2025 00:00:57 +0800
Message-ID: <20250916160100.31545-13-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

The swap cache is now backed by the swap table, and the address space
no longer holds any mutable data. The swap cache is now protected by
the swap cluster lock instead of the XArray lock, and all access to it
is wrapped by the swap cache helpers. Locking is mostly handled
internally by those helpers; only a few __swap_cache_* helpers require
the caller to lock the cluster itself.

Worth noting that, unlike the XArray lock, the cluster lock is not IRQ
safe. The swap cache was always very different from the filemap, and it
is now completely separated from it: nothing marks, changes, or runs a
writeback callback on swap cache entries in IRQ context. So explicitly
document this and add a debug check to avoid further potential misuse,
and mark the swap cache address space as read-only to keep users from
wrongly mixing filemap helpers with the swap cache.
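As an illustrative aside (not part of the patch): the rule above reduces
to "take the cluster lock only from task context". A minimal sketch with
a hypothetical helper name; the real check is the
VM_WARN_ON_ONCE(!in_task()) added to __swap_cluster_lock() in the diff
below.

	/* Sketch: cluster lock discipline for a swap cache update. */
	static void sketch_swap_cache_update(struct swap_cluster_info *ci,
					     unsigned int off, unsigned long swp_tb)
	{
		VM_WARN_ON_ONCE(!in_task());	/* never called from IRQ or softirq */
		spin_lock(&ci->lock);		/* plain spin_lock: no IRQ users exist */
		__swap_table_set(ci, off, swp_tb);	/* caller-locked helper */
		spin_unlock(&ci->lock);
	}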
Signed-off-by: Kairui Song
Acked-by: Chris Li
Acked-by: David Hildenbrand
Suggested-by: Chris Li
---
 mm/swap.h       | 12 +++++++++++-
 mm/swap_state.c |  3 ++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 742db4d46d23..adcd85fa8538 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -99,6 +99,16 @@ static __always_inline struct swap_cluster_info *__swap_cluster_lock(
 {
 	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
 
+	/*
+	 * Nothing modifies swap cache in an IRQ context. All access to
+	 * swap cache is wrapped by swap_cache_* helpers, and swap cache
+	 * writeback is handled outside of IRQs. Swapin or swapout never
+	 * occurs in IRQ, and neither does in-place split or replace.
+	 *
+	 * Besides, modifying swap cache requires synchronization with
+	 * swap_map, which was never IRQ safe.
+	 */
+	VM_WARN_ON_ONCE(!in_task());
 	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
 	if (irq)
 		spin_lock_irq(&ci->lock);
@@ -192,7 +202,7 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 #define SWAP_ADDRESS_SPACE_SHIFT	14
 #define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
 #define SWAP_ADDRESS_SPACE_MASK	(SWAP_ADDRESS_SPACE_PAGES - 1)
-extern struct address_space swap_space;
+extern struct address_space swap_space __ro_after_init;
 static inline struct address_space *swap_address_space(swp_entry_t entry)
 {
 	return &swap_space;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2558a648d671..a1478cbff384 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -37,7 +37,8 @@ static const struct address_space_operations swap_aops = {
 #endif
 };
 
-struct address_space swap_space __read_mostly = {
+/* Set swap_space as read only as swap cache is handled by swap table */
+struct address_space swap_space __ro_after_init = {
 	.a_ops = &swap_aops,
 };
 
-- 
2.51.0
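As an illustrative aside (not part of the series): __ro_after_init data
is writable only during boot; once the kernel seals rodata, the backing
page is mapped read-only and any later store faults loudly, which is
exactly the protection swap_space gains above. A minimal sketch, with
hypothetical names:

	#include <linux/init.h>
	#include <linux/cache.h>

	static int sketch_mode __ro_after_init;

	static int __init sketch_mode_setup(void)
	{
		sketch_mode = 1;	/* fine: runs before rodata is sealed */
		return 0;
	}
	core_initcall(sketch_mode_setup);

	/* A write to sketch_mode after init faults instead of corrupting it. */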
From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
    Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
    Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
    linux-kernel@vger.kernel.org, kernel test robot
Subject: [PATCH v4 13/15] mm, swap: remove contention workaround for swap cache
Date: Wed, 17 Sep 2025 00:00:58 +0800
Message-ID: <20250916160100.31545-14-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>
From: Kairui Song

Swap cluster setup used to shuffle the clusters on initialization. This
was helpful to avoid contention for the swap cache space: the cluster
size (2M) was much smaller than each swap cache space (64M), so
shuffling the clusters meant the allocator would hand out swap slots
from different swap cache spaces to each CPU, reducing the chance of
two CPUs using the same swap cache space and hence reducing contention.

Now that the swap cache is managed by swap clusters, this shuffling is
pointless. Just remove it, and clean up the related macros.

This also improves HDD swap performance, as shuffled IO is a bad idea
for HDDs, and the shuffling is now gone. Tests have shown a ~40%
performance gain for HDDs [1]:

Doing sequential swap-in of 8G data using 8 processes with usemem,
average of 3 test runs:

Before: 1270.91 KB/s per process
After:  1849.54 KB/s per process

Link: https://lore.kernel.org/linux-mm/CAMgjq7AdauQ8=X0zeih2r21QoV=-WWj1hyBxLWRzq74n-C=-Ng@mail.gmail.com/ [1]
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
Signed-off-by: Kairui Song
Acked-by: Chris Li
Reviewed-by: Barry Song
Acked-by: David Hildenbrand
Suggested-by: Chris Li
---
 mm/swap.h     |  4 ----
 mm/swapfile.c | 32 ++++++++------------------------
 mm/zswap.c    |  7 +++++--
 3 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index adcd85fa8538..fe5c20922082 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -198,10 +198,6 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
 void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 
 /* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
-#define SWAP_ADDRESS_SPACE_SHIFT	14
-#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
-#define SWAP_ADDRESS_SPACE_MASK	(SWAP_ADDRESS_SPACE_PAGES - 1)
 extern struct address_space swap_space __ro_after_init;
 static inline struct address_space *swap_address_space(swp_entry_t entry)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b183e96be289..314c5c10d3bd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3204,21 +3204,14 @@ static int setup_swap_map(struct swap_info_struct *si,
 	return 0;
 }
 
-#define SWAP_CLUSTER_INFO_COLS						\
-	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
-#define SWAP_CLUSTER_SPACE_COLS						\
-	DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
-#define SWAP_CLUSTER_COLS						\
-	max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
-
 static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 						union swap_header *swap_header,
 						unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
-	unsigned long i, j, idx;
 	int err = -ENOMEM;
+	unsigned long i;
 
 	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
 	if (!cluster_info)
@@ -3267,22 +3260,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
 	}
 
-	/*
-	 * Reduce false cache line sharing between cluster_info and
-	 * sharing same address space.
-	 */
-	for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
-		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
-			struct swap_cluster_info *ci;
-			idx = i * SWAP_CLUSTER_COLS + j;
-			ci = cluster_info + idx;
-			if (idx >= nr_clusters)
-				continue;
-			if (ci->count) {
-				ci->flags = CLUSTER_FLAG_NONFULL;
-				list_add_tail(&ci->list, &si->nonfull_clusters[0]);
-				continue;
-			}
+	for (i = 0; i < nr_clusters; i++) {
+		struct swap_cluster_info *ci = &cluster_info[i];
+
+		if (ci->count) {
+			ci->flags = CLUSTER_FLAG_NONFULL;
+			list_add_tail(&ci->list, &si->nonfull_clusters[0]);
+		} else {
 			ci->flags = CLUSTER_FLAG_FREE;
 			list_add_tail(&ci->list, &si->free_clusters);
 		}
diff --git a/mm/zswap.c b/mm/zswap.c
index 1b1edecde6a7..c1af782e54ec 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -225,10 +225,13 @@ static bool zswap_has_pool;
 * helpers and fwd declarations
 **********************************/
 
+/* One swap address space for each 64M swap space */
+#define ZSWAP_ADDRESS_SPACE_SHIFT	14
+#define ZSWAP_ADDRESS_SPACE_PAGES	(1 << ZSWAP_ADDRESS_SPACE_SHIFT)
 static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 {
 	return &zswap_trees[swp_type(swp)][swp_offset(swp)
-		>> SWAP_ADDRESS_SPACE_SHIFT];
+		>> ZSWAP_ADDRESS_SPACE_SHIFT];
 }
 
 #define zswap_pool_debug(msg, p)	\
@@ -1674,7 +1677,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
 	struct xarray *trees, *tree;
 	unsigned int nr, i;
 
-	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+	nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
 	trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
 	if (!trees) {
 		pr_err("alloc failed, zswap disabled for swap type %d\n", type);
-- 
2.51.0
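As a quick sanity check of the sizing preserved above (an illustrative
sketch, not kernel code; nr_trees() is a hypothetical helper): each
zswap tree spans 1 << 14 slots of 4 KiB, i.e. 64 MiB of swap, so a
4 GiB swap device gets 64 trees.

	#define ZSWAP_ADDRESS_SPACE_SHIFT	14
	#define ZSWAP_ADDRESS_SPACE_PAGES	(1UL << ZSWAP_ADDRESS_SPACE_SHIFT)
	#define DIV_ROUND_UP(n, d)		(((n) + (d) - 1) / (d))

	static unsigned long nr_trees(unsigned long nr_pages)
	{
		return DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
	}

	/* nr_trees(1048576) == 64: 4 GiB of 4 KiB pages, 64 MiB per tree. */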
From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
    Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
    Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
    linux-kernel@vger.kernel.org
Subject: [PATCH v4 14/15] mm, swap: implement dynamic allocation of swap
 table
Date: Wed, 17 Sep 2025 00:00:59 +0800
Message-ID: <20250916160100.31545-15-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

Now that the swap table is cluster based, a free cluster can free its
table, since no one should be modifying
it. There could still be speculative readers, such as swap cache
lookups; protect them by making the table pointer RCU protected. All
swap tables are filled with null entries before being freed, so such
readers will see either a NULL pointer or a null-filled table being
lazily freed.

On allocation, allocate the table when a cluster is put to use by any
order. This way, we can reduce the memory usage of large swap devices
significantly.

The idea of dynamically releasing unused swap cluster data was
initially suggested by Chris Li while proposing the cluster swap
allocator, and it suits the swap table idea very well.

Co-developed-by: Chris Li
Signed-off-by: Chris Li
Signed-off-by: Kairui Song
Reviewed-by: Barry Song
Suggested-by: Chris Li
---
 mm/swap.h       |   2 +-
 mm/swap_state.c |   9 +--
 mm/swap_table.h |  37 ++++++++-
 mm/swapfile.c   | 197 +++++++++++++++++++++++++++++++++++++-----------
 4 files changed, 194 insertions(+), 51 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index fe5c20922082..8d8efdf1297a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -36,7 +36,7 @@ struct swap_cluster_info {
 	u16 count;
 	u8 flags;
 	u8 order;
-	atomic_long_t *table;	/* Swap table entries, see mm/swap_table.h */
+	atomic_long_t __rcu *table;	/* Swap table entries, see mm/swap_table.h */
 	struct list_head list;
 };
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index a1478cbff384..b13e9c4baa90 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -91,8 +91,8 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
 	struct folio *folio;
 
 	for (;;) {
-		swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
-					  swp_cluster_offset(entry));
+		swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
+					swp_cluster_offset(entry));
 		if (!swp_tb_is_folio(swp_tb))
 			return NULL;
 		folio = swp_tb_to_folio(swp_tb);
@@ -115,11 +115,10 @@ void *swap_cache_get_shadow(swp_entry_t entry)
 {
 	unsigned long swp_tb;
 
-	swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
-				  swp_cluster_offset(entry));
+	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
+				swp_cluster_offset(entry));
 	if (swp_tb_is_shadow(swp_tb))
 		return swp_tb_to_shadow(swp_tb);
-
 	return NULL;
 }
 
diff --git a/mm/swap_table.h b/mm/swap_table.h
index e1f7cc009701..52254e455304 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -2,8 +2,15 @@
 #ifndef _MM_SWAP_TABLE_H
 #define _MM_SWAP_TABLE_H
 
+#include <linux/rcupdate.h>
+#include <linux/atomic.h>
 #include "swap.h"
 
+/* A typical flat array in each cluster as swap table */
+struct swap_table {
+	atomic_long_t entries[SWAPFILE_CLUSTER];
+};
+
 /*
 * A swap table entry represents the status of a swap slot on a swap
 * (physical or virtual) device. The swap table in each cluster is a
@@ -76,22 +83,46 @@ static inline void *swp_tb_to_shadow(unsigned long swp_tb)
 static inline void __swap_table_set(struct swap_cluster_info *ci,
 				    unsigned int off, unsigned long swp_tb)
 {
+	atomic_long_t *table = rcu_dereference_protected(ci->table, true);
+
+	lockdep_assert_held(&ci->lock);
 	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
-	atomic_long_set(&ci->table[off], swp_tb);
+	atomic_long_set(&table[off], swp_tb);
 }
 
 static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci,
 					      unsigned int off, unsigned long swp_tb)
 {
+	atomic_long_t *table = rcu_dereference_protected(ci->table, true);
+
+	lockdep_assert_held(&ci->lock);
 	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
 	/* Ordering is guaranteed by cluster lock, relax */
-	return atomic_long_xchg_relaxed(&ci->table[off], swp_tb);
+	return atomic_long_xchg_relaxed(&table[off], swp_tb);
 }
 
 static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
 					     unsigned int off)
 {
+	atomic_long_t *table;
+
 	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
-	return atomic_long_read(&ci->table[off]);
+	table = rcu_dereference_check(ci->table, lockdep_is_held(&ci->lock));
+
+	return atomic_long_read(&table[off]);
+}
+
+static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
+					   unsigned int off)
+{
+	atomic_long_t *table;
+	unsigned long swp_tb;
+
+	rcu_read_lock();
+	table = rcu_dereference(ci->table);
+	swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
+	rcu_read_unlock();
+
+	return swp_tb;
 }
 #endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 314c5c10d3bd..094e3e75849f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -59,6 +59,9 @@ static void swap_entries_free(struct swap_info_struct *si,
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
+static void move_cluster(struct swap_info_struct *si,
+			 struct swap_cluster_info *ci, struct list_head *list,
+			 enum swap_cluster_flags new_flags);
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -105,6 +108,8 @@ static DEFINE_SPINLOCK(swap_avail_lock);
 
 struct swap_info_struct *swap_info[MAX_SWAPFILES];
 
+static struct kmem_cache *swap_table_cachep;
+
 static DEFINE_MUTEX(swapon_mutex);
 
 static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
@@ -401,10 +406,17 @@ static inline bool cluster_is_discard(struct swap_cluster_info *info)
 	return info->flags == CLUSTER_FLAG_DISCARD;
 }
 
+static inline bool cluster_table_is_alloced(struct swap_cluster_info *ci)
+{
+	return rcu_dereference_protected(ci->table, lockdep_is_held(&ci->lock));
+}
+
 static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
 {
 	if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
 		return false;
+	if (!cluster_table_is_alloced(ci))
+		return false;
 	if (!order)
 		return true;
 	return cluster_is_empty(ci) || order == ci->order;
@@ -422,32 +434,90 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
 	return cluster_index(si, ci) * SWAPFILE_CLUSTER;
 }
 
-static int swap_cluster_alloc_table(struct swap_cluster_info *ci)
+static void swap_cluster_free_table(struct swap_cluster_info *ci)
 {
-	WARN_ON(ci->table);
-	ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
-	if (!ci->table)
-		return -ENOMEM;
-	return 0;
+	unsigned int ci_off;
+	struct swap_table *table;
+
+	/* Only an empty cluster's table is allowed to be freed */
+	lockdep_assert_held(&ci->lock);
+	VM_WARN_ON_ONCE(!cluster_is_empty(ci));
+	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
+		VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
+	table = (void *)rcu_dereference_protected(ci->table, true);
+	rcu_assign_pointer(ci->table, NULL);
+
+	kmem_cache_free(swap_table_cachep, table);
 }
 
-static void swap_cluster_free_table(struct swap_cluster_info *ci)
+/*
+ * Allocate swap table for one cluster. Attempt an atomic allocation first,
+ * then fall back to a sleeping allocation.
+ */
+static struct swap_cluster_info *
+swap_cluster_alloc_table(struct swap_info_struct *si,
+			 struct swap_cluster_info *ci)
 {
-	unsigned int ci_off;
-	unsigned long swp_tb;
+	struct swap_table *table;
 
-	if (!ci->table)
-		return;
+	/*
+	 * Only cluster isolation from the allocator does table allocation.
+	 * The swap allocator uses percpu clusters and holds the local lock.
+	 */
+	lockdep_assert_held(&ci->lock);
+	lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
+
+	/* The cluster must be free and was just isolated from the free list. */
+	VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci));
+
+	table = kmem_cache_zalloc(swap_table_cachep,
+				  __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+	if (table) {
+		rcu_assign_pointer(ci->table, table);
+		return ci;
+	}
+
+	/*
+	 * Try a sleeping allocation. Each isolated free cluster may cause
+	 * one sleeping allocation, but there is a limited number of them,
+	 * so the potential recursive allocation is limited.
+	 */
+	spin_unlock(&ci->lock);
+	if (!(si->flags & SWP_SOLIDSTATE))
+		spin_unlock(&si->global_cluster_lock);
+	local_unlock(&percpu_swap_cluster.lock);
+
+	table = kmem_cache_zalloc(swap_table_cachep,
+				  __GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL);
+
+	/*
+	 * Back to atomic context. We might have migrated to a new CPU with a
+	 * usable percpu cluster. But just keep using the isolated cluster to
+	 * make things easier. Migration indicates a slight change of workload,
+	 * so using a new free cluster might not be a bad idea. The worst that
+	 * ignoring the percpu cluster can cause is fragmentation, which is
+	 * acceptable since this fallback and race are rare.
+	 */
+	local_lock(&percpu_swap_cluster.lock);
+	if (!(si->flags & SWP_SOLIDSTATE))
+		spin_lock(&si->global_cluster_lock);
+	spin_lock(&ci->lock);
 
-	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
-		swp_tb = __swap_table_get(ci, ci_off);
-		if (!swp_tb_is_null(swp_tb))
-			pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
-				    swp_tb);
+	/* Nothing except this helper should touch a dangling empty cluster. */
+	if (WARN_ON_ONCE(cluster_table_is_alloced(ci))) {
+		if (table)
+			kmem_cache_free(swap_table_cachep, table);
+		return ci;
 	}
 
-	kfree(ci->table);
-	ci->table = NULL;
+	if (!table) {
+		move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
+		spin_unlock(&ci->lock);
+		return NULL;
+	}
+
+	rcu_assign_pointer(ci->table, table);
+	return ci;
 }
 
 static void move_cluster(struct swap_info_struct *si,
@@ -479,7 +549,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	lockdep_assert_held(&ci->lock);
+	swap_cluster_free_table(ci);
 	move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
 }
@@ -494,15 +564,11 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
 * this returns NULL for a non-empty list.
 */
 static struct swap_cluster_info *isolate_lock_cluster(
-		struct swap_info_struct *si, struct list_head *list)
+		struct swap_info_struct *si, struct list_head *list, int order)
 {
-	struct swap_cluster_info *ci, *ret = NULL;
+	struct swap_cluster_info *ci, *found = NULL;
 
 	spin_lock(&si->lock);
-
-	if (unlikely(!(si->flags & SWP_WRITEOK)))
-		goto out;
-
 	list_for_each_entry(ci, list, list) {
 		if (!spin_trylock(&ci->lock))
 			continue;
@@ -514,13 +580,19 @@ static struct swap_cluster_info *isolate_lock_cluster(
 
 		list_del(&ci->list);
 		ci->flags = CLUSTER_FLAG_NONE;
-		ret = ci;
+		found = ci;
 		break;
 	}
-out:
 	spin_unlock(&si->lock);
 
-	return ret;
+	if (found && !cluster_table_is_alloced(found)) {
+		/* Only an empty free cluster's swap table can be freed. */
+		VM_WARN_ON_ONCE(list != &si->free_clusters);
+		VM_WARN_ON_ONCE(!cluster_is_empty(found));
+		return swap_cluster_alloc_table(si, found);
+	}
+
+	return found;
 }
 
 /*
@@ -653,17 +725,27 @@ static void relocate_cluster(struct swap_info_struct *si,
 * added to free cluster list and its usage counter will be increased by 1.
 * Only used for initialization.
 */
-static void inc_cluster_info_page(struct swap_info_struct *si,
+static int inc_cluster_info_page(struct swap_info_struct *si,
 	struct swap_cluster_info *cluster_info, unsigned long page_nr)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+	struct swap_table *table;
 	struct swap_cluster_info *ci;
 
 	ci = cluster_info + idx;
+	if (!ci->table) {
+		table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
+		if (!table)
+			return -ENOMEM;
+		rcu_assign_pointer(ci->table, table);
+	}
+
 	ci->count++;
 
 	VM_BUG_ON(ci->count > SWAPFILE_CLUSTER);
 	VM_BUG_ON(ci->flags);
+
+	return 0;
 }
 
 static bool cluster_reclaim_range(struct swap_info_struct *si,
@@ -845,7 +927,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
 	unsigned int found = SWAP_ENTRY_INVALID;
 
 	do {
-		struct swap_cluster_info *ci = isolate_lock_cluster(si, list);
+		struct swap_cluster_info *ci = isolate_lock_cluster(si, list, order);
 		unsigned long offset;
 
 		if (!ci)
@@ -870,7 +952,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 	if (force)
 		to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
 
-	while ((ci = isolate_lock_cluster(si, &si->full_clusters))) {
+	while ((ci = isolate_lock_cluster(si, &si->full_clusters, 0))) {
 		offset = cluster_offset(si, ci);
 		end = min(si->max, offset + SWAPFILE_CLUSTER);
 		to_scan--;
@@ -1018,6 +1100,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 done:
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
+
 	return found;
 }
 
@@ -1885,7 +1968,13 @@ swp_entry_t get_swap_page_of_type(int type)
 	/* This is called for allocating swap entry, not cache */
 	if (get_swap_device_info(si)) {
 		if (si->flags & SWP_WRITEOK) {
+			/*
+			 * Grab the local lock to be compliant
+			 * with swap table allocation.
+			 */
+			local_lock(&percpu_swap_cluster.lock);
 			offset = cluster_alloc_swap_entry(si, 0, 1);
+			local_unlock(&percpu_swap_cluster.lock);
 			if (offset) {
 				entry = swp_entry(si->type, offset);
 				atomic_long_dec(&nr_swap_pages);
@@ -2679,12 +2768,21 @@ static void wait_for_allocation(struct swap_info_struct *si)
 static void free_cluster_info(struct swap_cluster_info *cluster_info,
 			      unsigned long maxpages)
 {
+	struct swap_cluster_info *ci;
 	int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 
 	if (!cluster_info)
 		return;
-	for (i = 0; i < nr_clusters; i++)
-		swap_cluster_free_table(&cluster_info[i]);
+	for (i = 0; i < nr_clusters; i++) {
+		ci = cluster_info + i;
+		/* Clusters with bad slot marks will have a remaining table */
+		spin_lock(&ci->lock);
+		if (rcu_dereference_protected(ci->table, true)) {
+			ci->count = 0;
+			swap_cluster_free_table(ci);
+		}
+		spin_unlock(&ci->lock);
+	}
 	kvfree(cluster_info);
 }
 
@@ -2720,6 +2818,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	struct address_space *mapping;
 	struct inode *inode;
 	struct filename *pathname;
+	unsigned int maxpages;
 	int err, found = 0;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -2826,8 +2925,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	p->swap_map = NULL;
 	zeromap = p->zeromap;
 	p->zeromap = NULL;
+	maxpages = p->max;
 	cluster_info = p->cluster_info;
-	free_cluster_info(cluster_info, p->max);
 	p->max = 0;
 	p->cluster_info = NULL;
 	spin_unlock(&p->lock);
@@ -2839,6 +2938,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	p->global_cluster = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
+	free_cluster_info(cluster_info, maxpages);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
 
@@ -3217,11 +3317,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	if (!cluster_info)
 		goto err;
 
-	for (i = 0; i < nr_clusters; i++) {
+	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
-		if (swap_cluster_alloc_table(&cluster_info[i]))
-			goto err_free;
-	}
 
 	if (!(si->flags & SWP_SOLIDSTATE)) {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
@@ -3240,16 +3337,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	 * See setup_swap_map(): header page, bad pages,
 	 * and the EOF part of the last cluster.
 	 */
-	inc_cluster_info_page(si, cluster_info, 0);
+	err = inc_cluster_info_page(si, cluster_info, 0);
+	if (err)
+		goto err;
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 
 		if (page_nr >= maxpages)
 			continue;
-		inc_cluster_info_page(si, cluster_info, page_nr);
+		err = inc_cluster_info_page(si, cluster_info, page_nr);
+		if (err)
+			goto err;
+	}
+	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
+		err = inc_cluster_info_page(si, cluster_info, i);
+		if (err)
+			goto err;
 	}
-	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
-		inc_cluster_info_page(si, cluster_info, i);
 
 	INIT_LIST_HEAD(&si->free_clusters);
 	INIT_LIST_HEAD(&si->full_clusters);
@@ -3963,6 +4067,15 @@ static int __init swapfile_init(void)
 
 	swapfile_maximum_size = arch_max_swapfile_size();
 
+	/*
+	 * Once a cluster is freed, its swap table content is read
+	 * only, and all swap cache readers (swap_cache_*) verify
+	 * the content before use. So it's safe to use RCU slab here.
+	 */
+	swap_table_cachep = kmem_cache_create("swap_table",
+			sizeof(struct swap_table),
+			0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
+
 #ifdef CONFIG_MIGRATION
 	if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
 		swap_migration_ad_supported = true;
-- 
2.51.0
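As an illustrative aside (not part of the patch): the lazy-free scheme
above is the classic RCU pointer pattern. A condensed sketch of the
reader side, mirroring swap_table_get() from the diff; sketch_peek() is
a hypothetical name:

	static unsigned long sketch_peek(struct swap_cluster_info *ci,
					 unsigned int off)
	{
		atomic_long_t *table;
		unsigned long swp_tb = null_to_swp_tb();

		rcu_read_lock();
		table = rcu_dereference(ci->table);	/* NULL once the table is freed */
		if (table)
			swp_tb = atomic_long_read(&table[off]);
		rcu_read_unlock();

		/* Speculative result: re-validate under ci->lock before relying on it. */
		return swp_tb;
	}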
From nobody Thu Oct 2 13:01:38 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
    Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
    Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
    linux-kernel@vger.kernel.org
Subject: [PATCH v4 15/15] mm, swap: use a single page for swap table when
 the size fits
Date: Wed, 17 Sep 2025 00:01:00 +0800
Message-ID: <20250916160100.31545-16-ryncsn@gmail.com>
In-Reply-To: <20250916160100.31545-1-ryncsn@gmail.com>
References: <20250916160100.31545-1-ryncsn@gmail.com>

From: Kairui Song

We have a cluster size of 512 slots. Each slot consumes 8 bytes in the
swap table, so the swap table of each cluster is exactly one page (4K).
If that condition holds, allocate one page directly and disable the
slab cache, to reduce the memory usage of the swap table and avoid
fragmentation.
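The arithmetic is easy to verify (an illustrative sketch under the
stated assumptions: 512 slots per cluster, an 8-byte atomic_long_t
modelled as a plain struct, 4 KiB pages):

	typedef struct { long counter; } sketch_atomic_long_t;	/* 8 bytes on 64-bit */

	#define SKETCH_SWAPFILE_CLUSTER	512	/* slots per cluster */
	#define SKETCH_PAGE_SIZE	4096UL

	struct sketch_swap_table {
		sketch_atomic_long_t entries[SKETCH_SWAPFILE_CLUSTER];
	};

	/* 512 * 8 == 4096: one swap table fits exactly one 4K page. */
	_Static_assert(sizeof(struct sketch_swap_table) == SKETCH_PAGE_SIZE,
		       "swap table should be exactly one page");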
The swap table in each cluster is a diff --git a/mm/swapfile.c b/mm/swapfile.c index 094e3e75849f..890b410d77b6 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -434,6 +434,38 @@ static inline unsigned int cluster_offset(struct swap_= info_struct *si, return cluster_index(si, ci) * SWAPFILE_CLUSTER; } =20 +static struct swap_table *swap_table_alloc(gfp_t gfp) +{ + struct folio *folio; + + if (!SWP_TABLE_USE_PAGE) + return kmem_cache_zalloc(swap_table_cachep, gfp); + + folio =3D folio_alloc(gfp | __GFP_ZERO, 0); + if (folio) + return folio_address(folio); + return NULL; +} + +static void swap_table_free_folio_rcu_cb(struct rcu_head *head) +{ + struct folio *folio; + + folio =3D page_folio(container_of(head, struct page, rcu_head)); + folio_put(folio); +} + +static void swap_table_free(struct swap_table *table) +{ + if (!SWP_TABLE_USE_PAGE) { + kmem_cache_free(swap_table_cachep, table); + return; + } + + call_rcu(&(folio_page(virt_to_folio(table), 0)->rcu_head), + swap_table_free_folio_rcu_cb); +} + static void swap_cluster_free_table(struct swap_cluster_info *ci) { unsigned int ci_off; @@ -447,7 +479,7 @@ static void swap_cluster_free_table(struct swap_cluster= _info *ci) table =3D (void *)rcu_dereference_protected(ci->table, true); rcu_assign_pointer(ci->table, NULL); =20 - kmem_cache_free(swap_table_cachep, table); + swap_table_free(table); } =20 /* @@ -470,8 +502,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si, /* The cluster must be free and was just isolated from the free list. */ VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); =20 - table =3D kmem_cache_zalloc(swap_table_cachep, - __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); + table =3D swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); if (table) { rcu_assign_pointer(ci->table, table); return ci; @@ -487,8 +518,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si, spin_unlock(&si->global_cluster_lock); local_unlock(&percpu_swap_cluster.lock); =20 - table =3D kmem_cache_zalloc(swap_table_cachep, - __GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL); + table =3D swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL); =20 /* * Back to atomic context. We might have migrated to a new CPU with a @@ -506,7 +536,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si, /* Nothing except this helper should touch a dangling empty cluster. */ if (WARN_ON_ONCE(cluster_table_is_alloced(ci))) { if (table) - kmem_cache_free(swap_table_cachep, table); + swap_table_free(table); return ci; } =20 @@ -734,7 +764,7 @@ static int inc_cluster_info_page(struct swap_info_struc= t *si, =20 ci =3D cluster_info + idx; if (!ci->table) { - table =3D kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL); + table =3D swap_table_alloc(GFP_KERNEL); if (!table) return -ENOMEM; rcu_assign_pointer(ci->table, table); @@ -4072,9 +4102,10 @@ static int __init swapfile_init(void) * only, and all swap cache readers (swap_cache_*) verifies * the content before use. So it's safe to use RCU slab here. */ - swap_table_cachep =3D kmem_cache_create("swap_table", - sizeof(struct swap_table), - 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL); + if (!SWP_TABLE_USE_PAGE) + swap_table_cachep =3D kmem_cache_create("swap_table", + sizeof(struct swap_table), + 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL); =20 #ifdef CONFIG_MIGRATION if (swapfile_maximum_size >=3D (1UL << SWP_MIG_TOTAL_BITS)) --=20 2.51.0