From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com,
	yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com,
	chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
	huang.ying.caritas@gmail.com, ryan.roberts@arm.com,
	shikemeng@huaweicloud.com, viro@zeniv.linux.org.uk, baohua@kernel.org,
	bhe@redhat.com, osalvador@suse.de, lorenzo.stoakes@oracle.com,
	christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-pm@vger.kernel.org, peterx@redhat.com, riel@surriel.com,
	joshua.hahnjy@gmail.com, npache@redhat.com, gourry@gourry.net,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	rafael@kernel.org, jannh@google.com, pfalcato@suse.de,
	zhengqi.arch@bytedance.com
Subject: [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure
Date: Sun, 8 Feb 2026 13:58:14 -0800
Message-ID: <20260208215839.87595-2-nphamcs@gmail.com>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <20260208215839.87595-1-nphamcs@gmail.com>
References: <20260208215839.87595-1-nphamcs@gmail.com>

When we virtualize the swap space, we will manage the swap cache at the
virtual swap layer. To prepare for this, decouple the swap cache from the
physical swap infrastructure. We also remove all the swap-cache-related
helpers of the swap table, but keep the rest of the swap table
infrastructure, which will later be repurposed to serve as the rmap
(physical -> virtual swap mapping).

Note that with this patch, we move to a single global lock to synchronize
swap cache accesses. This is temporary: the swap cache will be
re-partitioned into (virtual) swap clusters once it is moved to the
soon-to-be-introduced virtual swap layer.

Signed-off-by: Nhat Pham
---
 Documentation/mm/swap-table.rst |  69 -----------
 mm/huge_memory.c                |  11 +-
 mm/migrate.c                    |  13 +-
 mm/shmem.c                      |   7 +-
 mm/swap.h                       |  26 ++--
 mm/swap_state.c                 | 205 +++++++++++++++++---------------
 mm/swap_table.h                 |  78 +----------
 mm/swapfile.c                   |  43 ++-----
 mm/vmscan.c                     |   9 +-
 9 files changed, 158 insertions(+), 303 deletions(-)
 delete mode 100644 Documentation/mm/swap-table.rst

diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
deleted file mode 100644
index da10bb7a0dc37..0000000000000
--- a/Documentation/mm/swap-table.rst
+++ /dev/null
@@ -1,69 +0,0 @@
-..
SPDX-License-Identifier: GPL-2.0 - -:Author: Chris Li , Kairui Song - -=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D -Swap Table -=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D - -Swap table implements swap cache as a per-cluster swap cache value array. - -Swap Entry ----------- - -A swap entry contains the information required to serve the anonymous page -fault. - -Swap entry is encoded as two parts: swap type and swap offset. - -The swap type indicates which swap device to use. -The swap offset is the offset of the swap file to read the page data from. - -Swap Cache ----------- - -Swap cache is a map to look up folios using swap entry as the key. The res= ult -value can have three possible types depending on which stage of this swap = entry -was in. - -1. NULL: This swap entry is not used. - -2. folio: A folio has been allocated and bound to this swap entry. This is - the transient state of swap out or swap in. The folio data can be in - the folio or swap file, or both. - -3. shadow: The shadow contains the working set information of the swapped - out folio. This is the normal state for a swapped out page. - -Swap Table Internals --------------------- - -The previous swap cache is implemented by XArray. The XArray is a tree -structure. Each lookup will go through multiple nodes. Can we do better? - -Notice that most of the time when we look up the swap cache, we are either -in a swap in or swap out path. We should already have the swap cluster, -which contains the swap entry. - -If we have a per-cluster array to store swap cache value in the cluster. -Swap cache lookup within the cluster can be a very simple array lookup. - -We give such a per-cluster swap cache value array a name: the swap table. - -A swap table is an array of pointers. Each pointer is the same size as a -PTE. The size of a swap table for one swap cluster typically matches a PTE -page table, which is one page on modern 64-bit systems. - -With swap table, swap cache lookup can achieve great locality, simpler, -and faster. - -Locking -------- - -Swap table modification requires taking the cluster lock. If a folio -is being added to or removed from the swap table, the folio must be -locked prior to the cluster lock. After adding or removing is done, the -folio shall be unlocked. - -Swap table lookup is protected by RCU and atomic read. If the lookup -returns a folio, the user must lock the folio before use. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 40cf59301c21a..21215ac870144 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3783,7 +3783,6 @@ static int __folio_freeze_and_split_unmapped(struct f= olio *folio, unsigned int n /* Prevent deferred_split_scan() touching ->_refcount */ ds_queue =3D folio_split_queue_lock(folio); if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) { - struct swap_cluster_info *ci =3D NULL; struct lruvec *lruvec; =20 if (old_order > 1) { @@ -3826,7 +3825,7 @@ static int __folio_freeze_and_split_unmapped(struct f= olio *folio, unsigned int n return -EINVAL; } =20 - ci =3D swap_cluster_get_and_lock(folio); + swap_cache_lock(); } =20 /* lock lru list/PageCompound, ref frozen by page_ref_freeze */ @@ -3862,8 +3861,8 @@ static int __folio_freeze_and_split_unmapped(struct f= olio *folio, unsigned int n * Anonymous folio with swap cache. * NOTE: shmem in swap cache is not supported yet. 
*/ - if (ci) { - __swap_cache_replace_folio(ci, folio, new_folio); + if (folio_test_swapcache(folio)) { + __swap_cache_replace_folio(folio, new_folio); continue; } =20 @@ -3901,8 +3900,8 @@ static int __folio_freeze_and_split_unmapped(struct f= olio *folio, unsigned int n if (do_lru) unlock_page_lruvec(lruvec); =20 - if (ci) - swap_cluster_unlock(ci); + if (folio_test_swapcache(folio)) + swap_cache_unlock(); } else { split_queue_unlock(ds_queue); return -EAGAIN; diff --git a/mm/migrate.c b/mm/migrate.c index 4688b9e38cd2f..11d9b43dff5d8 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -571,7 +571,6 @@ static int __folio_migrate_mapping(struct address_space= *mapping, struct folio *newfolio, struct folio *folio, int expected_count) { XA_STATE(xas, &mapping->i_pages, folio->index); - struct swap_cluster_info *ci =3D NULL; struct zone *oldzone, *newzone; int dirty; long nr =3D folio_nr_pages(folio); @@ -601,13 +600,13 @@ static int __folio_migrate_mapping(struct address_spa= ce *mapping, newzone =3D folio_zone(newfolio); =20 if (folio_test_swapcache(folio)) - ci =3D swap_cluster_get_and_lock_irq(folio); + swap_cache_lock_irq(); else xas_lock_irq(&xas); =20 if (!folio_ref_freeze(folio, expected_count)) { - if (ci) - swap_cluster_unlock_irq(ci); + if (folio_test_swapcache(folio)) + swap_cache_unlock_irq(); else xas_unlock_irq(&xas); return -EAGAIN; @@ -640,7 +639,7 @@ static int __folio_migrate_mapping(struct address_space= *mapping, } =20 if (folio_test_swapcache(folio)) - __swap_cache_replace_folio(ci, folio, newfolio); + __swap_cache_replace_folio(folio, newfolio); else xas_store(&xas, newfolio); =20 @@ -652,8 +651,8 @@ static int __folio_migrate_mapping(struct address_space= *mapping, folio_ref_unfreeze(folio, expected_count - nr); =20 /* Leave irq disabled to prevent preemption while updating stats */ - if (ci) - swap_cluster_unlock(ci); + if (folio_test_swapcache(folio)) + swap_cache_unlock(); else xas_unlock(&xas); =20 diff --git a/mm/shmem.c b/mm/shmem.c index 79af5f9f8b908..1db97ef2d14eb 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2133,7 +2133,6 @@ static int shmem_replace_folio(struct folio **foliop,= gfp_t gfp, struct shmem_inode_info *info, pgoff_t index, struct vm_area_struct *vma) { - struct swap_cluster_info *ci; struct folio *new, *old =3D *foliop; swp_entry_t entry =3D old->swap; int nr_pages =3D folio_nr_pages(old); @@ -2166,12 +2165,12 @@ static int shmem_replace_folio(struct folio **folio= p, gfp_t gfp, new->swap =3D entry; folio_set_swapcache(new); =20 - ci =3D swap_cluster_get_and_lock_irq(old); - __swap_cache_replace_folio(ci, old, new); + swap_cache_lock_irq(); + __swap_cache_replace_folio(old, new); mem_cgroup_replace_folio(old, new); shmem_update_stats(new, nr_pages); shmem_update_stats(old, -nr_pages); - swap_cluster_unlock_irq(ci); + swap_cache_unlock_irq(); =20 folio_add_lru(new); *foliop =3D new; diff --git a/mm/swap.h b/mm/swap.h index 1bd466da30393..8726b587a5b5d 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -199,6 +199,11 @@ void __swap_writepage(struct folio *folio, struct swap= _iocb **swap_plug); =20 /* linux/mm/swap_state.c */ extern struct address_space swap_space __read_mostly; +void swap_cache_lock_irq(void); +void swap_cache_unlock_irq(void); +void swap_cache_lock(void); +void swap_cache_unlock(void); + static inline struct address_space *swap_address_space(swp_entry_t entry) { return &swap_space; @@ -247,14 +252,12 @@ static inline bool folio_matches_swap_entry(const str= uct folio *folio, */ struct folio *swap_cache_get_folio(swp_entry_t entry); void 
*swap_cache_get_shadow(swp_entry_t entry); -void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **s= hadow); +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp= , void **shadow); void swap_cache_del_folio(struct folio *folio); -/* Below helpers require the caller to lock and pass in the swap cluster. = */ -void __swap_cache_del_folio(struct swap_cluster_info *ci, - struct folio *folio, swp_entry_t entry, void *shadow); -void __swap_cache_replace_folio(struct swap_cluster_info *ci, - struct folio *old, struct folio *new); -void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents); +/* Below helpers require the caller to lock the swap cache. */ +void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *= shadow); +void __swap_cache_replace_folio(struct folio *old, struct folio *new); +void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents); =20 void show_swap_cache_info(void); void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int n= r); @@ -411,21 +414,20 @@ static inline void *swap_cache_get_shadow(swp_entry_t= entry) return NULL; } =20 -static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t e= ntry, void **shadow) +static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t en= try, gfp_t gfp, void **shadow) { + return 0; } =20 static inline void swap_cache_del_folio(struct folio *folio) { } =20 -static inline void __swap_cache_del_folio(struct swap_cluster_info *ci, - struct folio *folio, swp_entry_t entry, void *shadow) +static inline void __swap_cache_del_folio(struct folio *folio, swp_entry_t= entry, void *shadow) { } =20 -static inline void __swap_cache_replace_folio(struct swap_cluster_info *ci, - struct folio *old, struct folio *new) +static inline void __swap_cache_replace_folio(struct folio *old, struct fo= lio *new) { } =20 diff --git a/mm/swap_state.c b/mm/swap_state.c index 44d228982521e..34c9d9b243a74 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -22,8 +22,8 @@ #include #include #include +#include #include "internal.h" -#include "swap_table.h" #include "swap.h" =20 /* @@ -41,6 +41,28 @@ struct address_space swap_space __read_mostly =3D { .a_ops =3D &swap_aops, }; =20 +static DEFINE_XARRAY(swap_cache); + +void swap_cache_lock_irq(void) +{ + xa_lock_irq(&swap_cache); +} + +void swap_cache_unlock_irq(void) +{ + xa_unlock_irq(&swap_cache); +} + +void swap_cache_lock(void) +{ + xa_lock(&swap_cache); +} + +void swap_cache_unlock(void) +{ + xa_unlock(&swap_cache); +} + static bool enable_vma_readahead __read_mostly =3D true; =20 #define SWAP_RA_ORDER_CEILING 5 @@ -86,17 +108,22 @@ void show_swap_cache_info(void) */ struct folio *swap_cache_get_folio(swp_entry_t entry) { - unsigned long swp_tb; + void *entry_val; struct folio *folio; =20 for (;;) { - swp_tb =3D swap_table_get(__swap_entry_to_cluster(entry), - swp_cluster_offset(entry)); - if (!swp_tb_is_folio(swp_tb)) + rcu_read_lock(); + entry_val =3D xa_load(&swap_cache, entry.val); + if (!entry_val || xa_is_value(entry_val)) { + rcu_read_unlock(); return NULL; - folio =3D swp_tb_to_folio(swp_tb); - if (likely(folio_try_get(folio))) + } + folio =3D entry_val; + if (likely(folio_try_get(folio))) { + rcu_read_unlock(); return folio; + } + rcu_read_unlock(); } =20 return NULL; @@ -112,12 +139,14 @@ struct folio *swap_cache_get_folio(swp_entry_t entry) */ void *swap_cache_get_shadow(swp_entry_t entry) { - unsigned long swp_tb; + void *entry_val; + + rcu_read_lock(); + entry_val =3D xa_load(&swap_cache, 
entry.val); + rcu_read_unlock(); =20 - swp_tb =3D swap_table_get(__swap_entry_to_cluster(entry), - swp_cluster_offset(entry)); - if (swp_tb_is_shadow(swp_tb)) - return swp_tb_to_shadow(swp_tb); + if (xa_is_value(entry_val)) + return entry_val; return NULL; } =20 @@ -132,46 +161,58 @@ void *swap_cache_get_shadow(swp_entry_t entry) * with reference count or locks. * The caller also needs to update the corresponding swap_map slots with * SWAP_HAS_CACHE bit to avoid race or conflict. + * + * Return: 0 on success, negative error code on failure. */ -void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **s= hadowp) +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp= , void **shadowp) { - void *shadow =3D NULL; - unsigned long old_tb, new_tb; - struct swap_cluster_info *ci; - unsigned int ci_start, ci_off, ci_end; + XA_STATE_ORDER(xas, &swap_cache, entry.val, folio_order(folio)); unsigned long nr_pages =3D folio_nr_pages(folio); + unsigned long i; + void *old; =20 VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); =20 - new_tb =3D folio_to_swp_tb(folio); - ci_start =3D swp_cluster_offset(entry); - ci_end =3D ci_start + nr_pages; - ci_off =3D ci_start; - ci =3D swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry)); - do { - old_tb =3D __swap_table_xchg(ci, ci_off, new_tb); - WARN_ON_ONCE(swp_tb_is_folio(old_tb)); - if (swp_tb_is_shadow(old_tb)) - shadow =3D swp_tb_to_shadow(old_tb); - } while (++ci_off < ci_end); - folio_ref_add(folio, nr_pages); folio_set_swapcache(folio); folio->swap =3D entry; - swap_cluster_unlock(ci); =20 - node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); - lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); + do { + xas_lock_irq(&xas); + xas_create_range(&xas); + if (xas_error(&xas)) + goto unlock; + for (i =3D 0; i < nr_pages; i++) { + VM_BUG_ON_FOLIO(xas.xa_index !=3D entry.val + i, folio); + old =3D xas_load(&xas); + if (old && !xa_is_value(old)) { + VM_WARN_ON_ONCE_FOLIO(1, folio); + xas_set_err(&xas, -EEXIST); + goto unlock; + } + if (shadowp && xa_is_value(old) && !*shadowp) + *shadowp =3D old; + xas_store(&xas, folio); + xas_next(&xas); + } + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); +unlock: + xas_unlock_irq(&xas); + } while (xas_nomem(&xas, gfp)); =20 - if (shadowp) - *shadowp =3D shadow; + if (!xas_error(&xas)) + return 0; + + folio_clear_swapcache(folio); + folio_ref_sub(folio, nr_pages); + return xas_error(&xas); } =20 /** * __swap_cache_del_folio - Removes a folio from the swap cache. - * @ci: The locked swap cluster. * @folio: The folio. * @entry: The first swap entry that the folio corresponds to. * @shadow: shadow value to be filled in the swap cache. @@ -180,30 +221,23 @@ void swap_cache_add_folio(struct folio *folio, swp_en= try_t entry, void **shadowp * This won't put the folio's refcount. The caller has to do that. * * Context: Caller must ensure the folio is locked and in the swap cache - * using the index of @entry, and lock the cluster that holds the entries. + * using the index of @entry, and lock the swap cache xarray. 
*/ -void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *fo= lio, - swp_entry_t entry, void *shadow) +void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *= shadow) { - unsigned long old_tb, new_tb; - unsigned int ci_start, ci_off, ci_end; - unsigned long nr_pages =3D folio_nr_pages(folio); + long nr_pages =3D folio_nr_pages(folio); + XA_STATE(xas, &swap_cache, entry.val); + int i; =20 - VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) !=3D ci); VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio); =20 - new_tb =3D shadow_swp_to_tb(shadow); - ci_start =3D swp_cluster_offset(entry); - ci_end =3D ci_start + nr_pages; - ci_off =3D ci_start; - do { - /* If shadow is NULL, we sets an empty shadow */ - old_tb =3D __swap_table_xchg(ci, ci_off, new_tb); - WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || - swp_tb_to_folio(old_tb) !=3D folio); - } while (++ci_off < ci_end); + for (i =3D 0; i < nr_pages; i++) { + void *old =3D xas_store(&xas, shadow); + VM_WARN_ON_FOLIO(old !=3D folio, folio); + xas_next(&xas); + } =20 folio->swap.val =3D 0; folio_clear_swapcache(folio); @@ -223,12 +257,11 @@ void __swap_cache_del_folio(struct swap_cluster_info = *ci, struct folio *folio, */ void swap_cache_del_folio(struct folio *folio) { - struct swap_cluster_info *ci; swp_entry_t entry =3D folio->swap; =20 - ci =3D swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry)); - __swap_cache_del_folio(ci, folio, entry, NULL); - swap_cluster_unlock(ci); + xa_lock_irq(&swap_cache); + __swap_cache_del_folio(folio, entry, NULL); + xa_unlock_irq(&swap_cache); =20 put_swap_folio(folio, entry); folio_ref_sub(folio, folio_nr_pages(folio)); @@ -236,7 +269,6 @@ void swap_cache_del_folio(struct folio *folio) =20 /** * __swap_cache_replace_folio - Replace a folio in the swap cache. - * @ci: The locked swap cluster. * @old: The old folio to be replaced. * @new: The new folio. * @@ -246,39 +278,23 @@ void swap_cache_del_folio(struct folio *folio) * the starting offset to override all slots covered by the new folio. * * Context: Caller must ensure both folios are locked, and lock the - * cluster that holds the old folio to be replaced. + * swap cache xarray. */ -void __swap_cache_replace_folio(struct swap_cluster_info *ci, - struct folio *old, struct folio *new) +void __swap_cache_replace_folio(struct folio *old, struct folio *new) { swp_entry_t entry =3D new->swap; unsigned long nr_pages =3D folio_nr_pages(new); - unsigned int ci_off =3D swp_cluster_offset(entry); - unsigned int ci_end =3D ci_off + nr_pages; - unsigned long old_tb, new_tb; + XA_STATE(xas, &swap_cache, entry.val); + int i; =20 VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new)); VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new)); VM_WARN_ON_ONCE(!entry.val); =20 - /* Swap cache still stores N entries instead of a high-order entry */ - new_tb =3D folio_to_swp_tb(new); - do { - old_tb =3D __swap_table_xchg(ci, ci_off, new_tb); - WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) !=3D ol= d); - } while (++ci_off < ci_end); - - /* - * If the old folio is partially replaced (e.g., splitting a large - * folio, the old folio is shrunk, and new split sub folios replace - * the shrunk part), ensure the new folio doesn't overlap it. 
- */ - if (IS_ENABLED(CONFIG_DEBUG_VM) && - folio_order(old) !=3D folio_order(new)) { - ci_off =3D swp_cluster_offset(old->swap); - ci_end =3D ci_off + folio_nr_pages(old); - while (ci_off++ < ci_end) - WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) !=3D old); + for (i =3D 0; i < nr_pages; i++) { + void *old_entry =3D xas_store(&xas, new); + WARN_ON_ONCE(!old_entry || xa_is_value(old_entry) || old_entry !=3D old); + xas_next(&xas); } } =20 @@ -287,20 +303,20 @@ void __swap_cache_replace_folio(struct swap_cluster_i= nfo *ci, * @entry: The starting index entry. * @nr_ents: How many slots need to be cleared. * - * Context: Caller must ensure the range is valid, all in one single clust= er, - * not occupied by any folio, and lock the cluster. + * Context: Caller must ensure the range is valid and all in one single cl= uster, + * not occupied by any folio. */ -void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents) +void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents) { - struct swap_cluster_info *ci =3D __swap_entry_to_cluster(entry); - unsigned int ci_off =3D swp_cluster_offset(entry), ci_end; - unsigned long old; + XA_STATE(xas, &swap_cache, entry.val); + int i; =20 - ci_end =3D ci_off + nr_ents; - do { - old =3D __swap_table_xchg(ci, ci_off, null_to_swp_tb()); - WARN_ON_ONCE(swp_tb_is_folio(old)); - } while (++ci_off < ci_end); + xas_lock(&xas); + for (i =3D 0; i < nr_ents; i++) { + xas_store(&xas, NULL); + xas_next(&xas); + } + xas_unlock(&xas); } =20 /* @@ -480,7 +496,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entr= y, gfp_t gfp_mask, if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry)) goto fail_unlock; =20 - swap_cache_add_folio(new_folio, entry, &shadow); + /* May fail (-ENOMEM) if XArray node allocation failed. */ + if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &= shadow)) + goto fail_unlock; + memcg1_swapin(entry, 1); =20 if (shadow) diff --git a/mm/swap_table.h b/mm/swap_table.h index ea244a57a5b7a..ad2cb2ef46903 100644 --- a/mm/swap_table.h +++ b/mm/swap_table.h @@ -13,71 +13,6 @@ struct swap_table { =20 #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) =3D=3D PAGE_SIZE) =20 -/* - * A swap table entry represents the status of a swap slot on a swap - * (physical or virtual) device. The swap table in each cluster is a - * 1:1 map of the swap slots in this cluster. - * - * Each swap table entry could be a pointer (folio), a XA_VALUE - * (shadow), or NULL. - */ - -/* - * Helpers for casting one type of info into a swap table entry. - */ -static inline unsigned long null_to_swp_tb(void) -{ - BUILD_BUG_ON(sizeof(unsigned long) !=3D sizeof(atomic_long_t)); - return 0; -} - -static inline unsigned long folio_to_swp_tb(struct folio *folio) -{ - BUILD_BUG_ON(sizeof(unsigned long) !=3D sizeof(void *)); - return (unsigned long)folio; -} - -static inline unsigned long shadow_swp_to_tb(void *shadow) -{ - BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=3D - BITS_PER_BYTE * sizeof(unsigned long)); - VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow)); - return (unsigned long)shadow; -} - -/* - * Helpers for swap table entry type checking. - */ -static inline bool swp_tb_is_null(unsigned long swp_tb) -{ - return !swp_tb; -} - -static inline bool swp_tb_is_folio(unsigned long swp_tb) -{ - return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb); -} - -static inline bool swp_tb_is_shadow(unsigned long swp_tb) -{ - return xa_is_value((void *)swp_tb); -} - -/* - * Helpers for retrieving info from swap table. 
- */ -static inline struct folio *swp_tb_to_folio(unsigned long swp_tb) -{ - VM_WARN_ON(!swp_tb_is_folio(swp_tb)); - return (void *)swp_tb; -} - -static inline void *swp_tb_to_shadow(unsigned long swp_tb) -{ - VM_WARN_ON(!swp_tb_is_shadow(swp_tb)); - return (void *)swp_tb; -} - /* * Helpers for accessing or modifying the swap table of a cluster, * the swap cluster must be locked. @@ -92,17 +27,6 @@ static inline void __swap_table_set(struct swap_cluster_= info *ci, atomic_long_set(&table[off], swp_tb); } =20 -static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci, - unsigned int off, unsigned long swp_tb) -{ - atomic_long_t *table =3D rcu_dereference_protected(ci->table, true); - - lockdep_assert_held(&ci->lock); - VM_WARN_ON_ONCE(off >=3D SWAPFILE_CLUSTER); - /* Ordering is guaranteed by cluster lock, relax */ - return atomic_long_xchg_relaxed(&table[off], swp_tb); -} - static inline unsigned long __swap_table_get(struct swap_cluster_info *ci, unsigned int off) { @@ -122,7 +46,7 @@ static inline unsigned long swap_table_get(struct swap_c= luster_info *ci, =20 rcu_read_lock(); table =3D rcu_dereference(ci->table); - swp_tb =3D table ? atomic_long_read(&table[off]) : null_to_swp_tb(); + swp_tb =3D table ? atomic_long_read(&table[off]) : 0; rcu_read_unlock(); =20 return swp_tb; diff --git a/mm/swapfile.c b/mm/swapfile.c index 46d2008e4b996..cacfafa9a540d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -474,7 +474,7 @@ static void swap_cluster_free_table(struct swap_cluster= _info *ci) lockdep_assert_held(&ci->lock); VM_WARN_ON_ONCE(!cluster_is_empty(ci)); for (ci_off =3D 0; ci_off < SWAPFILE_CLUSTER; ci_off++) - VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off))); + VM_WARN_ON_ONCE(__swap_table_get(ci, ci_off)); table =3D (void *)rcu_dereference_protected(ci->table, true); rcu_assign_pointer(ci->table, NULL); =20 @@ -843,26 +843,6 @@ static bool cluster_scan_range(struct swap_info_struct= *si, return true; } =20 -/* - * Currently, the swap table is not used for count tracking, just - * do a sanity check here to ensure nothing leaked, so the swap - * table should be empty upon freeing. 
- */ -static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci, - unsigned int start, unsigned int nr) -{ - unsigned int ci_off =3D start % SWAPFILE_CLUSTER; - unsigned int ci_end =3D ci_off + nr; - unsigned long swp_tb; - - if (IS_ENABLED(CONFIG_DEBUG_VM)) { - do { - swp_tb =3D __swap_table_get(ci, ci_off); - VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); - } while (++ci_off < ci_end); - } -} - static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_c= luster_info *ci, unsigned int start, unsigned char usage, unsigned int order) @@ -882,7 +862,6 @@ static bool cluster_alloc_range(struct swap_info_struct= *si, struct swap_cluster ci->order =3D order; =20 memset(si->swap_map + start, usage, nr_pages); - swap_cluster_assert_table_empty(ci, start, nr_pages); swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; =20 @@ -1275,7 +1254,7 @@ static void swap_range_free(struct swap_info_struct *= si, unsigned long offset, swap_slot_free_notify(si->bdev, offset); offset++; } - __swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries); + swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries); =20 /* * Make sure that try_to_unuse() observes si->inuse_pages reaching 0 @@ -1423,6 +1402,7 @@ int folio_alloc_swap(struct folio *folio) unsigned int order =3D folio_order(folio); unsigned int size =3D 1 << order; swp_entry_t entry =3D {}; + int err; =20 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio); @@ -1457,19 +1437,23 @@ int folio_alloc_swap(struct folio *folio) } =20 /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */ - if (mem_cgroup_try_charge_swap(folio, entry)) + if (mem_cgroup_try_charge_swap(folio, entry)) { + err =3D -ENOMEM; goto out_free; + } =20 if (!entry.val) return -ENOMEM; =20 - swap_cache_add_folio(folio, entry, NULL); + err =3D swap_cache_add_folio(folio, entry, __GFP_HIGH | __GFP_NOMEMALLOC = | __GFP_NOWARN, NULL); + if (err) + goto out_free; =20 return 0; =20 out_free: put_swap_folio(folio, entry); - return -ENOMEM; + return err; } =20 static struct swap_info_struct *_swap_info_get(swp_entry_t entry) @@ -1729,7 +1713,6 @@ static void swap_entries_free(struct swap_info_struct= *si, =20 mem_cgroup_uncharge_swap(entry, nr_pages); swap_range_free(si, offset, nr_pages); - swap_cluster_assert_table_empty(ci, offset, nr_pages); =20 if (!ci->count) free_cluster(si, ci); @@ -4057,9 +4040,9 @@ static int __init swapfile_init(void) swapfile_maximum_size =3D arch_max_swapfile_size(); =20 /* - * Once a cluster is freed, it's swap table content is read - * only, and all swap cache readers (swap_cache_*) verifies - * the content before use. So it's safe to use RCU slab here. + * Once a cluster is freed, it's swap table content is read only, and + * all swap table readers verify the content before use. So it's safe to + * use RCU slab here. 
*/ if (!SWP_TABLE_USE_PAGE) swap_table_cachep =3D kmem_cache_create("swap_table", diff --git a/mm/vmscan.c b/mm/vmscan.c index 614ccf39fe3fa..558ff7f413786 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -707,13 +707,12 @@ static int __remove_mapping(struct address_space *map= ping, struct folio *folio, { int refcount; void *shadow =3D NULL; - struct swap_cluster_info *ci; =20 BUG_ON(!folio_test_locked(folio)); BUG_ON(mapping !=3D folio_mapping(folio)); =20 if (folio_test_swapcache(folio)) { - ci =3D swap_cluster_get_and_lock_irq(folio); + swap_cache_lock_irq(); } else { spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); @@ -758,9 +757,9 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, =20 if (reclaimed && !mapping_exiting(mapping)) shadow =3D workingset_eviction(folio, target_memcg); - __swap_cache_del_folio(ci, folio, swap, shadow); + __swap_cache_del_folio(folio, swap, shadow); memcg1_swapout(folio, swap); - swap_cluster_unlock_irq(ci); + swap_cache_unlock_irq(); put_swap_folio(folio, swap); } else { void (*free_folio)(struct folio *); @@ -799,7 +798,7 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, =20 cannot_free: if (folio_test_swapcache(folio)) { - swap_cluster_unlock_irq(ci); + swap_cache_unlock_irq(); } else { xa_unlock_irq(&mapping->i_pages); spin_unlock(&mapping->host->i_lock); --=20 2.47.3
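As a reading aid for the interim scheme described in the commit message (one flat swap cache map keyed by the swap entry value, guarded by a single global lock, where each slot holds either nothing, a folio pointer, or a tagged shadow value), below is a small standalone userspace C model. It is not kernel code: the model_* names, the fixed-size array, and the pthread mutex are stand-ins chosen purely for illustration, and it only mirrors the rough add/lookup/delete flow of swap_cache_add_folio(), swap_cache_get_folio() and swap_cache_del_folio(). The real implementation in mm/swap_state.c uses an XArray, folio reference counting, and xa_is_value()-tagged shadows.

/*
 * model_swap_cache.c -- toy userspace model, NOT kernel code.
 * Build with: cc -pthread model_swap_cache.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define MODEL_SLOTS 64		/* pretend swap space of 64 slots */
#define SHADOW_TAG  1UL		/* low bit set marks a shadow, like xa_is_value() */

struct model_folio {
	unsigned long swap;	/* first slot this folio is bound to */
	int nr_pages;
};

/* One flat map plus one global lock -- the interim scheme this patch adopts. */
static void *model_cache[MODEL_SLOTS];
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;

static int is_shadow(const void *p)
{
	return ((uintptr_t)p & SHADOW_TAG) != 0;
}

/* Loose analogue of swap_cache_add_folio(): fill nr_pages slots, report old shadow. */
static int model_add_folio(struct model_folio *folio, unsigned long entry, void **shadowp)
{
	int err = 0;

	pthread_mutex_lock(&model_lock);
	for (int i = 0; i < folio->nr_pages; i++) {
		void *old = model_cache[entry + i];

		if (old && !is_shadow(old)) {	/* slot already holds a folio */
			err = -1;		/* rollback omitted for brevity */
			break;
		}
		if (shadowp && is_shadow(old) && !*shadowp)
			*shadowp = old;
		model_cache[entry + i] = folio;
	}
	if (!err)
		folio->swap = entry;
	pthread_mutex_unlock(&model_lock);
	return err;
}

/* Loose analogue of swap_cache_get_folio(): NULL for empty or shadow slots. */
static struct model_folio *model_get_folio(unsigned long entry)
{
	void *val;

	pthread_mutex_lock(&model_lock);
	val = model_cache[entry];
	pthread_mutex_unlock(&model_lock);
	return (!val || is_shadow(val)) ? NULL : val;
}

/* Loose analogue of swap_cache_del_folio(): replace the folio with a shadow. */
static void model_del_folio(struct model_folio *folio, void *shadow)
{
	pthread_mutex_lock(&model_lock);
	for (int i = 0; i < folio->nr_pages; i++)
		model_cache[folio->swap + i] = shadow;
	folio->swap = 0;
	pthread_mutex_unlock(&model_lock);
}

int main(void)
{
	struct model_folio f = { .nr_pages = 4 };
	void *shadow = NULL;

	if (model_add_folio(&f, 8, &shadow))
		return 1;
	printf("slot 10 cached: %s\n", model_get_folio(10) ? "yes" : "no");
	model_del_folio(&f, (void *)(uintptr_t)((42UL << 1) | SHADOW_TAG));
	printf("slot 10 cached after delete: %s\n", model_get_folio(10) ? "yes" : "no");
	return 0;
}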