From: Nhat Pham
To: kasong@tencent.com
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org, apopple@nvidia.com,
 axelrasmussen@google.com, baohua@kernel.org, baolin.wang@linux.alibaba.com,
 bhe@redhat.com, byungchul@sk.com, cgroups@vger.kernel.org,
 chengming.zhou@linux.dev, chrisl@kernel.org, corbet@lwn.net,
 david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org,
 hughd@google.com, jannh@google.com, joshua.hahnjy@gmail.com,
 lance.yang@linux.dev, lenb@kernel.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-pm@vger.kernel.org,
 lorenzo.stoakes@oracle.com, matthew.brost@intel.com, mhocko@suse.com,
 muchun.song@linux.dev, npache@redhat.com, nphamcs@gmail.com,
 pavel@kernel.org, peterx@redhat.com, peterz@infradead.org,
 pfalcato@suse.de, rafael@kernel.org, rakie.kim@sk.com,
 roman.gushchin@linux.dev, rppt@kernel.org, ryan.roberts@arm.com,
 shakeel.butt@linux.dev, shikemeng@huaweicloud.com, surenb@google.com,
 tglx@kernel.org, vbabka@suse.cz, weixugc@google.com,
 ying.huang@linux.alibaba.com, yosry.ahmed@linux.dev, yuanchu@google.com,
 zhengqi.arch@bytedance.com, ziy@nvidia.com, kernel-team@meta.com,
 riel@surriel.com, haowenchao22@gmail.com
Subject: [RFC PATCH 1/5] mm, swap: add virtual swap device infrastructure
Date: Thu, 28 May 2026 14:29:25 -0700
Message-ID: <20260528212955.1912856-2-nphamcs@gmail.com>
In-Reply-To: <20260528212955.1912856-1-nphamcs@gmail.com>
References: <20260528212955.1912856-1-nphamcs@gmail.com>

Create a massive virtual swap device at boot, along with the dynamic
cluster infrastructure that the rest of the vswap layer is built on:

- swap_cluster_info_dynamic: per-cluster dynamic info kept in an
  xarray, allowing arbitrary-size devices without the static
  cluster_info[] array.
- virtual_table: a per-slot side table for vswap backend metadata
  (tag-encoded in low bits). The field itself is added in the next
  patch; this commit only introduces the dynamic cluster container
  that will hold it.
- The size of the vswap device is ALIGN_DOWN(UINT_MAX, SWAPFILE_CLUSTER)
  pages, capped at swapfile_maximum_size.

Gated by a new CONFIG_VSWAP (depends on SWAP && 64BIT). For now, the
vswap device cannot be swapon'd or swapoff'd; it is created
unconditionally at boot when CONFIG_VSWAP=y and lives for the lifetime
of the kernel.

The SWP_VSWAP flag and swap_is_vswap() helper let hot paths skip
per-device bookkeeping that doesn't apply (avail-list management,
percpu_ref get/put, hibernation target lookup, etc.).

This patch is pure scaffolding: it introduces the device, the
dynamic-cluster machinery, and the general shape of a vswap allocator
(with sanity checks), but does not hook the vswap device into any
allocation path. folio_alloc_swap will not produce vswap entries until
a subsequent patch wires it in. Backends (zswap, zero, physical disk)
and the vswap-aware swap-out / swap-in / writeback paths arrive in
subsequent patches.

Suggested-by: Kairui Song
Co-developed-by: Kairui Song
Signed-off-by: Kairui Song
Signed-off-by: Nhat Pham
---
 MAINTAINERS          |   1 +
 include/linux/swap.h |   4 +
 mm/Kconfig           |  10 ++
 mm/page_io.c         |  18 ++-
 mm/swap.h            |  46 ++++++--
 mm/swap_state.c      |  43 ++++---
 mm/swap_table.h      |   2 +
 mm/swapfile.c        | 264 +++++++++++++++++++++++++++++++++++++++----
 mm/vswap.h           |  29 +++++
 mm/zswap.c           |  10 +-
 10 files changed, 375 insertions(+), 52 deletions(-)
 create mode 100644 mm/vswap.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 9be179722d42..e96bd0bf6307 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17041,6 +17041,7 @@ F:	mm/swap.h
 F:	mm/swap_table.h
 F:	mm/swap_state.c
 F:	mm/swapfile.c
+F:	mm/vswap.h
 
 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
 M:	Andrew Morton
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6d72778e6cc3..ee9b1e76b058 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -214,6 +214,7 @@ enum {
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
 	SWP_HIBERNATION = (1 << 13),	/* pinned for hibernation */
+	SWP_VSWAP = (1 << 14),	/* virtual swap device */
 					/* add others here before... */
 };
 
@@ -282,6 +283,7 @@ struct swap_info_struct {
 	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
 	struct plist_node avail_list; /* entry in swap_avail_head */
+	struct xarray cluster_info_pool; /* Xarray for vswap dynamic cluster info */
 };
 
 static inline swp_entry_t page_swap_entry(struct page *page)
@@ -473,6 +475,8 @@ void swap_free_hibernation_slot(swp_entry_t entry);
 
 static inline void put_swap_device(struct swap_info_struct *si)
 {
+	if (si->flags & SWP_VSWAP)
+		return;
 	percpu_ref_put(&si->users);
 }
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 776b67c66e82..fc395ae3dde8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,6 +19,16 @@ menuconfig SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer. If unsure say Y.
 
+config VSWAP
+	bool "Virtual swap device"
+	depends on SWAP && 64BIT
+	help
+	  Adds a virtual swap layer that decouples swap entries in page
+	  tables from physical backing storage. Swap entries are allocated
+	  from a virtual swap device and can be backed by zswap, a physical
+	  swapfile, or kept in memory -- with the backing changeable at
+	  runtime without invalidating page table entries.
+
 config ZSWAP
 	bool "Compressed cache for swap pages"
 	depends on SWAP
diff --git a/mm/page_io.c b/mm/page_io.c
index f2d8fe7fd057..8126be6e4cfb 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -295,8 +295,7 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
 	}
 	rcu_read_unlock();
 
-	__swap_writepage(folio, swap_plug);
-	return 0;
+	return __swap_writepage(folio, swap_plug);
 out_unlock:
 	folio_unlock(folio);
 	return ret;
@@ -458,11 +457,18 @@ static void swap_writepage_bdev_async(struct folio *folio,
 	submit_bio(bio);
 }
 
-void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
+int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
 {
 	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
 
 	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+	if (sis->flags & SWP_VSWAP) {
+		/* Prevent the page from getting reclaimed. */
+		folio_set_dirty(folio);
+		return AOP_WRITEPAGE_ACTIVATE;
+	}
+
 	/*
 	 * ->flags can be updated non-atomically,
 	 * but that will never affect SWP_FS_OPS, so the data_race
@@ -479,6 +485,7 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
 		swap_writepage_bdev_sync(folio, sis);
 	else
 		swap_writepage_bdev_async(folio, sis);
+	return 0;
 }
 
 void swap_write_unplug(struct swap_iocb *sio)
@@ -684,6 +691,11 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 	if (zswap_load(folio) != -ENOENT)
 		goto finish;
 
+	if (unlikely(sis->flags & SWP_VSWAP)) {
+		folio_unlock(folio);
+		goto finish;
+	}
+
 	/* We have to read from slower devices. Increase zswap protection. */
 	zswap_folio_swapin(folio);
 
diff --git a/mm/swap.h b/mm/swap.h
index 81c06aae7ccd..479ee5871cb9 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -65,6 +65,13 @@ struct swap_cluster_info {
 	struct list_head list;
 };
 
+struct swap_cluster_info_dynamic {
+	struct swap_cluster_info ci;	/* Underlying cluster info */
+	unsigned int index;		/* for cluster_index() */
+	struct rcu_head rcu;		/* For kfree_rcu deferred free */
+	/* Backend pointers (virtual_table) added in a later patch. */
+};
+
 /* All on-list cluster must have a non-zero flag. */
 enum swap_cluster_flags {
 	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
@@ -75,6 +82,7 @@ enum swap_cluster_flags {
 	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
 	CLUSTER_FLAG_FULL,
 	CLUSTER_FLAG_DISCARD,
+	CLUSTER_FLAG_DEAD, /* Vswap dynamic cluster pending kfree_rcu */
 	CLUSTER_FLAG_MAX,
 };
 
@@ -108,9 +116,19 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
 static inline struct swap_cluster_info *__swap_offset_to_cluster(
 		struct swap_info_struct *si, pgoff_t offset)
 {
+	unsigned int cluster_idx = offset / SWAPFILE_CLUSTER;
+
 	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
 	VM_WARN_ON_ONCE(offset >= roundup(si->max, SWAPFILE_CLUSTER));
-	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+
+	if (si->flags & SWP_VSWAP) {
+		struct swap_cluster_info_dynamic *ci_dyn;
+
+		ci_dyn = xa_load(&si->cluster_info_pool, cluster_idx);
+		return ci_dyn ? &ci_dyn->ci : NULL;
+	}
+
+	return &si->cluster_info[cluster_idx];
 }
 
 static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
@@ -122,7 +140,7 @@ static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entr
 static __always_inline struct swap_cluster_info *__swap_cluster_lock(
 		struct swap_info_struct *si, unsigned long offset, bool irq)
 {
-	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
+	struct swap_cluster_info *ci;
 
 	/*
 	 * Nothing modifies swap cache in an IRQ context. All access to
@@ -135,10 +153,24 @@ static __always_inline struct swap_cluster_info *__swap_cluster_lock(
 	 */
 	VM_WARN_ON_ONCE(!in_task());
 	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
-	if (irq)
-		spin_lock_irq(&ci->lock);
-	else
-		spin_lock(&ci->lock);
+
+	rcu_read_lock();
+	ci = __swap_offset_to_cluster(si, offset);
+	if (ci) {
+		if (irq)
+			spin_lock_irq(&ci->lock);
+		else
+			spin_lock(&ci->lock);
+
+		if (ci->flags == CLUSTER_FLAG_DEAD) {
+			if (irq)
+				spin_unlock_irq(&ci->lock);
+			else
+				spin_unlock(&ci->lock);
+			ci = NULL;
+		}
+	}
+	rcu_read_unlock();
 	return ci;
 }
 
@@ -250,7 +282,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
 }
 void swap_write_unplug(struct swap_iocb *sio);
 int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
-void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
+int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 
 /* linux/mm/swap_state.c */
 extern struct address_space swap_space __read_mostly;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 04f5ce992401..b063c47138c5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -90,8 +90,10 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
 	struct folio *folio;
 
 	for (;;) {
+		rcu_read_lock();
 		swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
 					swp_cluster_offset(entry));
+		rcu_read_unlock();
 		if (!swp_tb_is_folio(swp_tb))
 			return NULL;
 		folio = swp_tb_to_folio(swp_tb);
@@ -113,8 +115,10 @@ bool swap_cache_has_folio(swp_entry_t entry)
 {
 	unsigned long swp_tb;
 
+	rcu_read_lock();
 	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
 				swp_cluster_offset(entry));
+	rcu_read_unlock();
 	return swp_tb_is_folio(swp_tb);
 }
 
@@ -130,8 +134,10 @@ void *swap_cache_get_shadow(swp_entry_t entry)
 {
 	unsigned long swp_tb;
 
+	rcu_read_lock();
 	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
 				swp_cluster_offset(entry));
+	rcu_read_unlock();
 	if (swp_tb_is_shadow(swp_tb))
 		return swp_tb_to_shadow(swp_tb);
 	return NULL;
@@ -400,14 +406,16 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
  * -ENOENT / -EEXIST: Target swap entry is unavailable or cached, the caller
  * should abort or try to use the cached folio instead
  */
-static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
-					swp_entry_t targ_entry, gfp_t gfp,
+static struct folio *__swap_cache_alloc(swp_entry_t targ_entry, gfp_t gfp,
 					unsigned int order, struct vm_fault *vmf,
 					struct mempolicy *mpol, pgoff_t ilx)
 {
 	int err;
 	swp_entry_t entry;
 	struct folio *folio;
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *si = __swap_entry_to_info(targ_entry);
+	unsigned long offset = swp_offset(targ_entry);
 	void *shadow = NULL;
 	unsigned short memcg_id;
 	unsigned long address, nr_pages = 1UL << order;
@@ -417,9 +425,12 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
 	entry.val = round_down(targ_entry.val, nr_pages);
 
 	/* Check if the slot and range are available, skip allocation if not */
-	spin_lock(&ci->lock);
-	err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL, NULL);
-	spin_unlock(&ci->lock);
+	err = -ENOENT;
+	ci = swap_cluster_lock(si, offset);
+	if (ci) {
+		err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL, NULL);
+		swap_cluster_unlock(ci);
+	}
 	if (unlikely(err))
 		return ERR_PTR(err);
 
@@ -440,10 +451,13 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
 		return ERR_PTR(-ENOMEM);
 
 	/* Double check the range is still not in conflict */
-	spin_lock(&ci->lock);
-	err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow, &memcg_id);
+	err = -ENOENT;
+	ci = swap_cluster_lock(si, offset);
+	if (ci)
+		err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow, &memcg_id);
 	if (unlikely(err)) {
-		spin_unlock(&ci->lock);
+		if (ci)
+			swap_cluster_unlock(ci);
 		folio_put(folio);
 		return ERR_PTR(err);
 	}
@@ -451,13 +465,14 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
 	__swap_cache_do_add_folio(ci, folio, entry);
-	spin_unlock(&ci->lock);
+	swap_cluster_unlock(ci);
 
 	if (mem_cgroup_swapin_charge_folio(folio, memcg_id,
 					   vmf ? vmf->vma->vm_mm : NULL, gfp)) {
-		spin_lock(&ci->lock);
+		/* The folio pins the cluster */
+		ci = swap_cluster_lock(si, offset);
 		__swap_cache_do_del_folio(ci, folio, entry, shadow);
-		spin_unlock(&ci->lock);
+		swap_cluster_unlock(ci);
 		folio_unlock(folio);
 		/* nr_pages refs from swap cache, 1 from allocation */
 		folio_put_refs(folio, nr_pages + 1);
@@ -501,9 +516,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
 {
 	int order, err;
 	struct folio *ret;
-	struct swap_cluster_info *ci;
 
-	ci = __swap_entry_to_cluster(targ_entry);
 	order = highest_order(orders);
 
 	/* orders must be non-zero, and must not exceed cluster size. */
@@ -511,12 +524,12 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
 		return ERR_PTR(-EINVAL);
 
 	do {
-		ret = __swap_cache_alloc(ci, targ_entry, gfp, order,
+		ret = __swap_cache_alloc(targ_entry, gfp, order,
 					 vmf, mpol, ilx);
 		if (!IS_ERR(ret))
 			break;
 		err = PTR_ERR(ret);
-		if (!order || (err && err != -EBUSY && err != -ENOMEM))
+		if (err && err != -EBUSY && err != -ENOMEM)
 			break;
 		count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
 		order = next_order(&orders, order);
diff --git a/mm/swap_table.h b/mm/swap_table.h
index e6613e62f8d0..fd7f0fb9836a 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -255,6 +255,8 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
 	unsigned long swp_tb;
 
 	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+	if (!ci)
+		return SWP_TB_NULL;
 
 	rcu_read_lock();
 	table = rcu_dereference(ci->table);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a9a1e477fec9..f6d2529159ff 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,10 +42,12 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
 #include "swap_table.h"
+#include "vswap.h"
 #include "internal.h"
 #include "swap.h"
 
@@ -401,6 +403,8 @@ static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
 static inline unsigned int cluster_index(struct swap_info_struct *si,
 					 struct swap_cluster_info *ci)
 {
+	if (si->flags & SWP_VSWAP)
+		return container_of(ci, struct swap_cluster_info_dynamic, ci)->index;
 	return ci - si->cluster_info;
 }
 
@@ -734,6 +738,22 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
 		return;
 	}
 
+	if (si->flags & SWP_VSWAP) {
+		struct swap_cluster_info_dynamic *ci_dyn;
+
+		ci_dyn = container_of(ci, struct swap_cluster_info_dynamic, ci);
+		if (ci->flags != CLUSTER_FLAG_NONE) {
+			spin_lock(&si->lock);
+			list_del(&ci->list);
+			spin_unlock(&si->lock);
+		}
+		swap_cluster_free_table(ci);
+		xa_erase(&si->cluster_info_pool, ci_dyn->index);
+		ci->flags = CLUSTER_FLAG_DEAD;
+		kfree_rcu(ci_dyn, rcu);
+		return;
+	}
+
 	__free_cluster(si, ci);
 }
 
@@ -836,14 +856,21 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
  * stolen by a lower order). @usable will be set to false if that happens.
  */
 static bool cluster_reclaim_range(struct swap_info_struct *si,
-				  struct swap_cluster_info *ci,
+				  struct swap_cluster_info **pcip,
 				  unsigned long start, unsigned int order,
 				  bool *usable)
 {
+	struct swap_cluster_info *ci = *pcip;
 	unsigned int nr_pages = 1 << order;
 	unsigned long offset = start, end = start + nr_pages;
 	unsigned long swp_tb;
 
+	/*
+	 * Take RCU read lock before releasing the cluster lock to keep ci
+	 * alive -- for vswap dynamic clusters, ci is freed via kfree_rcu
+	 * and the grace period could otherwise elapse in the window.
+	 */
+	rcu_read_lock();
 	spin_unlock(&ci->lock);
 	do {
 		swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
@@ -853,7 +880,15 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 		if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0)
 			break;
 	} while (++offset < end);
-	spin_lock(&ci->lock);
+	rcu_read_unlock();
+
+	/* Re-lookup: dynamic cluster may have been freed while lock was dropped */
+	ci = swap_cluster_lock(si, start);
+	*pcip = ci;
+	if (!ci) {
+		*usable = false;
+		return false;
+	}
 
 	/*
 	 * We just dropped ci->lock so cluster could be used by another
@@ -984,7 +1019,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
 			continue;
 		if (need_reclaim) {
-			ret = cluster_reclaim_range(si, ci, offset, order, &usable);
+			ret = cluster_reclaim_range(si, &ci, offset, order,
+						    &usable);
 			if (!usable)
 				goto out;
 			if (cluster_is_empty(ci))
@@ -1002,8 +1038,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 			break;
 	}
 out:
-	relocate_cluster(si, ci);
-	swap_cluster_unlock(ci);
+	if (ci) {
+		relocate_cluster(si, ci);
+		swap_cluster_unlock(ci);
+	}
 	if (si->flags & SWP_SOLIDSTATE) {
 		this_cpu_write(percpu_swap_cluster.offset[order], next);
 		this_cpu_write(percpu_swap_cluster.si[order], si);
@@ -1035,6 +1073,41 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
 	return found;
 }
 
+static unsigned int alloc_swap_scan_dynamic(struct swap_info_struct *si,
+					    struct folio *folio)
+{
+	struct swap_cluster_info_dynamic *ci_dyn;
+	struct swap_cluster_info *ci;
+	unsigned long offset;
+
+	WARN_ON(!(si->flags & SWP_VSWAP));
+
+	ci_dyn = kzalloc(sizeof(*ci_dyn), GFP_ATOMIC);
+	if (!ci_dyn)
+		return SWAP_ENTRY_INVALID;
+
+	spin_lock_init(&ci_dyn->ci.lock);
+	INIT_LIST_HEAD(&ci_dyn->ci.list);
+
+	if (swap_cluster_alloc_table(&ci_dyn->ci, GFP_ATOMIC)) {
+		kfree(ci_dyn);
+		return SWAP_ENTRY_INVALID;
+	}
+
+	if (xa_alloc(&si->cluster_info_pool, &ci_dyn->index, ci_dyn,
+		     XA_LIMIT(1, DIV_ROUND_UP(si->max, SWAPFILE_CLUSTER) - 1),
+		     GFP_ATOMIC)) {
+		swap_cluster_free_table(&ci_dyn->ci);
+		kfree(ci_dyn);
+		return SWAP_ENTRY_INVALID;
+	}
+
+	ci = &ci_dyn->ci;
+	spin_lock(&ci->lock);
+	offset = cluster_offset(si, ci);
+	return alloc_swap_scan_cluster(si, ci, folio, offset);
+}
+
 static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 {
 	long to_scan = 1;
@@ -1057,7 +1130,9 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 				spin_unlock(&ci->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
 								   TTRS_ANYWAY);
-				spin_lock(&ci->lock);
+				ci = swap_cluster_lock(si, offset);
+				if (!ci)
+					goto next;
 				if (nr_reclaim) {
 					offset += abs(nr_reclaim);
 					continue;
@@ -1071,6 +1146,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 			relocate_cluster(si, ci);
 
 		swap_cluster_unlock(ci);
+next:
 		if (to_scan <= 0)
 			break;
 		cond_resched();
@@ -1141,6 +1217,12 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
 			goto done;
 	}
 
+	if (si->flags & SWP_VSWAP) {
+		found = alloc_swap_scan_dynamic(si, folio);
+		if (found)
+			goto done;
+	}
+
 	if (!(si->flags & SWP_PAGE_DISCARD)) {
 		found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
 		if (found)
@@ -1259,6 +1341,13 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 			goto skip;
 	}
 
+	/*
+	 * Keep vswap off the avail list -- it is not allocated from by
+	 * the physical swap allocator (swap_alloc_fast/slow).
+	 */
+	if (swap_is_vswap(si))
+		goto skip;
+
 	plist_add(&si->avail_list, &swap_avail_head);
 
 skip:
@@ -1341,6 +1430,10 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 
 static bool get_swap_device_info(struct swap_info_struct *si)
 {
+	/* vswap device is always alive -- no ref counting needed */
+	if (swap_is_vswap(si))
+		return true;
+
 	if (!percpu_ref_tryget_live(&si->users))
 		return false;
 	/*
@@ -1376,11 +1469,11 @@ static bool swap_alloc_fast(struct folio *folio)
 		return false;
 
 	ci = swap_cluster_lock(si, offset);
-	if (cluster_is_usable(ci, order)) {
+	if (ci && cluster_is_usable(ci, order)) {
 		if (cluster_is_empty(ci))
 			offset = cluster_offset(si, ci);
 		alloc_swap_scan_cluster(si, ci, folio, offset);
-	} else {
+	} else if (ci) {
 		swap_cluster_unlock(ci);
 	}
 
@@ -1484,6 +1577,7 @@ int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
 	if (!si)
 		return 0;
 
+	/* Entry is in use (being faulted in), so its cluster is alive. */
 	ci = __swap_offset_to_cluster(si, offset);
 	ret = swap_extend_table_alloc(si, ci, gfp);
 
@@ -1711,6 +1805,7 @@ int folio_alloc_swap(struct folio *folio)
 	unsigned int order = folio_order(folio);
 	unsigned int size = 1 << order;
 
+	VM_WARN_ON_FOLIO(folio_test_swapcache(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
 
@@ -1873,7 +1968,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	return NULL;
 put_out:
 	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
-	percpu_ref_put(&si->users);
+	if (!swap_is_vswap(si))
+		percpu_ref_put(&si->users);
 	return NULL;
 }
 
@@ -2005,6 +2101,7 @@ static bool folio_maybe_swapped(struct folio *folio)
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 
+	/* Folio is locked and in swap cache, so ci->count > 0: cluster is alive. */
 	ci = __swap_entry_to_cluster(entry);
 	ci_off = swp_cluster_offset(entry);
 	ci_end = ci_off + folio_nr_pages(folio);
@@ -2142,9 +2239,9 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 		pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
 		if (pcp_si == si && pcp_offset) {
 			ci = swap_cluster_lock(si, pcp_offset);
-			if (cluster_is_usable(ci, 0))
+			if (ci && cluster_is_usable(ci, 0))
 				offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
-			else
+			else if (ci)
 				swap_cluster_unlock(ci);
 		}
 		if (!offset)
@@ -2192,6 +2289,9 @@ static int __find_hibernation_swap_type(dev_t device, sector_t offset)
 
 		if (!(sis->flags & SWP_WRITEOK))
 			continue;
+		/* vswap has no bdev -- never a hibernation target */
+		if (swap_is_vswap(sis))
+			continue;
 
 		if (device == sis->bdev->bd_dev) {
 			struct swap_extent *se = first_se(sis);
@@ -2379,6 +2479,9 @@ int find_first_swap(dev_t *device)
 
 		if (!(sis->flags & SWP_WRITEOK))
 			continue;
+		/* vswap has no bdev -- never a hibernation target */
+		if (swap_is_vswap(sis))
+			continue;
 		*device = sis->bdev->bd_dev;
 		spin_unlock(&swap_lock);
 		return type;
@@ -2590,8 +2693,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 						   &vmf);
 		}
 		if (!folio) {
+			rcu_read_lock();
 			swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
 						swp_cluster_offset(entry));
+			rcu_read_unlock();
 			if (swp_tb_get_count(swp_tb) <= 0)
 				continue;
 			return -ENOMEM;
@@ -2737,8 +2842,10 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
 	 * allocations from this area (while holding swap_lock).
 	 */
 	for (i = prev + 1; i < si->max; i++) {
+		rcu_read_lock();
 		swp_tb = swap_table_get(__swap_offset_to_cluster(si, i),
 					i % SWAPFILE_CLUSTER);
+		rcu_read_unlock();
 		if (!swp_tb_is_null(swp_tb) && !swp_tb_is_bad(swp_tb))
 			break;
 		if ((i % LATENCY_LIMIT) == 0)
@@ -2977,6 +3084,11 @@ static int setup_swap_extents(struct swap_info_struct *sis,
 	struct inode *inode = mapping->host;
 	int ret;
 
+	if (sis->flags & SWP_VSWAP) {
+		*span = 0;
+		return 0;
+	}
+
 	if (S_ISBLK(inode->i_mode)) {
 		ret = add_swap_extent(sis, 0, sis->max, 0);
 		*span = sis->pages;
@@ -3001,15 +3113,22 @@ static int setup_swap_extents(struct swap_info_struct *sis,
 
 static void _enable_swap_info(struct swap_info_struct *si)
 {
-	atomic_long_add(si->pages, &nr_swap_pages);
-	total_swap_pages += si->pages;
+	if (!swap_is_vswap(si)) {
+		atomic_long_add(si->pages, &nr_swap_pages);
+		total_swap_pages += si->pages;
+	}
 
 	assert_spin_locked(&swap_lock);
 
-	plist_add(&si->list, &swap_active_head);
-
-	/* Add back to available list */
-	add_to_avail_list(si, true);
+	/*
+	 * Vswap has no backing file and no swapoff support -- keep it
+	 * off swap_active_head (used by swapoff filename lookup and
+	 * swap_sync_discard) and swap_avail_head (physical allocator).
+	 */
+	if (!swap_is_vswap(si)) {
+		plist_add(&si->list, &swap_active_head);
+		add_to_avail_list(si, true);
+	}
 }
 
 /*
@@ -3046,6 +3165,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
 	struct swap_cluster_info *ci;
 
 	BUG_ON(si->flags & SWP_WRITEOK);
+	if (si->flags & SWP_VSWAP)
+		return;
 
 	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
 		ci = swap_cluster_lock(si, offset);
@@ -3184,7 +3305,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	destroy_swap_extents(p, p->swap_file);
 
-	if (!(p->flags & SWP_SOLIDSTATE))
+	if (!(p->flags & SWP_VSWAP) &&
+	    !(p->flags & SWP_SOLIDSTATE))
 		atomic_dec(&nr_rotate_swap);
 
 	mutex_lock(&swapon_mutex);
@@ -3294,6 +3416,19 @@ static void swap_stop(struct seq_file *swap, void *v)
 	mutex_unlock(&swapon_mutex);
 }
 
+static const char *swap_type_str(struct swap_info_struct *si)
+{
+	struct file *file = si->swap_file;
+
+	if (si->flags & SWP_VSWAP)
+		return "vswap\t";
+
+	if (S_ISBLK(file_inode(file)->i_mode))
+		return "partition";
+
+	return "file\t";
+}
+
 static int swap_show(struct seq_file *swap, void *v)
 {
 	struct swap_info_struct *si = v;
@@ -3313,8 +3448,7 @@ static int swap_show(struct seq_file *swap, void *v)
 	len = seq_file_path(swap, file, " \t\n\\");
 	seq_printf(swap, "%*s%s\t%lu\t%s%lu\t%s%d\n",
 			len < 40 ? 40 - len : 1, " ",
-			S_ISBLK(file_inode(file)->i_mode) ?
-				"partition" : "file\t",
+			swap_type_str(si),
 			bytes, bytes < 10000000 ? "\t" : "",
 			inuse, inuse < 10000000 ? "\t" : "",
 			si->prio);
@@ -3446,7 +3580,6 @@ static int claim_swapfile(struct swap_info_struct *si, struct inode *inode)
 	return 0;
 }
 
-
 /*
  * Find out how many pages are allowed for a single swap device. There
  * are two limiting factors:
@@ -3552,10 +3685,43 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 				    unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	struct swap_cluster_info *cluster_info;
+	struct swap_cluster_info *cluster_info = NULL;
+	struct swap_cluster_info_dynamic *ci_dyn;
 	int err = -ENOMEM;
 	unsigned long i;
 
+	/* For SWP_VSWAP files, initialize Xarray pool instead of static array */
+	if (si->flags & SWP_VSWAP) {
+		/*
+		 * Pre-allocate cluster 0 and mark slot 0 (header page)
+		 * as bad so the allocator never hands out page offset 0.
+		 */
+		ci_dyn = kzalloc(sizeof(*ci_dyn), GFP_KERNEL);
+		if (!ci_dyn)
+			goto err;
+		spin_lock_init(&ci_dyn->ci.lock);
+		INIT_LIST_HEAD(&ci_dyn->ci.list);
+
+		nr_clusters = 0;
+		xa_init_flags(&si->cluster_info_pool, XA_FLAGS_ALLOC);
+		err = xa_insert(&si->cluster_info_pool, 0, ci_dyn, GFP_KERNEL);
+		if (err) {
+			kfree(ci_dyn);
+			goto err;
+		}
+
+		err = swap_cluster_setup_bad_slot(si, &ci_dyn->ci, 0, false);
+		if (err) {
+			xa_erase(&si->cluster_info_pool, 0);
+			swap_cluster_free_table(&ci_dyn->ci);
+			kfree(ci_dyn);
+			xa_destroy(&si->cluster_info_pool);
+			goto err;
+		}
+
+		goto setup_cluster_info;
+	}
+
 	cluster_info = kvzalloc_objs(*cluster_info, nr_clusters);
 	if (!cluster_info)
 		goto err;
@@ -3580,6 +3746,10 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 	err = swap_cluster_setup_bad_slot(si, cluster_info, 0, false);
 	if (err)
 		goto err;
+
+	if (!swap_header)
+		goto setup_cluster_info;
+
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 
@@ -3599,6 +3769,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 			goto err;
 	}
 
+setup_cluster_info:
 	INIT_LIST_HEAD(&si->free_clusters);
 	INIT_LIST_HEAD(&si->full_clusters);
 	INIT_LIST_HEAD(&si->discard_clusters);
@@ -3635,7 +3806,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	struct dentry *dentry;
 	int prio;
 	int error;
-	union swap_header *swap_header;
+	union swap_header *swap_header = NULL;
 	int nr_extents;
 	sector_t span;
 	unsigned long maxpages;
@@ -3709,7 +3880,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap_unlock_inode;
 	}
 	swap_header = kmap_local_folio(folio, 0);
-
 	maxpages = read_swap_header(si, swap_header, inode);
 	if (unlikely(!maxpages)) {
 		error = -EINVAL;
@@ -3744,7 +3914,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	if (si->bdev && !bdev_rot(si->bdev)) {
 		si->flags |= SWP_SOLIDSTATE;
-	} else {
+	} else if (!(si->flags & SWP_SOLIDSTATE)) {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
 	}
@@ -3966,3 +4136,47 @@ static int __init swapfile_init(void)
 	return 0;
 }
 subsys_initcall(swapfile_init);
+
+#ifdef CONFIG_VSWAP
+struct swap_info_struct *vswap_si;
+
+static int __init vswap_init(void)
+{
+	struct swap_info_struct *si;
+	unsigned long maxpages;
+	int err;
+
+	si = alloc_swap_info();
+	if (IS_ERR(si))
+		return PTR_ERR(si);
+
+	maxpages = min(swapfile_maximum_size,
+		       ALIGN_DOWN((unsigned long)UINT_MAX, SWAPFILE_CLUSTER));
+	si->flags |= SWP_VSWAP | SWP_SOLIDSTATE | SWP_WRITEOK;
+	si->bdev = NULL;
+	si->max = maxpages;
+	si->pages = maxpages - 1;
+	si->prio = SHRT_MAX;
+	si->list.prio = -si->prio;
+	si->avail_list.prio = -si->prio;
+
+	err = setup_swap_clusters_info(si, NULL, maxpages);
+	if (err)
+		goto fail;
+
+	mutex_lock(&swapon_mutex);
+	enable_swap_info(si);
+	mutex_unlock(&swapon_mutex);
+
+	vswap_si = si;
+	pr_info("vswap: created virtual swap device (%lu pages)\n", maxpages);
+	return 0;
+
+fail:
+	spin_lock(&swap_lock);
+	si->flags = 0;
+	spin_unlock(&swap_lock);
+	return err;
+}
+late_initcall(vswap_init);
+#endif
diff --git a/mm/vswap.h b/mm/vswap.h
new file mode 100644
index 000000000000..094ff16cb5a4
--- /dev/null
+++ b/mm/vswap.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Virtual swap space
+ *
+ * Copyright (C) 2026 Nhat Pham
+ */
+#ifndef _MM_VSWAP_H
+#define _MM_VSWAP_H
+
+#include
+
+#ifdef CONFIG_VSWAP
+
+extern struct swap_info_struct *vswap_si;
+
+static inline bool swap_is_vswap(struct swap_info_struct *si)
+{
+	return si->flags & SWP_VSWAP;
+}
+
+#else
+
+static inline bool swap_is_vswap(struct swap_info_struct *si)
+{
+	return false;
+}
+
+#endif /* CONFIG_VSWAP */
+#endif /* _MM_VSWAP_H */
diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e0a3..993406074d58 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -994,11 +994,16 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct swap_info_struct *si;
 	int ret = 0;
 
-	/* try to allocate swap cache folio */
 	si = get_swap_device(swpentry);
 	if (!si)
 		return -EEXIST;
 
+	if (si->flags & SWP_VSWAP) {
+		put_swap_device(si);
+		return -EINVAL;
+	}
+
+	/* try to allocate swap cache folio */
 	mpol = get_task_policy(current);
 	folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
 				       NO_INTERLEAVE_INDEX);
@@ -1049,7 +1054,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	folio_set_reclaim(folio);
 
 	/* start writeback */
-	__swap_writepage(folio, NULL);
+	ret = __swap_writepage(folio, NULL);
+	WARN_ON_ONCE(ret);
 
 out:
 	if (ret) {
-- 
2.53.0-Meta
From: Nhat Pham
To: kasong@tencent.com
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org, apopple@nvidia.com,
	axelrasmussen@google.com, baohua@kernel.org, baolin.wang@linux.alibaba.com,
	bhe@redhat.com, byungchul@sk.com, cgroups@vger.kernel.org,
	chengming.zhou@linux.dev, chrisl@kernel.org, corbet@lwn.net,
	david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org,
	hughd@google.com, jannh@google.com, joshua.hahnjy@gmail.com,
	lance.yang@linux.dev, lenb@kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-pm@vger.kernel.org,
	lorenzo.stoakes@oracle.com, matthew.brost@intel.com, mhocko@suse.com,
	muchun.song@linux.dev, npache@redhat.com, nphamcs@gmail.com,
	pavel@kernel.org, peterx@redhat.com, peterz@infradead.org, pfalcato@suse.de,
	rafael@kernel.org, rakie.kim@sk.com, roman.gushchin@linux.dev,
	rppt@kernel.org, ryan.roberts@arm.com, shakeel.butt@linux.dev,
	shikemeng@huaweicloud.com, surenb@google.com, tglx@kernel.org,
	vbabka@suse.cz, weixugc@google.com, ying.huang@linux.alibaba.com,
	yosry.ahmed@linux.dev, yuanchu@google.com, zhengqi.arch@bytedance.com,
	ziy@nvidia.com, kernel-team@meta.com, riel@surriel.com,
	haowenchao22@gmail.com
Subject: [RFC PATCH 2/5] mm, swap: support zswap and zeroswap as vswap backends
Date: Thu, 28 May 2026 14:29:26 -0700
Message-ID: <20260528212955.1912856-3-nphamcs@gmail.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260528212955.1912856-1-nphamcs@gmail.com>
References: <20260528212955.1912856-1-nphamcs@gmail.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Build the virtual swap layer on top of the swap-table infrastructure.
Virtual swap entries decouple PTE swap entries from physical backing,
allowing pages to be compressed by zswap (or detected as zero-filled)
without pre-allocating a physical swap slot. This patch only supports
zswap and zero-page backends.
If zswap_store fails, the page stays dirty in the swap cache
(AOP_WRITEPAGE_ACTIVATE); physical disk backing fallback comes in the
next patch. Zswap writeback of vswap-backed entries is also disabled:
the shrinker skips when no physical swap pages are available.

Suggested-by: Kairui Song
Signed-off-by: Nhat Pham
---
 include/linux/zswap.h |   3 +
 mm/internal.h         |  20 ++-
 mm/madvise.c          |   2 +-
 mm/memcontrol.c       |   8 +-
 mm/memory.c           |  20 ++-
 mm/page_io.c          |  61 +++++--
 mm/swap.h             |   4 +-
 mm/swap_state.c       |   8 +
 mm/swap_table.h       |  53 ++++++
 mm/swapfile.c         | 375 +++++++++++++++++++++++++++++++++---------
 mm/vmscan.c           |   5 +-
 mm/vswap.h            | 292 +++++++++++++++++++++++++++++++-
 mm/zswap.c            | 106 +++++++-----
 13 files changed, 807 insertions(+), 150 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..4b4f211f3301 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -6,6 +6,7 @@
 #include

 struct lruvec;
+struct zswap_entry;

 extern atomic_long_t zswap_stored_pages;

@@ -28,6 +29,7 @@ unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
 int zswap_load(struct folio *folio);
 void zswap_invalidate(swp_entry_t swp);
+void zswap_entry_free(struct zswap_entry *entry);
 int zswap_swapon(int type, unsigned long nr_pages);
 void zswap_swapoff(int type);
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
@@ -50,6 +52,7 @@ static inline int zswap_load(struct folio *folio)
 }

 static inline void zswap_invalidate(swp_entry_t swp) {}
+static inline void zswap_entry_free(struct zswap_entry *entry) {}
 static inline int zswap_swapon(int type, unsigned long nr_pages)
 {
 	return 0;
diff --git a/mm/internal.h b/mm/internal.h
index 7646ecb9d621..23ea4c8172df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include "vswap.h"
 #include
 #include

@@ -436,6 +437,9 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
  * @start_ptep: Page table pointer for the first entry.
  * @max_nr: The maximum number of table entries to consider.
  * @pte: Page table entry for the first entry.
+ * @free_batch: True when the batch is for a free path. Skips the
+ *		vswap uniform-backing check (which is only relevant
+ *		for swapin batches).
  *
  * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
  * containing swap entries all with consecutive offsets and targeting the same
@@ -446,11 +450,14 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
  *
  * Return: the number of table entries in the batch.
  */
-static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
+static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte,
+				 bool free_batch)
 {
 	pte_t expected_pte = pte_next_swp_offset(pte);
 	const pte_t *end_ptep = start_ptep + max_nr;
 	pte_t *ptep = start_ptep + 1;
+	swp_entry_t entry __maybe_unused;
+	int nr;

 	VM_WARN_ON(max_nr < 1);
 	VM_WARN_ON(!softleaf_is_swap(softleaf_from_pte(pte)));
@@ -464,7 +471,16 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
 		ptep++;
 	}

-	return ptep - start_ptep;
+	nr = ptep - start_ptep;
+#ifdef CONFIG_VSWAP
+	if (!free_batch) {
+		entry = softleaf_from_pte(ptep_get(start_ptep));
+		if (nr > 1 && swap_is_vswap(__swap_entry_to_info(entry)) &&
+		    !vswap_can_swapin_thp(entry, nr))
+			return 1;
+	}
+#endif
+	return nr;
 }
 #endif /* CONFIG_MMU */

diff --git a/mm/madvise.c b/mm/madvise.c
index cd9bb077072c..75ec10fbd61a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -692,7 +692,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,

 		if (softleaf_is_swap(entry)) {
 			max_nr = (end - addr) / PAGE_SIZE;
-			nr = swap_pte_batch(pte, max_nr, ptent);
+			nr = swap_pte_batch(pte, max_nr, ptent, true);
 			nr_swap -= nr;
 			swap_put_entries_direct(entry, nr);
 			clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 039e9bc8971c..a3ad83c229f7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -48,6 +48,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -5538,8 +5539,13 @@ void __mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages)

 long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 {
-	long nr_swap_pages = get_nr_swap_pages();
+	long nr_swap_pages;

+	/* vswap provides unbounded virtual swap when zswap is enabled */
+	if (IS_ENABLED(CONFIG_VSWAP) && zswap_is_enabled())
+		return PAGE_COUNTER_MAX;
+
+	nr_swap_pages = get_nr_swap_pages();
 	if (mem_cgroup_disabled() || do_memsw_account())
 		return nr_swap_pages;
 	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
diff --git a/mm/memory.c b/mm/memory.c
index 7c020995eafc..c3050e49b086 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1764,7 +1764,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		if (!should_zap_cows(details))
 			return 1;

-		nr = swap_pte_batch(pte, max_nr, ptent);
+		nr = swap_pte_batch(pte, max_nr, ptent, true);
 		rss[MM_SWAPENTS] -= nr;
 		swap_put_entries_direct(entry, nr);
 	} else if (softleaf_is_migration(entry)) {
@@ -4630,7 +4630,7 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
 	 * from different backends. And they are likely corner cases. Similar
 	 * things might be added once zswap support large folios.
 	 */
-	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
+	if (swap_pte_batch(ptep, nr_pages, pte, false) != nr_pages)
 		return false;
 	return true;
 }
@@ -4675,15 +4675,19 @@ static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf)
 	if (unlikely(userfaultfd_armed(vma)))
 		return 0;

+	entry = softleaf_from_pte(vmf->orig_pte);
+
 	/*
-	 * A large swapped out folio could be partially or fully in zswap. We
-	 * lack handling for such cases, so fallback to swapping in order-0
-	 * folio.
+	 * A large swapped out folio could be partially or fully in zswap.
+	 * With vswap, vswap_can_swapin_thp() (via swap_pte_batch) lets
+	 * THP swapin through only for backings that don't need per-page
+	 * decompression. For non-vswap entries we still need the
+	 * zswap_never_enabled() bail -- zswap_load rejects large folios
+	 * with -EINVAL, which would SIGBUS the fault.
 	 */
-	if (!zswap_never_enabled())
+	if (!swap_is_vswap(__swap_entry_to_info(entry)) && !zswap_never_enabled())
 		return 0;

-	entry = softleaf_from_pte(vmf->orig_pte);
 	/*
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * and suitable for swapping THP.
@@ -4942,7 +4946,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_ptep = vmf->pte - idx;
 		folio_pte = ptep_get(folio_ptep);
 		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
-		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+		    swap_pte_batch(folio_ptep, nr, folio_pte, false) != nr)
 			goto check_folio;

 		page_idx = idx;
diff --git a/mm/page_io.c b/mm/page_io.c
index 8126be6e4cfb..b3c7e56c8eed 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -27,6 +27,7 @@
 #include
 #include "swap.h"
 #include "swap_table.h"
+#include "vswap.h"

 static void __end_swap_bio_write(struct bio *bio)
 {
@@ -204,19 +205,28 @@ static bool is_folio_zero_filled(struct folio *folio)

 static void swap_zeromap_folio_set(struct folio *folio)
 {
+	struct swap_info_struct *si = __swap_entry_to_info(folio->swap);
 	struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
 	int nr_pages = folio_nr_pages(folio);
 	struct swap_cluster_info *ci;
+	unsigned int voff, i;
 	swp_entry_t entry;
-	unsigned int i;

 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);

 	ci = swap_cluster_get_and_lock(folio);
-	for (i = 0; i < folio_nr_pages(folio); i++) {
-		entry = page_swap_entry(folio_page(folio, i));
-		__swap_table_set_zero(ci, swp_cluster_offset(entry));
+	if (swap_is_vswap(si)) {
+		voff = swp_cluster_offset(folio->swap);
+		/* Free any prior backing (e.g. ZSWAP entry from earlier swapout) */
+		vswap_release_backing(ci, voff, nr_pages);
+		for (i = 0; i < nr_pages; i++)
+			vswap_set_zero(ci, voff + i);
+	} else {
+		for (i = 0; i < nr_pages; i++) {
+			entry = page_swap_entry(folio_page(folio, i));
+			__swap_table_set_zero(ci, swp_cluster_offset(entry));
+		}
 	}
 	swap_cluster_unlock(ci);

@@ -282,6 +292,9 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
 	 */
 	swap_zeromap_folio_clear(folio);

+	if (swap_is_vswap(__swap_entry_to_info(folio->swap)))
+		vswap_prepare_writeout(folio->swap, folio);
+
 	if (zswap_store(folio)) {
 		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		goto out_unlock;
@@ -295,6 +308,11 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
 	}
 	rcu_read_unlock();

+	if (swap_is_vswap(__swap_entry_to_info(folio->swap))) {
+		folio_mark_dirty(folio);
+		return AOP_WRITEPAGE_ACTIVATE;
+	}
+
 	return __swap_writepage(folio, swap_plug);
 out_unlock:
 	folio_unlock(folio);
@@ -537,23 +555,40 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
 static int swap_zeromap_batch(swp_entry_t entry, int max_nr,
 		bool *is_zerop)
 {
-	int i;
-	bool is_zero;
-	unsigned int ci_start = swp_cluster_offset(entry);
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
+	unsigned int ci_start = swp_cluster_offset(entry), ci_off, ci_end;
+	bool is_zero;

 	VM_WARN_ON_ONCE(ci_start + max_nr > SWAPFILE_CLUSTER);

+	ci_off = ci_start;
+	ci_end = ci_off + max_nr;
+
+	if (swap_is_vswap(si)) {
+		spin_lock(&ci->lock);
+		is_zero = vswap_test_zero(ci, ci_off);
+		if (is_zerop)
+			*is_zerop = is_zero;
+		while (++ci_off < ci_end) {
+			if (is_zero != vswap_test_zero(ci, ci_off))
+				break;
+		}
+		spin_unlock(&ci->lock);
+		return ci_off - ci_start;
+	}
+
 	rcu_read_lock();
-	is_zero = __swap_table_test_zero(ci, ci_start);
-	for (i = 1; i < max_nr; i++)
-		if (is_zero != __swap_table_test_zero(ci, ci_start + i))
-			break;
-	rcu_read_unlock();
+	is_zero = __swap_table_test_zero(ci, ci_off);
 	if (is_zerop)
 		*is_zerop = is_zero;
+	while (++ci_off < ci_end) {
+		if (is_zero != __swap_table_test_zero(ci, ci_off))
+			break;
+	}
+	rcu_read_unlock();

-	return i;
+	return ci_off - ci_start;
 }

 static bool swap_read_folio_zeromap(struct folio *folio)
diff --git a/mm/swap.h b/mm/swap.h
index 479ee5871cb9..640413e30880 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -69,7 +69,9 @@ struct swap_cluster_info_dynamic {
 	struct swap_cluster_info ci;	/* Underlying cluster info */
 	unsigned int index;		/* for cluster_index() */
 	struct rcu_head rcu;		/* For kfree_rcu deferred free */
-	/* Backend pointers (virtual_table) added in a later patch. */
+#ifdef CONFIG_VSWAP
+	atomic_long_t *virtual_table;	/* Backing pointers for vswap slots */
+#endif
 };

 /* All on-list cluster must have a non-zero flag. */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b063c47138c5..6bfa185b7d0f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,6 +25,7 @@
 #include "internal.h"
 #include "swap_table.h"
 #include "swap.h"
+#include "vswap.h"

 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -692,6 +693,13 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orders,
 	if (IS_ERR(folio))
 		return folio;

+	if (folio_test_large(folio) && swap_is_vswap(__swap_entry_to_info(folio->swap)) &&
+	    !vswap_can_swapin_thp(folio->swap, folio_nr_pages(folio))) {
+		folio_unlock(folio);
+		folio_put(folio);
+		return NULL;
+	}
+
 	swap_read_folio(folio, NULL);
 	return folio;
 }
diff --git a/mm/swap_table.h b/mm/swap_table.h
index fd7f0fb9836a..b0e7ef9c966b 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -6,6 +6,8 @@
 #include
 #include "swap.h"

+struct zswap_entry;
+
 /* A typical flat array in each cluster as swap table */
 struct swap_table {
 	atomic_long_t entries[SWAPFILE_CLUSTER];
@@ -368,4 +370,55 @@ static inline unsigned short __swap_cgroup_clear(struct swap_cluster_info *ci,
 }
 #endif

+/*
+ * Pointer-tagged swap table entry: rmap for vswap-backing physical slots.
+ *
+ * On physical clusters, a Pointer-tagged entry stores the vswap entry
+ * that owns this physical slot (the reverse map). The top bit is reserved
+ * as a cache-only flag, set when vswap swap_count drops to 0 but the
+ * folio is still in swap cache.
+ *
+ * Pointer: |C|--- vswap entry ---|100|
+ * C = SWP_RMAP_CACHE_ONLY (bit 63)
+ */
+#ifdef CONFIG_VSWAP
+#define SWP_TB_PTR_MARK		0b100UL
+#define SWP_TB_PTR_MARK_MASK	0b111UL
+#define SWP_RMAP_CACHE_ONLY	(1UL << (BITS_PER_LONG - 1))
+#define SWP_RMAP_ENTRY_MASK	(~(SWP_RMAP_CACHE_ONLY | SWP_TB_PTR_MARK_MASK))
+
+static inline bool swp_tb_is_pointer(unsigned long swp_tb)
+{
+	return (swp_tb & SWP_TB_PTR_MARK_MASK) == SWP_TB_PTR_MARK;
+}
+
+static inline unsigned long swp_entry_to_swp_tb_ptr(swp_entry_t entry)
+{
+	return (entry.val << 3) | SWP_TB_PTR_MARK;
+}
+
+static inline swp_entry_t swp_tb_ptr_to_swp_entry(unsigned long swp_tb)
+{
+	swp_entry_t entry;
+
+	VM_WARN_ON(!swp_tb_is_pointer(swp_tb));
+	entry.val = (swp_tb & SWP_RMAP_ENTRY_MASK) >> 3;
+	return entry;
+}
+#else
+static inline bool swp_tb_is_pointer(unsigned long swp_tb)
+{
+	return false;
+}
+static inline unsigned long swp_entry_to_swp_tb_ptr(swp_entry_t entry)
+{
+	return 0;
+}
+static inline swp_entry_t swp_tb_ptr_to_swp_entry(unsigned long swp_tb)
+{
+	return (swp_entry_t){};
+}
+
+#endif /* CONFIG_VSWAP */
+
 #endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f6d2529159ff..c90d83fd628a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -131,6 +131,26 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
 	.lock = INIT_LOCAL_LOCK(),
 };

+#ifdef CONFIG_VSWAP
+struct percpu_vswap_cluster {
+	unsigned long offset[SWAP_NR_ORDERS];
+	local_lock_t lock;
+};
+
+static DEFINE_PER_CPU(struct percpu_vswap_cluster, percpu_vswap_cluster) = {
+	.offset = { [0 ... SWAP_NR_ORDERS - 1] = SWAP_ENTRY_INVALID },
+	.lock = INIT_LOCAL_LOCK(),
+};
+
+static bool vswap_alloc(struct folio *folio);
+static void vswap_free_cluster(struct swap_info_struct *si,
+			       struct swap_cluster_info *ci);
+#else
+static inline bool vswap_alloc(struct folio *folio) { return false; }
+static inline void vswap_free_cluster(struct swap_info_struct *si,
+				      struct swap_cluster_info *ci) {}
+#endif
+
 /* May return NULL on invalid type, caller must check for NULL return */
 static struct swap_info_struct *swap_type_to_info(int type)
 {
@@ -538,8 +558,14 @@ swap_cluster_populate(struct swap_info_struct *si,
 	 * Only cluster isolation from the allocator does table allocation.
 	 * Swap allocator uses percpu clusters and holds the local lock.
 	 */
-	lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
-	if (!(si->flags & SWP_SOLIDSTATE))
+#ifdef CONFIG_VSWAP
+	if (swap_is_vswap(si))
+		lockdep_assert_held(&this_cpu_ptr(&percpu_vswap_cluster)->lock);
+	else
+#endif
+	if (si->flags & SWP_SOLIDSTATE)
+		lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
+	else
 		lockdep_assert_held(&si->global_cluster_lock);
 	lockdep_assert_held(&ci->lock);

@@ -555,7 +581,12 @@ swap_cluster_populate(struct swap_info_struct *si,
 	spin_unlock(&ci->lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
-	local_unlock(&percpu_swap_cluster.lock);
+#ifdef CONFIG_VSWAP
+	if (swap_is_vswap(si))
+		local_unlock(&percpu_vswap_cluster.lock);
+	else
+#endif
+		local_unlock(&percpu_swap_cluster.lock);

 	ret = swap_cluster_alloc_table(ci, __GFP_HIGH | __GFP_NOMEMALLOC |
 				       GFP_KERNEL);
@@ -568,7 +599,12 @@ swap_cluster_populate(struct swap_info_struct *si,
 	 * could happen with ignoring the percpu cluster is fragmentation,
 	 * which is acceptable since this fallback and race is rare.
 	 */
-	local_lock(&percpu_swap_cluster.lock);
+#ifdef CONFIG_VSWAP
+	if (swap_is_vswap(si))
+		local_lock(&percpu_vswap_cluster.lock);
+	else
+#endif
+		local_lock(&percpu_swap_cluster.lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_lock(&si->global_cluster_lock);
 	spin_lock(&ci->lock);
@@ -738,19 +774,12 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
 		return;
 	}

+	/*
+	 * Vswap dynamic clusters need explicit cleanup (xarray erase,
+	 * kfree_rcu, virtual_table free if allocated).
+	 */
 	if (si->flags & SWP_VSWAP) {
-		struct swap_cluster_info_dynamic *ci_dyn;
-
-		ci_dyn = container_of(ci, struct swap_cluster_info_dynamic, ci);
-		if (ci->flags != CLUSTER_FLAG_NONE) {
-			spin_lock(&si->lock);
-			list_del(&ci->list);
-			spin_unlock(&si->lock);
-		}
-		swap_cluster_free_table(ci);
-		xa_erase(&si->cluster_info_pool, ci_dyn->index);
-		ci->flags = CLUSTER_FLAG_DEAD;
-		kfree_rcu(ci_dyn, rcu);
+		vswap_free_cluster(si, ci);
 		return;
 	}

@@ -874,6 +903,8 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 	spin_unlock(&ci->lock);
 	do {
 		swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+		if (swp_tb_is_pointer(swp_tb))
+			break;
 		if (swp_tb_get_count(swp_tb))
 			break;
 		if (swp_tb_is_folio(swp_tb))
@@ -946,47 +977,29 @@ static bool cluster_scan_range(struct swap_info_struct *si,

 static bool __swap_cluster_alloc_entries(struct swap_info_struct *si,
 					 struct swap_cluster_info *ci,
+					 unsigned int ci_off,
+					 unsigned long swp_tb,
 					 struct folio *folio,
-					 unsigned int ci_off)
+					 unsigned int order)
 {
-	unsigned int order;
-	unsigned long nr_pages;
+	unsigned long nr_pages = 1 << order;

 	lockdep_assert_held(&ci->lock);

 	if (!(si->flags & SWP_WRITEOK))
 		return false;

-	/*
-	 * All mm swap allocation starts with a folio (folio_alloc_swap),
-	 * it's also the only allocation path for large orders allocation.
-	 * Such swap slots starts with count == 0 and will be increased
-	 * upon folio unmap.
-	 *
-	 * Else, it's a exclusive order 0 allocation for hibernation.
-	 * The slot starts with count == 1 and never increases.
-	 */
-	if (likely(folio)) {
-		order = folio_order(folio);
-		nr_pages = 1 << order;
-		swap_cluster_assert_empty(ci, ci_off, nr_pages, false);
+	swap_cluster_assert_empty(ci, ci_off, nr_pages, false);
+
+	if (swp_tb_is_folio(swp_tb))
 		__swap_cache_add_folio(ci, folio, swp_entry(si->type,
 							    ci_off + cluster_offset(si, ci)));
-	} else if (IS_ENABLED(CONFIG_HIBERNATION)) {
-		order = 0;
-		nr_pages = 1;
-		swap_cluster_assert_empty(ci, ci_off, 1, false);
-		/* Fake shadow placeholder with no flag, hibernation does not use the zeromap */
-		__swap_table_set(ci, ci_off, __swp_tb_mk_count(shadow_to_swp_tb(NULL, 0), 1));
-	} else {
-		/* Allocation without folio is only possible with hibernation */
-		WARN_ON_ONCE(1);
-		return false;
-	}
+	else
+		__swap_table_set(ci, ci_off, swp_tb);

 	/*
 	 * The first allocation in a cluster makes the
-	 * cluster exclusive to this order
+	 * cluster exclusive to this order.
 	 */
 	if (cluster_is_empty(ci))
 		ci->order = order;
@@ -999,11 +1012,13 @@ static bool __swap_cluster_alloc_entries(struct swap_info_struct *si,
 /* Try use a new cluster for current CPU and allocate from it. */
 static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 					    struct swap_cluster_info *ci,
-					    struct folio *folio, unsigned long offset)
+					    struct folio *folio,
+					    unsigned long offset,
+					    unsigned long swp_tb)
 {
 	unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
 	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
-	unsigned int order = likely(folio) ? folio_order(folio) : 0;
+	unsigned int order = folio ? folio_order(folio) : 0;
 	unsigned long end = start + SWAPFILE_CLUSTER;
 	unsigned int nr_pages = 1 << order;
 	bool need_reclaim, ret, usable;
@@ -1029,7 +1044,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 			if (!ret)
 				continue;
 		}
-		if (!__swap_cluster_alloc_entries(si, ci, folio, offset % SWAPFILE_CLUSTER))
+		if (!__swap_cluster_alloc_entries(si, ci, offset % SWAPFILE_CLUSTER,
+						  swp_tb, folio, order))
 			break;
 		found = offset;
 		offset += nr_pages;
@@ -1042,6 +1058,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 		relocate_cluster(si, ci);
 		swap_cluster_unlock(ci);
 	}
+#ifdef CONFIG_VSWAP
+	if (swap_is_vswap(si)) {
+		this_cpu_write(percpu_vswap_cluster.offset[order], next);
+	} else
+#endif
 	if (si->flags & SWP_SOLIDSTATE) {
 		this_cpu_write(percpu_swap_cluster.offset[order], next);
 		this_cpu_write(percpu_swap_cluster.si[order], si);
@@ -1054,7 +1075,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
 					 struct list_head *list,
 					 struct folio *folio,
-					 bool scan_all)
+					 bool scan_all,
+					 unsigned long swp_tb)
 {
 	unsigned int found = SWAP_ENTRY_INVALID;

@@ -1065,7 +1087,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
 		if (!ci)
 			break;
 		offset = cluster_offset(si, ci);
-		found = alloc_swap_scan_cluster(si, ci, folio, offset);
+		found = alloc_swap_scan_cluster(si, ci, folio, offset, swp_tb);
 		if (found)
 			break;
 	} while (scan_all);
@@ -1074,7 +1096,8 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
 }

 static unsigned int alloc_swap_scan_dynamic(struct swap_info_struct *si,
-					    struct folio *folio)
+					    struct folio *folio,
+					    unsigned long swp_tb)
 {
 	struct swap_cluster_info_dynamic *ci_dyn;
 	struct swap_cluster_info *ci;
@@ -1094,10 +1117,17 @@ static unsigned int alloc_swap_scan_dynamic(struct swap_info_struct *si,
 		return SWAP_ENTRY_INVALID;
 	}

+	if (vswap_cluster_alloc_vtable(ci_dyn)) {
+		swap_cluster_free_table(&ci_dyn->ci);
+		kfree(ci_dyn);
+		return SWAP_ENTRY_INVALID;
+	}
+
 	if (xa_alloc(&si->cluster_info_pool, &ci_dyn->index, ci_dyn,
 		     XA_LIMIT(1, DIV_ROUND_UP(si->max, SWAPFILE_CLUSTER) - 1),
 		     GFP_ATOMIC)) {
 		swap_cluster_free_table(&ci_dyn->ci);
+		vswap_cluster_free_vtable(&ci_dyn->ci);
 		kfree(ci_dyn);
 		return SWAP_ENTRY_INVALID;
 	}
@@ -1105,7 +1135,7 @@ static unsigned int alloc_swap_scan_dynamic(struct swap_info_struct *si,
 	ci = &ci_dyn->ci;
 	spin_lock(&ci->lock);
 	offset = cluster_offset(si, ci);
-	return alloc_swap_scan_cluster(si, ci, folio, offset);
+	return alloc_swap_scan_cluster(si, ci, folio, offset, swp_tb);
 }

 static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
@@ -1166,18 +1196,20 @@ static void swap_reclaim_work(struct work_struct *work)
  * Try to allocate swap entries with specified order and try set a new
  * cluster for current CPU too.
  */
-static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
-					      struct folio *folio)
+static unsigned long cluster_alloc_swap_entry_tb(struct swap_info_struct *si,
+						 struct folio *folio,
+						 unsigned long swp_tb)
 {
+	unsigned int order = folio ? folio_order(folio) : 0;
 	struct swap_cluster_info *ci;
-	unsigned int order = likely(folio) ? folio_order(folio) : 0;
 	unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;

 	/*
-	 * Swapfile is not block device so unable
-	 * to allocate large entries.
+	 * File-based swap can't do large contiguous IO. vswap has no IO
+	 * here (large entries are fine; THP swapin uses vswap_can_swapin_thp
+	 * to gate based on backing).
*/ - if (order && !(si->flags & SWP_BLKDEV)) + if (order && !(si->flags & SWP_BLKDEV) && !swap_is_vswap(si)) return 0; =20 if (!(si->flags & SWP_SOLIDSTATE)) { @@ -1192,7 +1224,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, folio, offset); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset, swp_tb); } else { swap_cluster_unlock(ci); } @@ -1206,25 +1238,25 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, * to spread out the writes. */ if (si->flags & SWP_PAGE_DISCARD) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false, swp= _tb); if (found) goto done; } =20 if (order < PMD_ORDER) { - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, = true); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, = true, swp_tb); if (found) goto done; } =20 if (si->flags & SWP_VSWAP) { - found =3D alloc_swap_scan_dynamic(si, folio); + found =3D alloc_swap_scan_dynamic(si, folio, swp_tb); if (found) goto done; } =20 if (!(si->flags & SWP_PAGE_DISCARD)) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false, swp= _tb); if (found) goto done; } @@ -1240,7 +1272,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, * failure is not critical. Scanning one cluster still * keeps the list rotated and reclaimed (for clean swap cache). 
*/ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], folio, fal= se); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], folio, fal= se, swp_tb); if (found) goto done; } @@ -1254,11 +1286,11 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, * Clusters here have at least one usable slots and can't fail order 0 * allocation, but reclaim may drop si->lock and race with another user. */ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true, s= wp_tb); if (found) goto done; =20 - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true= ); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true= , swp_tb); if (found) goto done; } @@ -1394,7 +1426,8 @@ static void swap_range_alloc(struct swap_info_struct = *si, if (vm_swap_full()) schedule_work(&si->reclaim_work); } - atomic_long_sub(nr_entries, &nr_swap_pages); + if (!swap_is_vswap(si)) + atomic_long_sub(nr_entries, &nr_swap_pages); } =20 static void swap_range_free(struct swap_info_struct *si, unsigned long off= set, @@ -1404,8 +1437,10 @@ static void swap_range_free(struct swap_info_struct = *si, unsigned long offset, void (*swap_slot_free_notify)(struct block_device *, unsigned long); unsigned int i; =20 - for (i =3D 0; i < nr_entries; i++) - zswap_invalidate(swp_entry(si->type, offset + i)); + if (!swap_is_vswap(si)) { + for (i =3D 0; i < nr_entries; i++) + zswap_invalidate(swp_entry(si->type, offset + i)); + } =20 if (si->flags & SWP_BLKDEV) swap_slot_free_notify =3D @@ -1424,7 +1459,8 @@ static void swap_range_free(struct swap_info_struct *= si, unsigned long offset, * only after the above cleanups are done. 
*/ smp_wmb(); - atomic_long_add(nr_entries, &nr_swap_pages); + if (!swap_is_vswap(si)) + atomic_long_add(nr_entries, &nr_swap_pages); swap_usage_sub(si, nr_entries); } =20 @@ -1452,12 +1488,15 @@ static bool get_swap_device_info(struct swap_info_s= truct *si) * Fast path try to get swap entries with specified order from current * CPU's swap entry pool (a cluster). */ -static bool swap_alloc_fast(struct folio *folio) +static swp_entry_t swap_alloc_fast(struct folio *folio) { unsigned int order =3D folio_order(folio); struct swap_cluster_info *ci; struct swap_info_struct *si; - unsigned int offset; + unsigned long offset, swp_tb; + unsigned long found =3D 0; + + lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock); =20 /* * Once allocated, swap_info_struct will never be completely freed, @@ -1466,25 +1505,32 @@ static bool swap_alloc_fast(struct folio *folio) si =3D this_cpu_read(percpu_swap_cluster.si[order]); offset =3D this_cpu_read(percpu_swap_cluster.offset[order]); if (!si || !offset || !get_swap_device_info(si)) - return false; + return (swp_entry_t){}; + + swp_tb =3D folio_to_swp_tb(folio, 0); =20 ci =3D swap_cluster_lock(si, offset); if (ci && cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - alloc_swap_scan_cluster(si, ci, folio, offset); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset, swp_tb); } else if (ci) { swap_cluster_unlock(ci); } =20 put_swap_device(si); - return folio_test_swapcache(folio); + if (found) + return swp_entry(si->type, found); + return (swp_entry_t){}; } =20 /* Rotate the device and switch to a new cluster */ -static void swap_alloc_slow(struct folio *folio) +static swp_entry_t swap_alloc_slow(struct folio *folio) { struct swap_info_struct *si, *next; + unsigned long swp_tb, found; + + swp_tb =3D folio_to_swp_tb(folio, 0); =20 spin_lock(&swap_avail_lock); start_over: @@ -1493,12 +1539,13 @@ static void swap_alloc_slow(struct folio *folio) plist_requeue(&si->avail_list, 
&swap_avail_head); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - cluster_alloc_swap_entry(si, folio); + found =3D cluster_alloc_swap_entry_tb(si, folio, + swp_tb); put_swap_device(si); - if (folio_test_swapcache(folio)) - return; + if (found) + return swp_entry(si->type, found); if (folio_test_large(folio)) - return; + return (swp_entry_t){}; } =20 spin_lock(&swap_avail_lock); @@ -1516,6 +1563,7 @@ static void swap_alloc_slow(struct folio *folio) goto start_over; } spin_unlock(&swap_avail_lock); + return (swp_entry_t){}; } =20 /* @@ -1695,6 +1743,15 @@ static void swap_put_entries_cluster(struct swap_inf= o_struct *si, if (!need_reclaim || !reclaim_cache) return; =20 + /* + * Vswap space is dynamically allocated and effectively infinite =E2=80= =94 + * there is no benefit to reclaiming swap cache entries to free + * virtual slots. Physical slot reclaim is handled separately via + * SWP_RMAP_CACHE_ONLY on the physical cluster. + */ + if (swap_is_vswap(si)) + return; + do { nr_reclaimed =3D __try_to_reclaim_swap(si, offset, TTRS_UNMAPPED | TTRS_FULL); @@ -1800,6 +1857,44 @@ static int swap_dup_entries_cluster(struct swap_info= _struct *si, * Context: Caller needs to hold the folio lock. * Return: Whether the folio was added to the swap cache. 
*/ +#ifdef CONFIG_VSWAP +static bool vswap_alloc(struct folio *folio) +{ + unsigned int order =3D folio_order(folio); + struct swap_cluster_info *ci; + unsigned long offset; + + local_lock(&percpu_vswap_cluster.lock); + offset =3D this_cpu_read(percpu_vswap_cluster.offset[order]); + + if (offset !=3D SWAP_ENTRY_INVALID) { + ci =3D swap_cluster_lock(vswap_si, offset); + if (ci && cluster_is_usable(ci, order)) { + if (cluster_is_empty(ci)) + offset =3D cluster_offset(vswap_si, ci); + alloc_swap_scan_cluster(vswap_si, ci, folio, + offset, folio_to_swp_tb(folio, 0)); + } else if (ci) { + swap_cluster_unlock(ci); + } + } + + if (!folio_test_swapcache(folio)) + cluster_alloc_swap_entry_tb(vswap_si, folio, + folio_to_swp_tb(folio, 0)); + + if (folio_test_swapcache(folio)) { + /* alloc_swap_scan_cluster updated percpu offset already */ + local_unlock(&percpu_vswap_cluster.lock); + return true; + } + + this_cpu_write(percpu_vswap_cluster.offset[order], SWAP_ENTRY_INVALID); + local_unlock(&percpu_vswap_cluster.lock); + return false; +} +#endif + int folio_alloc_swap(struct folio *folio) { unsigned int order =3D folio_order(folio); @@ -1827,12 +1922,21 @@ int folio_alloc_swap(struct folio *folio) } } =20 + /* + * Skip vswap when zswap is disabled =E2=80=94 without zswap, vswap entri= es + * have nowhere to go on writeout (no physical fallback yet; that + * arrives in the next patch). 
+ */ + if (zswap_is_enabled() && vswap_alloc(folio)) + goto done; + again: local_lock(&percpu_swap_cluster.lock); - if (!swap_alloc_fast(folio)) + if (!swap_alloc_fast(folio).val) swap_alloc_slow(folio); local_unlock(&percpu_swap_cluster.lock); =20 +done: if (!order && unlikely(!folio_test_swapcache(folio))) { if (swap_sync_discard()) goto again; @@ -1848,6 +1952,106 @@ int folio_alloc_swap(struct folio *folio) return 0; } =20 +#ifdef CONFIG_VSWAP +static void vswap_free_cluster(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + struct swap_cluster_info_dynamic *ci_dyn; + + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + if (ci->flags !=3D CLUSTER_FLAG_NONE) { + spin_lock(&si->lock); + list_del(&ci->list); + spin_unlock(&si->lock); + } + swap_cluster_free_table(ci); + vswap_cluster_free_vtable(ci); + xa_erase(&si->cluster_info_pool, ci_dyn->index); + ci->flags =3D CLUSTER_FLAG_DEAD; + kfree_rcu(ci_dyn, rcu); +} + +void vswap_release_backing(struct swap_cluster_info *ci, + unsigned int ci_start, unsigned int nr) +{ + struct swap_cluster_info_dynamic *ci_dyn; + unsigned int ci_off; + unsigned long vt; + + lockdep_assert_held(&ci->lock); + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + + for (ci_off =3D ci_start; ci_off < ci_start + nr; ci_off++) { + vt =3D __vtable_get(ci_dyn, ci_off); + + switch (vtable_type(vt)) { + case VSWAP_ZSWAP: + if (vtable_to_zswap(vt)) + zswap_entry_free(vtable_to_zswap(vt)); + break; + case VSWAP_SWAPFILE: + case VSWAP_FOLIO: + case VSWAP_ZERO: + case VSWAP_NONE: + break; + } + + __vtable_set(ci_dyn, ci_off, vtable_mk_none()); + } +} + +void vswap_store_folio(swp_entry_t entry, struct folio *folio) +{ + struct swap_cluster_info *ci; + struct swap_cluster_info_dynamic *ci_dyn; + int i, nr =3D folio_nr_pages(folio); + unsigned int voff; + + ci =3D __swap_entry_to_cluster(entry); + if (!ci) + return; + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + voff =3D 
swp_cluster_offset(entry); + + spin_lock(&ci->lock); + vswap_release_backing(ci, voff, nr); + for (i =3D 0; i < nr; i++) + __vtable_set(ci_dyn, voff + i, vtable_mk_folio(folio)); + spin_unlock(&ci->lock); +} + +void vswap_prepare_writeout(swp_entry_t entry, struct folio *folio) +{ + struct swap_cluster_info *ci; + struct swap_cluster_info_dynamic *ci_dyn; + int i, nr =3D folio_nr_pages(folio); + unsigned int voff; + unsigned long vt; + enum vswap_backing_type type; + + ci =3D __swap_entry_to_cluster(entry); + if (!ci) + return; + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + voff =3D swp_cluster_offset(entry); + + spin_lock(&ci->lock); + vt =3D __vtable_get(ci_dyn, voff); + type =3D vtable_type(vt); + + if (type =3D=3D VSWAP_SWAPFILE || type =3D=3D VSWAP_FOLIO || type =3D=3D = VSWAP_NONE) { + spin_unlock(&ci->lock); + return; + } + + vswap_release_backing(ci, voff, nr); + for (i =3D 0; i < nr; i++) + __vtable_set(ci_dyn, voff + i, vtable_mk_folio(folio)); + spin_unlock(&ci->lock); +} + +#endif /* CONFIG_VSWAP */ + /** * folio_dup_swap() - Increase swap count of swap entries of a folio. * @folio: folio with swap entries bounded. 
@@ -1989,6 +2193,9 @@ void __swap_cluster_free_entries(struct swap_info_str= uct *si, =20 VM_WARN_ON(ci->count < nr_pages); =20 + if (swap_is_vswap(si)) + vswap_release_backing(ci, ci_start, nr_pages); + ci->count -=3D nr_pages; do { old_tb =3D __swap_table_get(ci, ci_off); @@ -2240,12 +2447,15 @@ swp_entry_t swap_alloc_hibernation_slot(int type) if (pcp_si =3D=3D si && pcp_offset) { ci =3D swap_cluster_lock(si, pcp_offset); if (ci && cluster_is_usable(ci, 0)) - offset =3D alloc_swap_scan_cluster(si, ci, NULL, pcp_offset); + offset =3D alloc_swap_scan_cluster(si, ci, NULL, + pcp_offset, + __swp_tb_mk_count( + shadow_to_swp_tb(NULL, 0), 1)); else if (ci) swap_cluster_unlock(ci); } if (!offset) - offset =3D cluster_alloc_swap_entry(si, NULL); + offset =3D cluster_alloc_swap_entry_tb(si, NULL, __swp_tb_mk_count(shado= w_to_swp_tb(NULL, 0), 1)); local_unlock(&percpu_swap_cluster.lock); if (offset) entry =3D swp_entry(si->type, offset); @@ -2915,6 +3125,7 @@ static int try_to_unuse(unsigned int type) (i =3D find_next_to_unuse(si, i)) !=3D 0) { =20 entry =3D swp_entry(type, i); + folio =3D swap_cache_get_folio(entry); if (!folio) continue; diff --git a/mm/vmscan.c b/mm/vmscan.c index ca4533eba701..94b6cfcc28ac 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -350,6 +350,9 @@ static inline bool can_reclaim_anon_pages(struct mem_cg= roup *memcg, */ if (get_nr_swap_pages() > 0) return true; + /* vswap doesn't contribute to nr_swap_pages */ + if (IS_ENABLED(CONFIG_VSWAP) && zswap_is_enabled()) + return true; } else { /* Is the memcg below its swap limit? 
*/ if (mem_cgroup_get_nr_swap_pages(memcg) > 0) @@ -2615,7 +2618,7 @@ static bool can_age_anon_pages(struct lruvec *lruvec, struct scan_control *sc) { /* Aging the anon LRU is valuable if swap is present: */ - if (total_swap_pages > 0) + if (total_swap_pages > 0 || (IS_ENABLED(CONFIG_VSWAP) && zswap_is_enabled= ())) return true; =20 /* Also valuable if anon pages can be demoted: */ diff --git a/mm/vswap.h b/mm/vswap.h index 094ff16cb5a4..5e6e5b88593c 100644 --- a/mm/vswap.h +++ b/mm/vswap.h @@ -7,23 +7,307 @@ #ifndef _MM_VSWAP_H #define _MM_VSWAP_H =20 + #include =20 +struct zswap_entry; + +static inline bool swap_is_vswap(struct swap_info_struct *si) +{ + return si->flags & SWP_VSWAP; +} + #ifdef CONFIG_VSWAP =20 +#include "swap.h" +#include "swap_table.h" + extern struct swap_info_struct *vswap_si; =20 -static inline bool swap_is_vswap(struct swap_info_struct *si) +/* + * Virtual table entry encoding for vswap clusters. + * + * Each entry in ci_dyn->virtual_table stores the backing type and + * pointer for a virtual swap slot. Tag in low 3 bits, payload in + * upper 61 bits. + * + * NONE: |----- 0000 ------|000| =E2=80=94 free / unbacked + * PHYS: |-- (type:5,off:N)|001| =E2=80=94 on a physical swapfile (sh= ifted) + * ZERO: |----- 0000 ------|010| =E2=80=94 zero-filled page + * ZSWAP: |--- zswap_entry* |011| =E2=80=94 compressed in zswap (tag i= n low bits) + * FOLIO: |--- folio* ------|100| =E2=80=94 in-memory only (tag in low= bits) + * + * PHYS payloads are shifted left by 3. Pointer payloads (ZSWAP, FOLIO) + * are stored directly with the tag OR'd into the low bits (kernel + * pointers are >=3D 8-byte aligned, same approach as xarray). 
+ */ +enum vswap_backing_type { + VSWAP_NONE =3D 0, + VSWAP_SWAPFILE =3D 1, + VSWAP_ZERO =3D 2, + VSWAP_ZSWAP =3D 3, + VSWAP_FOLIO =3D 4, +}; + +#define VTABLE_TAG_BITS 3 +#define VTABLE_TAG_MASK ((1UL << VTABLE_TAG_BITS) - 1) + +static inline enum vswap_backing_type vtable_type(unsigned long vt) { - return si->flags & SWP_VSWAP; + return vt & VTABLE_TAG_MASK; } =20 -#else +static inline unsigned long vtable_payload(unsigned long vt) +{ + return vt >> VTABLE_TAG_BITS; +} =20 -static inline bool swap_is_vswap(struct swap_info_struct *si) +static inline unsigned long vtable_mk(enum vswap_backing_type type, + unsigned long payload) +{ + return (payload << VTABLE_TAG_BITS) | type; +} + +static inline unsigned long vtable_mk_none(void) +{ + return 0; +} + +static inline unsigned long vtable_mk_zero(void) +{ + return VSWAP_ZERO; +} + +static inline unsigned long vtable_mk_zswap(struct zswap_entry *ze) +{ + return (unsigned long)ze | VSWAP_ZSWAP; +} + +static inline struct zswap_entry *vtable_to_zswap(unsigned long vt) +{ + VM_WARN_ON(vtable_type(vt) !=3D VSWAP_ZSWAP); + return (struct zswap_entry *)(vt & ~VTABLE_TAG_MASK); +} + +static inline unsigned long vtable_mk_folio(struct folio *folio) +{ + return (unsigned long)folio | VSWAP_FOLIO; +} + +static inline struct folio *vtable_to_folio(unsigned long vt) +{ + VM_WARN_ON(vtable_type(vt) !=3D VSWAP_FOLIO); + return (struct folio *)(vt & ~VTABLE_TAG_MASK); +} + +/* Virtual table accessors */ + +static inline unsigned long __vtable_get(struct swap_cluster_info_dynamic = *ci_dyn, + unsigned int off) +{ + VM_WARN_ON_ONCE(off >=3D SWAPFILE_CLUSTER); + return atomic_long_read(&ci_dyn->virtual_table[off]); +} + +static inline void __vtable_set(struct swap_cluster_info_dynamic *ci_dyn, + unsigned int off, unsigned long vt) +{ + VM_WARN_ON_ONCE(off >=3D SWAPFILE_CLUSTER); + atomic_long_set(&ci_dyn->virtual_table[off], vt); +} + +/* + * Lock a vswap cluster and return the dynamic info + slot offset. 
+ * Returns NULL if cluster not found. + * Caller must spin_unlock(&ci_dyn->ci.lock) when done. + */ +static inline struct swap_cluster_info_dynamic * +vswap_lock_cluster(swp_entry_t entry, unsigned int *voff) +{ + struct swap_cluster_info *ci; + struct swap_cluster_info_dynamic *ci_dyn; + + ci =3D __swap_entry_to_cluster(entry); + if (!ci) + return NULL; + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + *voff =3D swp_cluster_offset(entry); + spin_lock(&ci->lock); + return ci_dyn; +} + +/* Zswap entry helpers =E2=80=94 store/load/erase in virtual_table */ + +void vswap_release_backing(struct swap_cluster_info *ci, + unsigned int ci_start, unsigned int nr); + +static inline void vswap_zswap_store(swp_entry_t entry, + struct zswap_entry *ze) +{ + struct swap_cluster_info_dynamic *ci_dyn; + unsigned int voff; + + ci_dyn =3D vswap_lock_cluster(entry, &voff); + if (!ci_dyn) + return; + vswap_release_backing(&ci_dyn->ci, voff, 1); + __vtable_set(ci_dyn, voff, vtable_mk_zswap(ze)); + spin_unlock(&ci_dyn->ci.lock); +} + +static inline struct zswap_entry *vswap_zswap_load(swp_entry_t entry) +{ + struct swap_cluster_info_dynamic *ci_dyn; + unsigned int voff; + unsigned long vt; + + ci_dyn =3D vswap_lock_cluster(entry, &voff); + if (!ci_dyn) + return NULL; + vt =3D __vtable_get(ci_dyn, voff); + spin_unlock(&ci_dyn->ci.lock); + + if (vtable_type(vt) !=3D VSWAP_ZSWAP) + return NULL; + return vtable_to_zswap(vt); +} + + +void vswap_store_folio(swp_entry_t entry, struct folio *folio); +void vswap_prepare_writeout(swp_entry_t entry, struct folio *folio); + +/* + * Check that all nr vtable entries starting at entry have the same + * backing type. Returns the number of matching entries (< nr on + * mismatch). 
+ */ +static inline int vswap_check_backing(swp_entry_t entry, int nr, + enum vswap_backing_type *typep) +{ + struct swap_cluster_info_dynamic *ci_dyn; + enum vswap_backing_type first_type; + unsigned int voff; + unsigned long vt; + int i; + + ci_dyn =3D vswap_lock_cluster(entry, &voff); + if (!ci_dyn) + return 0; + + for (i =3D 0; i < nr; i++) { + vt =3D __vtable_get(ci_dyn, voff + i); + if (!i) + first_type =3D vtable_type(vt); + else if (vtable_type(vt) !=3D first_type) + break; + } + spin_unlock(&ci_dyn->ci.lock); + + if (typep) + *typep =3D first_type; + return i; +} + +static inline bool vswap_can_swapin_thp(swp_entry_t entry, int nr) +{ + enum vswap_backing_type type; + + return vswap_check_backing(entry, nr, &type) =3D=3D nr && + type =3D=3D VSWAP_ZERO; +} + +static inline int vswap_cluster_alloc_vtable(struct swap_cluster_info_dyna= mic *ci_dyn) +{ + ci_dyn->virtual_table =3D kcalloc(SWAPFILE_CLUSTER, + sizeof(*ci_dyn->virtual_table), + GFP_ATOMIC); + return ci_dyn->virtual_table ? 
0 : -ENOMEM; +} + +static inline void vswap_cluster_free_vtable(struct swap_cluster_info *ci) +{ + struct swap_cluster_info_dynamic *ci_dyn; + + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + kfree(ci_dyn->virtual_table); + ci_dyn->virtual_table =3D NULL; +} + +/* Low-level setter for callers already holding the cluster lock */ +static inline void vswap_set_zswap(struct swap_cluster_info *ci, + unsigned int ci_off, + struct zswap_entry *ze) +{ + struct swap_cluster_info_dynamic *ci_dyn; + + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + __vtable_set(ci_dyn, ci_off, vtable_mk_zswap(ze)); +} + +/* Zeromap helpers =E2=80=94 test/set ZERO backing in virtual_table */ + +static inline bool vswap_test_zero(struct swap_cluster_info *ci, + unsigned int ci_off) +{ + struct swap_cluster_info_dynamic *ci_dyn; + + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + return vtable_type(__vtable_get(ci_dyn, ci_off)) =3D=3D VSWAP_ZERO; +} + +static inline void vswap_set_zero(struct swap_cluster_info *ci, + unsigned int ci_off) +{ + struct swap_cluster_info_dynamic *ci_dyn; + + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + __vtable_set(ci_dyn, ci_off, vtable_mk_zero()); +} + +#else /* !CONFIG_VSWAP */ + +static inline void vswap_release_backing(struct swap_cluster_info *ci, + unsigned int ci_start, + unsigned int nr) {} + +static inline void vswap_zswap_store(swp_entry_t entry, + struct zswap_entry *ze) {} + +static inline struct zswap_entry *vswap_zswap_load(swp_entry_t entry) +{ + return NULL; +} + +static inline void vswap_store_folio(swp_entry_t entry, + struct folio *folio) {} +static inline void vswap_prepare_writeout(swp_entry_t entry, + struct folio *folio) {} + +static inline bool vswap_can_swapin_thp(swp_entry_t entry, int nr) +{ + return false; +} + +struct swap_cluster_info_dynamic; +static inline int vswap_cluster_alloc_vtable(struct swap_cluster_info_dyna= mic *ci_dyn) +{ + return 0; +} 
+ +static inline void vswap_cluster_free_vtable(struct swap_cluster_info *ci)= {} + +static inline void vswap_set_zswap(struct swap_cluster_info *ci, + unsigned int ci_off, + struct zswap_entry *ze) {} + +static inline bool vswap_test_zero(struct swap_cluster_info *ci, + unsigned int ci_off) { return false; } =20 +static inline void vswap_set_zero(struct swap_cluster_info *ci, + unsigned int ci_off) {} + #endif /* CONFIG_VSWAP */ #endif /* _MM_VSWAP_H */ diff --git a/mm/zswap.c b/mm/zswap.c index 993406074d58..c57bf0246bb2 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -38,6 +38,7 @@ #include =20 #include "swap.h" +#include "vswap.h" #include "internal.h" =20 /********************************* @@ -762,7 +763,7 @@ static void zswap_entry_cache_free(struct zswap_entry *= entry) * Carries out the common pattern of freeing an entry's zsmalloc allocatio= n, * freeing the entry itself, and decrementing the number of stored pages. */ -static void zswap_entry_free(struct zswap_entry *entry) +void zswap_entry_free(struct zswap_entry *entry) { zswap_lru_del(&zswap_list_lru, entry); zs_free(entry->pool->zs_pool, entry->handle); @@ -994,16 +995,21 @@ static int zswap_writeback_entry(struct zswap_entry *= entry, struct swap_info_struct *si; int ret =3D 0; =20 + /* try to allocate swap cache folio */ si =3D get_swap_device(swpentry); if (!si) return -EEXIST; =20 + /* + * Vswap entries have no physical backing =E2=80=94 writeback would fail + * and SIGBUS the caller. Bail before we waste a swap-cache folio + * allocation. 
+ */ if (si->flags & SWP_VSWAP) { put_swap_device(si); return -EINVAL; } =20 - /* try to allocate swap cache folio */ mpol =3D get_task_policy(current); folio =3D swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol, NO_INTERLEAVE_INDEX); @@ -1206,6 +1212,18 @@ static unsigned long zswap_shrinker_count(struct shr= inker *shrinker, if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg)) return 0; =20 + /* + * With CONFIG_VSWAP and zswap enabled, every zswap entry is + * vswap-backed and needs a physical swap slot allocated on demand + * (via folio_realloc_swap) for writeback. If no physical slots are + * available, writeback will fail =E2=80=94 skip the shrinker to avoid + * spinning on entries we cannot drain. Vanilla zswap-on-swapfile is + * unaffected because every zswap entry already has a backing slot; + * gate on CONFIG_VSWAP so the check compiles out there. + */ + if (IS_ENABLED(CONFIG_VSWAP) && !get_nr_swap_pages()) + return 0; + /* * The shrinker resumes swap writeback, which will enter block * and may enter fs. 
XXX: Harmonize with vmscan.c __GFP_FS @@ -1416,25 +1434,25 @@ static bool zswap_store_page(struct page *page, if (!zswap_compress(page, entry, pool)) goto compress_failed; =20 - old =3D xa_store(swap_zswap_tree(page_swpentry), - swp_offset(page_swpentry), - entry, GFP_KERNEL); - if (xa_is_err(old)) { - int err =3D xa_err(old); + if (swap_is_vswap(__swap_entry_to_info(page_swpentry))) { + vswap_zswap_store(page_swpentry, entry); + } else { + old =3D xa_store(swap_zswap_tree(page_swpentry), + swp_offset(page_swpentry), + entry, GFP_KERNEL); + if (xa_is_err(old)) { + int err =3D xa_err(old); + + WARN_ONCE(err !=3D -ENOMEM, + "unexpected xarray error: %d\n", err); + zswap_reject_alloc_fail++; + goto store_failed; + } =20 - WARN_ONCE(err !=3D -ENOMEM, "unexpected xarray error: %d\n", err); - zswap_reject_alloc_fail++; - goto store_failed; + if (old) + zswap_entry_free(old); } =20 - /* - * We may have had an existing entry that became stale when - * the folio was redirtied and now the new version is being - * swapped out. Get rid of the old. - */ - if (old) - zswap_entry_free(old); - /* * The entry is successfully compressed and stored in the tree, there is * no further possibility of failure. Grab refs to the pool and objcg, @@ -1533,6 +1551,8 @@ bool zswap_store(struct folio *folio) =20 count_vm_events(ZSWPOUT, nr_pages); =20 + /* zswap_store_page stores directly in virtual_table for vswap */ + ret =3D true; =20 put_pool: @@ -1547,8 +1567,14 @@ bool zswap_store(struct folio *folio) * the possibly stale entries which were previously stored at the * offsets corresponding to each page of the folio. Otherwise, * writeback could overwrite the new data in the swapfile. + * + * vswap stores zswap entries directly in the per-slot virtual_table + * (no per-device xarray), so the stale-entry cleanup is implicit: + * a successful vswap_zswap_store overwrites the slot via + * vswap_release_backing, and a failed store leaves the old backing + * untouched. 
*/ - if (!ret) { + if (!ret && !swap_is_vswap(__swap_entry_to_info(swp))) { unsigned type =3D swp_type(swp); pgoff_t offset =3D swp_offset(swp); struct zswap_entry *entry; @@ -1588,8 +1614,7 @@ bool zswap_store(struct folio *folio) int zswap_load(struct folio *folio) { swp_entry_t swp =3D folio->swap; - pgoff_t offset =3D swp_offset(swp); - struct xarray *tree =3D swap_zswap_tree(swp); + struct swap_info_struct *si =3D __swap_entry_to_info(swp); struct zswap_entry *entry; =20 VM_WARN_ON_ONCE(!folio_test_locked(folio)); @@ -1599,16 +1624,25 @@ int zswap_load(struct folio *folio) return -ENOENT; =20 /* - * Large folios should not be swapped in while zswap is being used, as - * they are not properly handled. Zswap does not properly load large - * folios, and a large folio may only be partially in zswap. + * zswap_load() does not support large folios. For non-vswap + * entries this is unexpected on the swapin path: WARN and + * sigbus. For vswap entries vswap_can_swapin_thp() has already + * filtered out ZSWAP-backed THPs, so the large folio here is + * zero- or phys-backed; return -ENOENT to fall through to the + * phys/zero IO path. */ - if (WARN_ON_ONCE(folio_test_large(folio))) { - folio_unlock(folio); - return -EINVAL; + if (folio_test_large(folio)) { + if (WARN_ON_ONCE(!swap_is_vswap(si))) { + folio_unlock(folio); + return -EINVAL; + } + return -ENOENT; } =20 - entry =3D xa_load(tree, offset); + if (swap_is_vswap(si)) + entry =3D vswap_zswap_load(swp); + else + entry =3D xa_load(swap_zswap_tree(swp), swp_offset(swp)); if (!entry) return -ENOENT; =20 @@ -1623,16 +1657,14 @@ int zswap_load(struct folio *folio) if (entry->objcg) count_objcg_events(entry->objcg, ZSWPIN, 1); =20 - /* - * We are reading into the swapcache, invalidate zswap entry. - * The swapcache is the authoritative owner of the page and - * its mappings, and the pressure that results from having two - * in-memory copies outweighs any benefits of caching the - * compression work. 
- */ folio_mark_dirty(folio); - xa_erase(tree, offset); - zswap_entry_free(entry); + + if (swap_is_vswap(si)) { + vswap_store_folio(swp, folio); + } else { + xa_erase(swap_zswap_tree(swp), swp_offset(swp)); + zswap_entry_free(entry); + } =20 folio_unlock(folio); return 0; --=20 2.53.0-Meta From nobody Mon Jun 8 14:35:25 2026 Received: from mail-oi1-f180.google.com (mail-oi1-f180.google.com [209.85.167.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D1CDB35E94F for ; Thu, 28 May 2026 21:30:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.167.180 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780003811; cv=none; b=Sj9qtCQBFRVVt5b/o+SS7QDCYm6erF7dFhZZS3z2m/6KgOqe2EDJPkHhGBQC5lyfj6zx2A8cFj8pJYQdMDVZWxeh+UcHMwqH/RVa/49IizlTPo3qaTgr6fMW+FcKkbfzdD0RVLX18UqTwYVGoi52wubBJWhaBKbLWw9SUicz0OU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780003811; c=relaxed/simple; bh=szH844L1SzzWNWGk6khkoWDAgAykrJipybh1aHsHRkM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=J08y/LJGvWfLM3ikd7KoBdSfe8etsXqjU5JUfja6AibHZ/U0nltYjiu0q1pTJpvURyaRWNSZrnZF+10qS3zjQd3RidZepnmNSPmueVkHvW4PWmL4O8CkGIyVJ/f/R9RoJR7NrA5+k4TY8gQhS5NgzNB/qC4dqjZn0s3zJo5kGE0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=beDmBhnw; arc=none smtp.client-ip=209.85.167.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="beDmBhnw" Received: 
by mail-oi1-f180.google.com with SMTP id 5614622812f47-479dd56d016so10089701b6e.3 for ; Thu, 28 May 2026 14:30:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1780003803; x=1780608603; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=1HPWXw9VKiy+aaD0c55tVeHWQbTXncnbogrmXqQQheo=; b=beDmBhnw1s6pL1nJ3yKqym3cTJbg3F+X1K4KlkKwvMn1dibleZFz9Q+hA1XkdJoJNB KMePuj5E9ljMVKReTCF54lHSJaHy8I/4VU71J27JlIU0Ed519KADPKLSlKh7c1I8+uTm 1V6Zbed0DSpNf2+iWQM4+dv/koXt9nt23N2eNcxAILGMMa6zJAJr4lJx6YXokQ/i9gJv U27sRIiP8uGK4F7uOMQJKThXbLrwTYlJT5ujQe9LizNPtI7b0p8d/BGN6B8djMxDL3K6 5DS58vH/Nl3a6LodNobZOKEisVeGvk3suxih9zP3dOIPSl5d6cLBNE6hUqJAFgyMjWBV 9cJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1780003803; x=1780608603; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=1HPWXw9VKiy+aaD0c55tVeHWQbTXncnbogrmXqQQheo=; b=M2RbWiAeICD0bYQ8OdHS95WKrWZCZ1kKb1vry/foEMRu0ZF6JaLE/RsKRDW8ZGzvAJ 6N4HJzTaABrRy+/46NsGak+NEDEd6s6bFFr76eqhWQfOe2/1b32VrB3VyiY/EoOqbEcW GHL6QFq7g1YorwCluPGHn7TJzNjEJgdMCBUrAPLFDcYHQ4y8KoquNwNwHy5LoGUi/EM1 zyzjhnZTXtXMaSeLPFFtIWDvsrrjdONL6dYBsMJBDifNrPeE6SpN+K6Z9TBQcsnidxLl 1opM/wvMZ++qRg6YFFcz5tIUoJGcDc+JyL8wXHfk+lMHZ4brMHJI3/dV8Bzv/hv5TvCP qcww== X-Forwarded-Encrypted: i=1; AFNElJ+kEiTgmpkU3Lr+bqZm41LzEyJXcXnLqL+vKjfneN4AoISmOq84gubh4yoxVBnt7Ta5/UQzlfXDPzOpfCI=@vger.kernel.org X-Gm-Message-State: AOJu0Yx1GJV6yA2j+V3giwX9HEJxdmOueRy6hi1m3FVCgEQ/c3AHy3wQ Hu7TTOB9u53HHPNGiraIJ2F97nPhk6kfAue2FvZpraj6aJQgA+UTBMBh X-Gm-Gg: Acq92OEm61vqE8ltpZhxGDlmeBvZCZjeRGUzMs3GOoj3h60FkzOuRLTsiKjWmQ1MFdQ +oRanbZlfYnA3zU5vQfJpPku+P4z6sdfxLz9cbQurdp4O4ZyXd0GwhXMlGJVeKxGGcBlxzkCdmY +7uXGYDO16BR49CAnbTFh8+1qdVpe6VrcYnUJSos0N5ZA70EtF/ZZhHdhB1RQaJG2XCIpP5hyQ/ 
oDnVu4uya9+lv9gx4xfEhwryLWT8aY74mKPNNpzQqXKZeC45/K80NSWdv78B8rRZCWC9catA/My MfvtjDYhBrw5djUgzVav6bB6cFgtQHc/SpbHDnIq3y4HUOdso51UszaCdL+tUzv+2BZOw9oypW+ z1hlKL+bJuZcziBNEk9P6xDquaFXHf6VGiauFMASeT5YQnqDIyDQY7c3oKkQ1nyfpolHGKo2Nrl klAPRYXhJRWgdmp7SLFWYSMO0Bt++f//QLD3m2lBIg5DaHNmKnEBSOWXhzQfRPmb2Lhg== X-Received: by 2002:a05:6808:13cb:b0:47b:c8d0:514a with SMTP id 5614622812f47-485e6e6ccf3mr181931b6e.46.1780003802480; Thu, 28 May 2026 14:30:02 -0700 (PDT) Received: from localhost ([2a03:2880:10ff:5::]) by smtp.gmail.com with ESMTPSA id 586e51a60fabf-43c8961981bsm131440fac.9.2026.05.28.14.30.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 28 May 2026 14:30:01 -0700 (PDT) From: Nhat Pham To: kasong@tencent.com Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org, apopple@nvidia.com, axelrasmussen@google.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, bhe@redhat.com, byungchul@sk.com, cgroups@vger.kernel.org, chengming.zhou@linux.dev, chrisl@kernel.org, corbet@lwn.net, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jannh@google.com, joshua.hahnjy@gmail.com, lance.yang@linux.dev, lenb@kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-pm@vger.kernel.org, lorenzo.stoakes@oracle.com, matthew.brost@intel.com, mhocko@suse.com, muchun.song@linux.dev, npache@redhat.com, nphamcs@gmail.com, pavel@kernel.org, peterx@redhat.com, peterz@infradead.org, pfalcato@suse.de, rafael@kernel.org, rakie.kim@sk.com, roman.gushchin@linux.dev, rppt@kernel.org, ryan.roberts@arm.com, shakeel.butt@linux.dev, shikemeng@huaweicloud.com, surenb@google.com, tglx@kernel.org, vbabka@suse.cz, weixugc@google.com, ying.huang@linux.alibaba.com, yosry.ahmed@linux.dev, yuanchu@google.com, zhengqi.arch@bytedance.com, ziy@nvidia.com, kernel-team@meta.com, riel@surriel.com, haowenchao22@gmail.com Subject: [RFC PATCH 3/5] mm, swap: support physical swap as a vswap backend Date: Thu, 28 May 2026 
14:29:27 -0700 Message-ID: <20260528212955.1912856-4-nphamcs@gmail.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260528212955.1912856-1-nphamcs@gmail.com> References: <20260528212955.1912856-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Add physical swap as a backend for the virtual swap layer. Without this, vswap can only back entries with zswap or zero pages, and a zswap_store failure has nowhere to fall back to =E2=80=94 the page stays dirty in swap cache (AOP_WRITEPAGE_ACTIVATE). With physical swap backing, vswap can allocate a physical slot on demand when needed: as a fallback for zswap_store failures, or as the destination for zswap writeback. Each vswap entry's physical slot is tracked via a Pointer-tagged swap_table entry on the physical cluster (rmap back to the vswap entry). Suggested-by: Kairui Song Signed-off-by: Nhat Pham --- include/linux/swap.h | 10 ++ mm/memcontrol.c | 8 +- mm/memory.c | 14 +- mm/page_io.c | 130 ++++++++++---- mm/swap.h | 11 ++ mm/swap_table.h | 1 + mm/swapfile.c | 398 ++++++++++++++++++++++++++++++++++++++++--- mm/vswap.h | 138 ++++++++++++++- mm/zswap.c | 79 ++++++--- 9 files changed, 698 insertions(+), 91 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ee9b1e76b058..3fb55485fc76 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -449,6 +449,16 @@ extern int swp_swapcount(swp_entry_t entry); struct backing_dev_info; extern struct swap_info_struct *get_swap_device(swp_entry_t entry); sector_t swap_folio_sector(struct folio *folio); +sector_t swap_entry_sector(swp_entry_t entry); + +#ifdef CONFIG_VSWAP +swp_entry_t folio_realloc_swap(struct folio *folio); +#else +static inline swp_entry_t folio_realloc_swap(struct folio *folio) +{ + return (swp_entry_t){}; +} +#endif =20 /* * If there is an existing swap 
slot reference (swap entry) and the caller diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a3ad83c229f7..7492879b3239 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5541,7 +5541,13 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup = *memcg) { long nr_swap_pages; =20 - /* vswap provides unbounded virtual swap when zswap is enabled */ + /* + * vswap provides unbounded virtual swap when zswap is enabled. + * (No per-memcg may_zswap check =E2=80=94 mem_cgroup_may_zswap can sleep + * via __mem_cgroup_flush_stats, but this is callable from + * rcu_read_lock contexts like cachestat(2) =E2=86=92 workingset_test_rec= ent. + * The per-memcg swap.max is still enforced at charge time.) + */ if (IS_ENABLED(CONFIG_VSWAP) && zswap_is_enabled()) return PAGE_COUNTER_MAX; =20 diff --git a/mm/memory.c b/mm/memory.c index c3050e49b086..d15c748d4f90 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -89,6 +89,7 @@ #include "pgalloc-track.h" #include "internal.h" #include "swap.h" +#include "vswap.h" =20 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST) #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame fo= r last_cpupid. @@ -4523,7 +4524,14 @@ static inline bool should_try_to_free_swap(struct sw= ap_info_struct *si, * are fast, and meanwhile, swap cache pinning the slot deferring the * release of metadata or fragmentation is a more critical issue. */ - if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) + if (swap_entry_backend_has_flag(si, folio->swap, SWP_SYNCHRONOUS_IO)) + return true; + /* + * Non-swapfile backends cannot be reused for future swapouts. + * Free the swap slot unless backed by contiguous physical swap. 
+ */ + if (swap_is_vswap(si) && + !vswap_swapfile_backed(folio->swap, folio_nr_pages(folio))) return true; if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) || folio_test_mlocked(folio)) @@ -4832,7 +4840,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) swap_update_readahead(folio, vma, vmf->address); if (!folio) { /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */ - if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) + if (swap_entry_backend_has_flag(si, entry, SWP_SYNCHRONOUS_IO)) folio =3D swapin_sync(entry, GFP_HIGHUSER_MOVABLE, thp_swapin_suitable_orders(vmf) | BIT(0), vmf, NULL, 0); @@ -5007,7 +5015,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ exclusive =3D true; } else if (exclusive && folio_test_writeback(folio) && - data_race(si->flags & SWP_STABLE_WRITES)) { + swap_entry_backend_has_flag(si, entry, SWP_STABLE_WRITES)) { /* * This is tricky: not all swap backends support * concurrent page modifications while under writeback. diff --git a/mm/page_io.c b/mm/page_io.c index b3c7e56c8eed..a65734564819 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -260,6 +260,7 @@ static void swap_zeromap_folio_clear(struct folio *foli= o) */ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug) { + swp_entry_t phys; int ret =3D 0; =20 if (folio_free_swap(folio)) @@ -292,6 +293,12 @@ int swap_writeout(struct folio *folio, struct swap_ioc= b **swap_plug) */ swap_zeromap_folio_clear(folio); =20 + /* + * For vswap: release stale non-swapfile backends before writeout. + * If already PHYS-backed (contiguous), keep it. Otherwise free old + * backing (e.g. ZSWAP from a previous swapout cycle) and set FOLIO + * so zswap_store or folio_realloc_swap starts clean. 
+ */ if (swap_is_vswap(__swap_entry_to_info(folio->swap))) vswap_prepare_writeout(folio->swap, folio); =20 @@ -309,8 +316,19 @@ int swap_writeout(struct folio *folio, struct swap_ioc= b **swap_plug) rcu_read_unlock(); =20 if (swap_is_vswap(__swap_entry_to_info(folio->swap))) { - folio_mark_dirty(folio); - return AOP_WRITEPAGE_ACTIVATE; + /* + * zswap_store may have partially populated the vtable with + * ZSWAP entries before failing. Reset to FOLIO (freeing + * those partial entries) so folio_realloc_swap can install + * PHYS cleanly without leaking zswap_entry pointers. + */ + vswap_prepare_writeout(folio->swap, folio); + phys =3D folio_realloc_swap(folio); + if (!phys.val) { + folio_mark_dirty(folio); + return AOP_WRITEPAGE_ACTIVATE; + } + return __swap_writepage_phys(folio, swap_plug, phys); } =20 return __swap_writepage(folio, swap_plug); @@ -402,12 +420,12 @@ static void sio_write_complete(struct kiocb *iocb, lo= ng ret) mempool_free(sio, sio_pool); } =20 -static void swap_writepage_fs(struct folio *folio, struct swap_iocb **swap= _plug) +static void swap_writepage_fs(struct folio *folio, + struct swap_info_struct *sis, loff_t pos, + struct swap_iocb **swap_plug) { struct swap_iocb *sio =3D swap_plug ? 
*swap_plug : NULL; - struct swap_info_struct *sis =3D __swap_entry_to_info(folio->swap); struct file *swap_file =3D sis->swap_file; - loff_t pos =3D swap_dev_pos(folio->swap); =20 count_swpout_vm_event(folio); folio_start_writeback(folio); @@ -439,13 +457,13 @@ static void swap_writepage_fs(struct folio *folio, st= ruct swap_iocb **swap_plug) } =20 static void swap_writepage_bdev_sync(struct folio *folio, - struct swap_info_struct *sis) + struct swap_info_struct *sis, sector_t sector) { struct bio_vec bv; struct bio bio; =20 bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_WRITE | REQ_SWAP); - bio.bi_iter.bi_sector =3D swap_folio_sector(folio); + bio.bi_iter.bi_sector =3D sector; bio_add_folio_nofail(&bio, folio, folio_size(folio), 0); =20 bio_associate_blkg_from_page(&bio, folio); @@ -475,6 +493,42 @@ static void swap_writepage_bdev_async(struct folio *fo= lio, submit_bio(bio); } =20 +#ifdef CONFIG_VSWAP +int __swap_writepage_phys(struct folio *folio, struct swap_iocb **swap_plu= g, + swp_entry_t phys_entry) +{ + struct swap_info_struct *sis =3D __swap_entry_to_info(phys_entry); + sector_t sector =3D swap_entry_sector(phys_entry); + struct bio *bio; + + VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); + VM_WARN_ON(swap_is_vswap(sis)); + + if (data_race(sis->flags & SWP_FS_OPS)) { + swap_writepage_fs(folio, sis, swap_dev_pos(phys_entry), + swap_plug); + return 0; + } + + if (data_race(sis->flags & SWP_SYNCHRONOUS_IO)) { + swap_writepage_bdev_sync(folio, sis, sector); + return 0; + } + + bio =3D bio_alloc(sis->bdev, 1, REQ_OP_WRITE | REQ_SWAP, GFP_NOIO); + bio->bi_iter.bi_sector =3D sector; + bio->bi_end_io =3D end_swap_bio_write; + bio_add_folio_nofail(bio, folio, folio_size(folio), 0); + + bio_associate_blkg_from_page(bio, folio); + count_swpout_vm_event(folio); + folio_start_writeback(folio); + folio_unlock(folio); + submit_bio(bio); + return 0; +} +#endif + int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug) { struct swap_info_struct *sis =3D 
__swap_entry_to_info(folio->swap); @@ -493,14 +547,10 @@ int __swap_writepage(struct folio *folio, struct swap= _iocb **swap_plug) * is safe. */ if (data_race(sis->flags & SWP_FS_OPS)) - swap_writepage_fs(folio, swap_plug); - /* - * ->flags can be updated non-atomically, - * but that will never affect SWP_SYNCHRONOUS_IO, so the data_race - * is safe. - */ + swap_writepage_fs(folio, sis, swap_dev_pos(folio->swap), + swap_plug); else if (data_race(sis->flags & SWP_SYNCHRONOUS_IO)) - swap_writepage_bdev_sync(folio, sis); + swap_writepage_bdev_sync(folio, sis, swap_folio_sector(folio)); else swap_writepage_bdev_async(folio, sis); return 0; @@ -624,11 +674,11 @@ static bool swap_read_folio_zeromap(struct folio *fol= io) return true; } =20 -static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plu= g) +static void swap_read_folio_fs(struct folio *folio, + struct swap_info_struct *sis, loff_t pos, + struct swap_iocb **plug) { - struct swap_info_struct *sis =3D __swap_entry_to_info(folio->swap); struct swap_iocb *sio =3D NULL; - loff_t pos =3D swap_dev_pos(folio->swap); =20 if (plug) sio =3D *plug; @@ -659,13 +709,13 @@ static void swap_read_folio_fs(struct folio *folio, s= truct swap_iocb **plug) } =20 static void swap_read_folio_bdev_sync(struct folio *folio, - struct swap_info_struct *sis) + struct swap_info_struct *sis, sector_t sector) { struct bio_vec bv; struct bio bio; =20 bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_READ); - bio.bi_iter.bi_sector =3D swap_folio_sector(folio); + bio.bi_iter.bi_sector =3D sector; bio_add_folio_nofail(&bio, folio, folio_size(folio), 0); /* * Keep this task valid during swap readpage because the oom killer may @@ -681,12 +731,12 @@ static void swap_read_folio_bdev_sync(struct folio *f= olio, } =20 static void swap_read_folio_bdev_async(struct folio *folio, - struct swap_info_struct *sis) + struct swap_info_struct *sis, sector_t sector) { struct bio *bio; =20 bio =3D bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL); - 
bio->bi_iter.bi_sector =3D swap_folio_sector(folio); + bio->bi_iter.bi_sector =3D sector; bio->bi_end_io =3D end_swap_bio_read; bio_add_folio_nofail(bio, folio, folio_size(folio), 0); count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN); @@ -695,6 +745,22 @@ static void swap_read_folio_bdev_async(struct folio *f= olio, submit_bio(bio); } =20 +static void swap_read_folio_phys(struct folio *folio, swp_entry_t phys_ent= ry, + struct swap_iocb **plug) +{ + struct swap_info_struct *sis =3D __swap_entry_to_info(phys_entry); + sector_t sector =3D swap_entry_sector(phys_entry); + + zswap_folio_swapin(folio); + + if (data_race(sis->flags & SWP_FS_OPS)) + swap_read_folio_fs(folio, sis, swap_dev_pos(phys_entry), plug); + else if (data_race(sis->flags & SWP_SYNCHRONOUS_IO)) + swap_read_folio_bdev_sync(folio, sis, sector); + else + swap_read_folio_bdev_async(folio, sis, sector); +} + void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { struct swap_info_struct *sis =3D __swap_entry_to_info(folio->swap); @@ -702,6 +768,7 @@ void swap_read_folio(struct folio *folio, struct swap_i= ocb **plug) bool workingset =3D folio_test_workingset(folio); unsigned long pflags; bool in_thrashing; + swp_entry_t phys; =20 VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); @@ -726,20 +793,15 @@ void swap_read_folio(struct folio *folio, struct swap= _iocb **plug) if (zswap_load(folio) !=3D -ENOENT) goto finish; =20 - if (unlikely(sis->flags & SWP_VSWAP)) { - folio_unlock(folio); - goto finish; - } - - /* We have to read from slower devices. Increase zswap protection. 
*/ - zswap_folio_swapin(folio); - - if (data_race(sis->flags & SWP_FS_OPS)) { - swap_read_folio_fs(folio, plug); - } else if (synchronous) { - swap_read_folio_bdev_sync(folio, sis); + if (swap_is_vswap(sis)) { + phys =3D vswap_to_phys(folio->swap); + if (!phys.val) { + folio_unlock(folio); + goto finish; + } + swap_read_folio_phys(folio, phys, plug); } else { - swap_read_folio_bdev_async(folio, sis); + swap_read_folio_phys(folio, folio->swap, plug); } =20 finish: diff --git a/mm/swap.h b/mm/swap.h index 640413e30880..50c90a35382c 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -285,6 +285,17 @@ static inline void swap_read_unplug(struct swap_iocb *= plug) void swap_write_unplug(struct swap_iocb *sio); int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug); int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug); +#ifdef CONFIG_VSWAP +int __swap_writepage_phys(struct folio *folio, struct swap_iocb **swap_plu= g, + swp_entry_t phys_entry); +#else +static inline int __swap_writepage_phys(struct folio *folio, + struct swap_iocb **swap_plug, + swp_entry_t phys_entry) +{ + return -EINVAL; +} +#endif =20 /* linux/mm/swap_state.c */ extern struct address_space swap_space __read_mostly; diff --git a/mm/swap_table.h b/mm/swap_table.h index b0e7ef9c966b..814bc75597a0 100644 --- a/mm/swap_table.h +++ b/mm/swap_table.h @@ -406,6 +406,7 @@ static inline swp_entry_t swp_tb_ptr_to_swp_entry(unsig= ned long swp_tb) return entry; } #else +#define SWP_RMAP_CACHE_ONLY 0UL static inline bool swp_tb_is_pointer(unsigned long swp_tb) { return false; diff --git a/mm/swapfile.c b/mm/swapfile.c index c90d83fd628a..a0976be6a12b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -145,10 +145,16 @@ static DEFINE_PER_CPU(struct percpu_vswap_cluster, pe= rcpu_vswap_cluster) =3D { static bool vswap_alloc(struct folio *folio); static void vswap_free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci); +static void vswap_mark_cache_only(struct swap_info_struct 
*si, + struct swap_cluster_info *ci, + unsigned int ci_off); #else static inline bool vswap_alloc(struct folio *folio) { return false; } static inline void vswap_free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) {} +static inline void vswap_mark_cache_only(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned int ci_off) {} #endif =20 /* May return NULL on invalid type, caller must check for NULL return */ @@ -350,19 +356,24 @@ offset_to_swap_extent(struct swap_info_struct *sis, u= nsigned long offset) BUG(); } =20 -sector_t swap_folio_sector(struct folio *folio) +sector_t swap_entry_sector(swp_entry_t entry) { - struct swap_info_struct *sis =3D __swap_entry_to_info(folio->swap); + struct swap_info_struct *sis =3D __swap_entry_to_info(entry); struct swap_extent *se; sector_t sector; pgoff_t offset; =20 - offset =3D swp_offset(folio->swap); + offset =3D swp_offset(entry); se =3D offset_to_swap_extent(sis, offset); sector =3D se->start_block + (offset - se->start_page); return sector << (PAGE_SHIFT - 9); } =20 +sector_t swap_folio_sector(struct folio *folio) +{ + return swap_entry_sector(folio->swap); +} + /* * swap allocation tell device that a cluster of swap can now be discarded, * to allow the swap device to optimize its wear-levelling. @@ -880,6 +891,60 @@ static int swap_cluster_setup_bad_slot(struct swap_inf= o_struct *si, return ret; } =20 +/* + * Try to reclaim a Pointer-tagged physical slot backing a vswap entry. + * The physical cluster lock must NOT be held. Returns < 0 on failure. 
+ */ +static int try_to_reclaim_vswap_backing(struct swap_info_struct *si, + unsigned long offset) +{ + struct swap_cluster_info *ci; + swp_entry_t vswap_entry, phys_entry; + struct folio *folio; + unsigned long swp_tb; + unsigned int ci_off; + + ci =3D swap_cluster_lock(si, offset); + if (!ci) + return -1; + ci_off =3D offset % SWAPFILE_CLUSTER; + swp_tb =3D __swap_table_get(ci, ci_off); + if (!swp_tb_is_pointer(swp_tb) || !(swp_tb & SWP_RMAP_CACHE_ONLY)) { + swap_cluster_unlock(ci); + return -1; + } + vswap_entry =3D swp_tb_ptr_to_swp_entry(swp_tb); + swap_cluster_unlock(ci); + + folio =3D swap_cache_get_folio(vswap_entry); + if (!folio) + return -1; + + if (!folio_trylock(folio)) { + folio_put(folio); + return -1; + } + + if (!folio_matches_swap_entry(folio, vswap_entry)) { + folio_unlock(folio); + folio_put(folio); + return -1; + } + + phys_entry =3D vswap_to_phys(vswap_entry); + if (!phys_entry.val || swp_offset(phys_entry) !=3D offset || + swp_type(phys_entry) !=3D si->type) { + folio_unlock(folio); + folio_put(folio); + return -1; + } + + vswap_store_folio(vswap_entry, folio); + folio_unlock(folio); + folio_put(folio); + return 0; +} + /* * Reclaim drops the ci lock, so the cluster may become unusable (freed or * stolen by a lower order). @usable will be set to false if that happens. 
@@ -903,8 +968,13 @@ static bool cluster_reclaim_range(struct swap_info_str= uct *si, spin_unlock(&ci->lock); do { swp_tb =3D swap_table_get(ci, offset % SWAPFILE_CLUSTER); - if (swp_tb_is_pointer(swp_tb)) - break; + if (swp_tb_is_pointer(swp_tb)) { + rcu_read_unlock(); + if (try_to_reclaim_vswap_backing(si, offset) < 0) + goto relock; + rcu_read_lock(); + continue; + } if (swp_tb_get_count(swp_tb)) break; if (swp_tb_is_folio(swp_tb)) @@ -912,6 +982,7 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, break; } while (++offset < end); rcu_read_unlock(); +relock: =20 /* Re-lookup: dynamic cluster may have been freed while lock was dropped = */ ci =3D swap_cluster_lock(si, start); @@ -983,6 +1054,8 @@ static bool __swap_cluster_alloc_entries(struct swap_i= nfo_struct *si, unsigned int order) { unsigned long nr_pages =3D 1 << order; + swp_entry_t vswap_entry, v; + unsigned int i; =20 lockdep_assert_held(&ci->lock); =20 @@ -991,11 +1064,24 @@ static bool __swap_cluster_alloc_entries(struct swap= _info_struct *si, =20 swap_cluster_assert_empty(ci, ci_off, nr_pages, false); =20 - if (swp_tb_is_folio(swp_tb)) + if (swp_tb_is_folio(swp_tb)) { __swap_cache_add_folio(ci, folio, swp_entry(si->type, ci_off + cluster_offset(si, ci))); - else + } else if (swp_tb_is_pointer(swp_tb) && nr_pages > 1) { + /* + * Pointer-tagged rmap for vswap-backing THP =E2=80=94 each + * physical slot points back to its own vswap entry. 
+ */ + vswap_entry =3D folio->swap; + for (i =3D 0; i < nr_pages; i++) { + v =3D vswap_entry; + v.val +=3D i; + __swap_table_set(ci, ci_off + i, + swp_entry_to_swp_tb_ptr(v)); + } + } else { __swap_table_set(ci, ci_off, swp_tb); + } =20 /* * The first allocation in a cluster makes the @@ -1167,6 +1253,13 @@ static void swap_reclaim_full_clusters(struct swap_i= nfo_struct *si, bool force) offset +=3D abs(nr_reclaim); continue; } + } else if (swp_tb_is_pointer(swp_tb) && + swap_rmap_is_cache_only(ci, offset % SWAPFILE_CLUSTER)) { + spin_unlock(&ci->lock); + try_to_reclaim_vswap_backing(si, offset); + ci =3D swap_cluster_lock(si, offset); + if (!ci) + goto next; } offset++; } @@ -1507,7 +1600,14 @@ static swp_entry_t swap_alloc_fast(struct folio *fol= io) if (!si || !offset || !get_swap_device_info(si)) return (swp_entry_t){}; =20 - swp_tb =3D folio_to_swp_tb(folio, 0); + /* + * Folio already in swap cache: allocating physical backing for a + * vswap entry (folio_realloc_swap). + */ + if (folio_test_swapcache(folio)) + swp_tb =3D swp_entry_to_swp_tb_ptr(folio->swap); + else + swp_tb =3D folio_to_swp_tb(folio, 0); =20 ci =3D swap_cluster_lock(si, offset); if (ci && cluster_is_usable(ci, order)) { @@ -1530,7 +1630,11 @@ static swp_entry_t swap_alloc_slow(struct folio *fol= io) struct swap_info_struct *si, *next; unsigned long swp_tb, found; =20 - swp_tb =3D folio_to_swp_tb(folio, 0); + /* See comment in swap_alloc_fast() */ + if (folio_test_swapcache(folio)) + swp_tb =3D swp_entry_to_swp_tb_ptr(folio->swap); + else + swp_tb =3D folio_to_swp_tb(folio, 0); =20 spin_lock(&swap_avail_lock); start_over: @@ -1722,6 +1826,8 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, } /* count will be 0 after put, slot can be reclaimed */ need_reclaim =3D true; + if (swap_is_vswap(si)) + vswap_mark_cache_only(si, ci, ci_off); } /* * A count !=3D 1 or cached slot can't be freed. 
Put its swap @@ -1922,12 +2028,7 @@ int folio_alloc_swap(struct folio *folio) } } =20 - /* - * Skip vswap when zswap is disabled =E2=80=94 without zswap, vswap entri= es - * have nowhere to go on writeout (no physical fallback yet; that - * arrives in the next patch). - */ - if (zswap_is_enabled() && vswap_alloc(folio)) + if (vswap_alloc(folio)) goto done; =20 again: @@ -1953,6 +2054,25 @@ int folio_alloc_swap(struct folio *folio) } =20 #ifdef CONFIG_VSWAP +static void vswap_mark_cache_only(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned int ci_off) +{ + struct swap_cluster_info_dynamic *ci_dyn; + struct swap_cluster_info *pci; + swp_entry_t phys; + unsigned long vt; + + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + vt =3D __vtable_get(ci_dyn, ci_off); + + if (vtable_type(vt) =3D=3D VSWAP_SWAPFILE) { + phys =3D vtable_to_phys(vt); + pci =3D __swap_entry_to_cluster(phys); + swap_rmap_mark_cache_only(pci, swp_cluster_offset(phys)); + } +} + static void vswap_free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { @@ -1971,12 +2091,21 @@ static void vswap_free_cluster(struct swap_info_str= uct *si, kfree_rcu(ci_dyn, rcu); } =20 +static void __swap_cluster_free_phys_backing(struct swap_info_struct *psi, + struct swap_cluster_info *pci, + unsigned int ci_start, + unsigned int nr_pages); + void vswap_release_backing(struct swap_cluster_info *ci, unsigned int ci_start, unsigned int nr) { struct swap_cluster_info_dynamic *ci_dyn; + struct swap_info_struct *psi; + unsigned long phys_start =3D 0, phys_end =3D 0; + unsigned int phys_type =3D 0; unsigned int ci_off; unsigned long vt; + swp_entry_t phys; =20 lockdep_assert_held(&ci->lock); ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); @@ -1984,12 +2113,41 @@ void vswap_release_backing(struct swap_cluster_info= *ci, for (ci_off =3D ci_start; ci_off < ci_start + nr; ci_off++) { vt =3D __vtable_get(ci_dyn, ci_off); =20 + /* + * Flush batched 
physical slots when the next entry + * breaks contiguity, changes type/device, or would + * cross a SWAPFILE_CLUSTER boundary (the free helper + * operates on a single cluster). + */ + if (phys_start !=3D phys_end && + (vtable_type(vt) !=3D VSWAP_SWAPFILE || + swp_type(vtable_to_phys(vt)) !=3D phys_type || + swp_offset(vtable_to_phys(vt)) !=3D phys_end || + phys_end % SWAPFILE_CLUSTER =3D=3D 0)) { + psi =3D __swap_type_to_info(phys_type); + __swap_cluster_free_phys_backing(psi, + __swap_entry_to_cluster( + swp_entry(phys_type, phys_start)), + phys_start % SWAPFILE_CLUSTER, + phys_end - phys_start); + phys_start =3D phys_end =3D 0; + } + switch (vtable_type(vt)) { + case VSWAP_SWAPFILE: + if (!phys_start) { + phys =3D vtable_to_phys(vt); + phys_start =3D swp_offset(phys); + phys_end =3D phys_start + 1; + phys_type =3D swp_type(phys); + } else { + phys_end++; + } + break; case VSWAP_ZSWAP: if (vtable_to_zswap(vt)) zswap_entry_free(vtable_to_zswap(vt)); break; - case VSWAP_SWAPFILE: case VSWAP_FOLIO: case VSWAP_ZERO: case VSWAP_NONE: @@ -1998,6 +2156,15 @@ void vswap_release_backing(struct swap_cluster_info = *ci, =20 __vtable_set(ci_dyn, ci_off, vtable_mk_none()); } + + if (phys_start !=3D phys_end) { + psi =3D __swap_type_to_info(phys_type); + __swap_cluster_free_phys_backing(psi, + __swap_entry_to_cluster( + swp_entry(phys_type, phys_start)), + phys_start % SWAPFILE_CLUSTER, + phys_end - phys_start); + } } =20 void vswap_store_folio(swp_entry_t entry, struct folio *folio) @@ -2050,6 +2217,54 @@ void vswap_prepare_writeout(swp_entry_t entry, struc= t folio *folio) spin_unlock(&ci->lock); } =20 +swp_entry_t folio_realloc_swap(struct folio *folio) +{ + swp_entry_t vswap_entry =3D folio->swap; + struct swap_cluster_info *ci; + struct swap_cluster_info_dynamic *ci_dyn; + unsigned int voff; + swp_entry_t phys_entry =3D {}; + swp_entry_t pe; + int i, nr =3D folio_nr_pages(folio); + + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + 
VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); + VM_WARN_ON(!swap_is_vswap(__swap_entry_to_info(vswap_entry))); + + phys_entry =3D vswap_to_phys(vswap_entry); + if (phys_entry.val) + return phys_entry; + + local_lock(&percpu_swap_cluster.lock); + phys_entry =3D swap_alloc_fast(folio); + if (!phys_entry.val) + phys_entry =3D swap_alloc_slow(folio); + local_unlock(&percpu_swap_cluster.lock); + + if (!phys_entry.val) + return (swp_entry_t){}; + + voff =3D swp_cluster_offset(vswap_entry); + + ci =3D __swap_entry_to_cluster(vswap_entry); + ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + spin_lock(&ci->lock); + /* + * Install PHYS backing without freeing any prior contents of the + * vtable. The caller is responsible for any cleanup of the prior + * backing =E2=80=94 for example, zswap_writeback_entry calls in with the + * slot still pointing at the loaded zswap_entry (which it uses + * for decompress before zswap_entry_free), and swap_writeout + * calls vswap_prepare_writeout first to drop partial ZSWAP state. + */ + for (i =3D 0; i < nr; i++) { + pe.val =3D phys_entry.val + i; + __vtable_set(ci_dyn, voff + i, vtable_mk_phys(pe)); + } + spin_unlock(&ci->lock); + + return phys_entry; +} #endif /* CONFIG_VSWAP */ =20 /** @@ -2181,6 +2396,70 @@ struct swap_info_struct *get_swap_device(swp_entry_t= entry) * Free a set of swap slots after their swap count dropped to zero, or wil= l be * zero after putting the last ref (saves one __swap_cluster_put_entry cal= l). */ +#ifdef CONFIG_VSWAP +/* + * Clear swap table entries to NULL and reset zero flags. + * Does not touch memcg or count =E2=80=94 caller handles those. 
+ */ +static void __swap_cluster_clear_table(struct swap_cluster_info *ci, + unsigned int ci_start, + unsigned int nr_pages) +{ + unsigned int ci_off; + + lockdep_assert_held(&ci->lock); + for (ci_off =3D ci_start; ci_off < ci_start + nr_pages; ci_off++) { + __swap_table_set(ci, ci_off, null_to_swp_tb()); + if (!SWAP_TABLE_HAS_ZEROFLAG) + __swap_table_clear_zero(ci, ci_off); + } +} +#endif + +/* + * Common tail for freeing swap slots: device-level accounting + * and cluster list management. + */ +static void __swap_cluster_finish_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned int ci_start, + unsigned int nr_pages) +{ + lockdep_assert_held(&ci->lock); + swap_range_free(si, cluster_offset(si, ci) + ci_start, nr_pages); + swap_cluster_assert_empty(ci, ci_start, nr_pages, false); + + if (!ci->count) + free_cluster(si, ci); + else + partial_free_cluster(si, ci); +} + +#ifdef CONFIG_VSWAP +/* + * Free physical swap slots that were backing vswap entries (Pointer-tagge= d). + * Clears the physical swap table, decrements cluster count, and does + * device-level accounting. Called from vswap_release_backing. + */ +static void __swap_cluster_free_phys_backing(struct swap_info_struct *psi, + struct swap_cluster_info *pci, + unsigned int ci_start, + unsigned int nr_pages) +{ + /* + * Caller holds the vswap cluster lock (asserted in + * vswap_release_backing). Nest the physical cluster lock under it + * =E2=80=94 same lockdep class, so use SINGLE_DEPTH_NESTING to silence + * PROVE_LOCKING. 
+ */ + spin_lock_nested(&pci->lock, SINGLE_DEPTH_NESTING); + VM_WARN_ON(pci->count < nr_pages); + pci->count -=3D nr_pages; + __swap_cluster_clear_table(pci, ci_start, nr_pages); + __swap_cluster_finish_free(psi, pci, ci_start, nr_pages); + swap_cluster_unlock(pci); +} +#endif void __swap_cluster_free_entries(struct swap_info_struct *si, struct swap_cluster_info *ci, unsigned int ci_start, unsigned int nr_pages) @@ -2188,7 +2467,6 @@ void __swap_cluster_free_entries(struct swap_info_str= uct *si, unsigned long old_tb; unsigned short batch_id =3D 0, id_cur; unsigned int ci_off =3D ci_start, ci_end =3D ci_start + nr_pages; - unsigned long ci_head =3D cluster_offset(si, ci); unsigned int batch_off =3D ci_off; =20 VM_WARN_ON(ci->count < nr_pages); @@ -2226,13 +2504,7 @@ void __swap_cluster_free_entries(struct swap_info_st= ruct *si, if (batch_id) mem_cgroup_uncharge_swap(batch_id, ci_off - batch_off); =20 - swap_range_free(si, ci_head + ci_start, nr_pages); - swap_cluster_assert_empty(ci, ci_start, nr_pages, false); - - if (!ci->count) - free_cluster(si, ci); - else - partial_free_cluster(si, ci); + __swap_cluster_finish_free(si, ci, ci_start, nr_pages); } =20 int __swap_count(swp_entry_t entry) @@ -3070,19 +3342,85 @@ static unsigned int find_next_to_unuse(struct swap_= info_struct *si, =20 static int try_to_unuse(unsigned int type) { + struct swap_cluster_info *vci; + struct mempolicy mpol =3D { .mode =3D MPOL_DEFAULT }; struct mm_struct *prev_mm; struct mm_struct *mm; struct list_head *p; int retval =3D 0; struct swap_info_struct *si =3D swap_info[type]; struct folio *folio; - swp_entry_t entry; - unsigned int i; + swp_entry_t entry, vswap_entry; + unsigned long swp_tb; + unsigned int i, ci_off; =20 if (!swap_usage_in_pages(si)) goto success; =20 retry: + /* + * Free vswap-backing slots (Pointer-tagged) first. Walk physical + * clusters, read the vswap entry from the rmap, ensure the data + * is in the swap cache, and transition PHYS=E2=86=92FOLIO. 
No page table + * walk needed =E2=80=94 just free the physical backing. + */ + i =3D 0; + while (IS_ENABLED(CONFIG_VSWAP) && + swap_usage_in_pages(si) && + !signal_pending(current) && + (i =3D find_next_to_unuse(si, i)) !=3D 0) { + swp_entry_t phys; + + vci =3D __swap_offset_to_cluster(si, i); + if (!vci) + continue; + ci_off =3D i % SWAPFILE_CLUSTER; + + spin_lock(&vci->lock); + swp_tb =3D __swap_table_get(vci, ci_off); + spin_unlock(&vci->lock); + + if (!swp_tb_is_pointer(swp_tb)) + continue; + + vswap_entry =3D swp_tb_ptr_to_swp_entry(swp_tb); + + folio =3D swap_cache_get_folio(vswap_entry); + if (!folio) { + folio =3D swap_cache_alloc_folio(vswap_entry, + GFP_KERNEL, BIT(0), NULL, + &mpol, NO_INTERLEAVE_INDEX); + if (IS_ERR_OR_NULL(folio)) + continue; + swap_read_folio(folio, NULL); + folio_lock(folio); + } else { + folio_lock(folio); + } + + if (!folio_matches_swap_entry(folio, vswap_entry)) { + folio_unlock(folio); + folio_put(folio); + continue; + } + + phys =3D vswap_to_phys(vswap_entry); + if (!phys.val || swp_type(phys) !=3D type) { + folio_unlock(folio); + folio_put(folio); + continue; + } + + folio_wait_writeback(folio); + vswap_store_folio(vswap_entry, folio); + folio_mark_dirty(folio); + folio_unlock(folio); + folio_put(folio); + } + + if (!swap_usage_in_pages(si)) + goto success; + retval =3D shmem_unuse(type); if (retval) return retval; @@ -3126,6 +3464,14 @@ static int try_to_unuse(unsigned int type) =20 entry =3D swp_entry(type, i); =20 + if (IS_ENABLED(CONFIG_VSWAP)) { + swp_tb =3D swap_table_get( + __swap_offset_to_cluster(si, i), + i % SWAPFILE_CLUSTER); + if (swp_tb_is_pointer(swp_tb)) + continue; + } + folio =3D swap_cache_get_folio(entry); if (!folio) continue; diff --git a/mm/vswap.h b/mm/vswap.h index 5e6e5b88593c..a3a84e27f819 100644 --- a/mm/vswap.h +++ b/mm/vswap.h @@ -24,6 +24,40 @@ static inline bool swap_is_vswap(struct swap_info_struct= *si) =20 extern struct swap_info_struct *vswap_si; =20 +/* Rmap cache-only helpers for physical 
cluster Pointer-tagged entries */ + +static inline void swap_rmap_mark_cache_only(struct swap_cluster_info *ci, + unsigned int off) +{ + atomic_long_t *table; + + table =3D rcu_dereference_check(ci->table, true); + atomic_long_or(SWP_RMAP_CACHE_ONLY, &table[off]); +} + +static inline void swap_rmap_clear_cache_only(struct swap_cluster_info *ci, + unsigned int off) +{ + atomic_long_t *table; + + table =3D rcu_dereference_check(ci->table, true); + atomic_long_and(~SWP_RMAP_CACHE_ONLY, &table[off]); +} + +static inline bool swap_rmap_is_cache_only(struct swap_cluster_info *ci, + unsigned int off) +{ + atomic_long_t *table; + bool ret; + + VM_WARN_ON_ONCE(off >=3D SWAPFILE_CLUSTER); + rcu_read_lock(); + table =3D rcu_dereference(ci->table); + ret =3D table && (atomic_long_read(&table[off]) & SWP_RMAP_CACHE_ONLY); + rcu_read_unlock(); + return ret; +} + /* * Virtual table entry encoding for vswap clusters. * @@ -73,6 +107,20 @@ static inline unsigned long vtable_mk_none(void) return 0; } =20 +static inline unsigned long vtable_mk_phys(swp_entry_t entry) +{ + return vtable_mk(VSWAP_SWAPFILE, entry.val); +} + +static inline swp_entry_t vtable_to_phys(unsigned long vt) +{ + swp_entry_t entry; + + VM_WARN_ON(vtable_type(vt) !=3D VSWAP_SWAPFILE); + entry.val =3D vtable_payload(vt); + return entry; +} + static inline unsigned long vtable_mk_zero(void) { return VSWAP_ZERO; @@ -136,6 +184,27 @@ vswap_lock_cluster(swp_entry_t entry, unsigned int *vo= ff) return ci_dyn; } =20 +/* High-level vswap lookup */ + +static inline swp_entry_t vswap_to_phys(swp_entry_t entry) +{ + struct swap_cluster_info_dynamic *ci_dyn; + unsigned int voff; + unsigned long vt; + + ci_dyn =3D vswap_lock_cluster(entry, &voff); + if (!ci_dyn) + return (swp_entry_t){}; + + vt =3D __vtable_get(ci_dyn, voff); + spin_unlock(&ci_dyn->ci.lock); + + if (vtable_type(vt) !=3D VSWAP_SWAPFILE) + return (swp_entry_t){}; + + return vtable_to_phys(vt); +} + /* Zswap entry helpers =E2=80=94 store/load/erase in 
virtual_table */ =20 void vswap_release_backing(struct swap_cluster_info *ci, @@ -188,6 +257,7 @@ static inline int vswap_check_backing(swp_entry_t entry= , int nr, enum vswap_backing_type first_type; unsigned int voff; unsigned long vt; + swp_entry_t first_phys; int i; =20 ci_dyn =3D vswap_lock_cluster(entry, &voff); @@ -196,10 +266,16 @@ static inline int vswap_check_backing(swp_entry_t ent= ry, int nr, =20 for (i =3D 0; i < nr; i++) { vt =3D __vtable_get(ci_dyn, voff + i); - if (!i) + if (!i) { first_type =3D vtable_type(vt); - else if (vtable_type(vt) !=3D first_type) + if (first_type =3D=3D VSWAP_SWAPFILE) + first_phys =3D vtable_to_phys(vt); + } else if (vtable_type(vt) !=3D first_type) { break; + } else if (first_type =3D=3D VSWAP_SWAPFILE && + vtable_to_phys(vt).val !=3D first_phys.val + i) { + break; + } } spin_unlock(&ci_dyn->ci.lock); =20 @@ -208,12 +284,20 @@ static inline int vswap_check_backing(swp_entry_t ent= ry, int nr, return i; } =20 +static inline bool vswap_swapfile_backed(swp_entry_t entry, int nr) +{ + enum vswap_backing_type type; + + return vswap_check_backing(entry, nr, &type) =3D=3D nr && + type =3D=3D VSWAP_SWAPFILE; +} + static inline bool vswap_can_swapin_thp(swp_entry_t entry, int nr) { enum vswap_backing_type type; =20 return vswap_check_backing(entry, nr, &type) =3D=3D nr && - type =3D=3D VSWAP_ZERO; + (type =3D=3D VSWAP_ZERO || type =3D=3D VSWAP_SWAPFILE); } =20 static inline int vswap_cluster_alloc_vtable(struct swap_cluster_info_dyna= mic *ci_dyn) @@ -266,6 +350,22 @@ static inline void vswap_set_zero(struct swap_cluster_= info *ci, =20 #else /* !CONFIG_VSWAP */ =20 +static inline swp_entry_t vswap_to_phys(swp_entry_t entry) +{ + return (swp_entry_t){}; +} + +static inline bool vswap_swapfile_backed(swp_entry_t entry, int nr) +{ + return false; +} + +static inline bool swap_rmap_is_cache_only(struct swap_cluster_info *ci, + unsigned int off) +{ + return false; +} + static inline void vswap_release_backing(struct swap_cluster_info 
*ci, unsigned int ci_start, unsigned int nr) {} @@ -310,4 +410,36 @@ static inline void vswap_set_zero(struct swap_cluster_= info *ci, unsigned int ci_off) {} =20 #endif /* CONFIG_VSWAP */ + +/* + * Test a per-backend swap flag (SWP_SYNCHRONOUS_IO, SWP_STABLE_WRITES, ..= .) + * for @entry. For a vswap entry the property belongs to the current + * physical backing, not vswap_si =E2=80=94 resolve and test that. Returns= false + * for zswap/zero/unbacked vswap entries: they don't go through bdev IO, + * so per-bdev flags don't apply. + */ +static inline bool swap_entry_backend_has_flag(struct swap_info_struct *si, + swp_entry_t entry, + unsigned long flag) +{ + struct swap_info_struct *phys_si; + swp_entry_t phys; + bool has_flag; + + if (!swap_is_vswap(si)) + return data_race(si->flags & flag); + + phys =3D vswap_to_phys(entry); + if (!phys.val) + return false; + + phys_si =3D get_swap_device(phys); + if (!phys_si) + return false; + + has_flag =3D data_race(phys_si->flags & flag); + put_swap_device(phys_si); + return has_flag; +} + #endif /* _MM_VSWAP_H */ diff --git a/mm/zswap.c b/mm/zswap.c index c57bf0246bb2..85622af0df5c 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -993,6 +993,7 @@ static int zswap_writeback_entry(struct zswap_entry *en= try, struct folio *folio; struct mempolicy *mpol; struct swap_info_struct *si; + swp_entry_t phys =3D {}; int ret =3D 0; =20 /* try to allocate swap cache folio */ @@ -1000,16 +1001,6 @@ static int zswap_writeback_entry(struct zswap_entry = *entry, if (!si) return -EEXIST; =20 - /* - * Vswap entries have no physical backing =E2=80=94 writeback would fail - * and SIGBUS the caller. Bail before we waste a swap-cache folio - * allocation. 
- */ - if (si->flags & SWP_VSWAP) { - put_swap_device(si); - return -EINVAL; - } - mpol =3D get_task_policy(current); folio =3D swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol, NO_INTERLEAVE_INDEX); @@ -1028,31 +1019,57 @@ static int zswap_writeback_entry(struct zswap_entry= *entry, /* * folio is locked, and the swapcache is now secured against * concurrent swapping to and from the slot, and concurrent - * swapoff so we can safely dereference the zswap tree here. - * Verify that the swap entry hasn't been invalidated and recycled - * behind our backs, to avoid overwriting a new swap folio with - * old compressed data. Only when this is successful can the entry - * be dereferenced. + * swapoff so we can safely dereference the zswap tree (or vswap + * vtable) here. Verify that the swap entry hasn't been + * invalidated and recycled behind our backs, to avoid overwriting + * a new swap folio with old compressed data. Only when this is + * successful can the entry be dereferenced. */ - tree =3D swap_zswap_tree(swpentry); - if (entry !=3D xa_load(tree, offset)) { - ret =3D -ENOMEM; - goto out; + if (swap_is_vswap(si)) { + if (entry !=3D vswap_zswap_load(swpentry)) { + ret =3D -ENOMEM; + goto out; + } + /* + * Allocate physical backing BEFORE decompress =E2=80=94 if it fails, + * no wasted work. folio_realloc_swap sets vtable to PHYS, + * overwriting ZSWAP =E2=80=94 the old entry pointer is only held + * by the caller now. + */ + phys =3D folio_realloc_swap(folio); + if (!phys.val) { + ret =3D -ENOMEM; + goto out; + } + } else { + tree =3D swap_zswap_tree(swpentry); + if (entry !=3D xa_load(tree, offset)) { + ret =3D -ENOMEM; + goto out; + } } =20 if (!zswap_decompress(entry, folio)) { ret =3D -EIO; + /* + * For vswap: folio_realloc_swap already moved the entry + * out of the vtable. Restore it via vswap_zswap_store so + * the entry stays tracked (and the just-allocated PHYS + * slot is freed). For non-vswap: entry is still in the + * zswap tree. 
+ */ + if (swap_is_vswap(si) && phys.val) + vswap_zswap_store(swpentry, entry); goto out; } =20 - xa_erase(tree, offset); + if (!swap_is_vswap(si)) + xa_erase(tree, offset); =20 count_vm_event(ZSWPWB); if (entry->objcg) count_objcg_events(entry->objcg, ZSWPWB, 1); =20 - zswap_entry_free(entry); - /* folio is up to date */ folio_mark_uptodate(folio); =20 @@ -1060,8 +1077,22 @@ static int zswap_writeback_entry(struct zswap_entry = *entry, folio_set_reclaim(folio); =20 /* start writeback */ - ret =3D __swap_writepage(folio, NULL); - WARN_ON_ONCE(ret); + if (swap_is_vswap(si)) { + ret =3D __swap_writepage_phys(folio, NULL, phys); + WARN_ON_ONCE(ret); + } else { + ret =3D __swap_writepage(folio, NULL); + WARN_ON_ONCE(ret); + } + + /* + * __swap_writepage{,_phys} always returns 0 today =E2=80=94 async IO + * errors surface in the bio end_io callback, not synchronously + * here. Either way, the entry has been moved out of its prior + * location (vtable PHYS for vswap, removed from tree for not), + * so we own the free. 
+ */ + zswap_entry_free(entry); =20 out: if (ret) { --=20 2.53.0-Meta
From nobody Mon Jun 8 14:35:25 2026
From: Nhat Pham
To: kasong@tencent.com
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org, apopple@nvidia.com, axelrasmussen@google.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, bhe@redhat.com, byungchul@sk.com, cgroups@vger.kernel.org, chengming.zhou@linux.dev, chrisl@kernel.org, corbet@lwn.net, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jannh@google.com, joshua.hahnjy@gmail.com, lance.yang@linux.dev, lenb@kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-pm@vger.kernel.org, lorenzo.stoakes@oracle.com, matthew.brost@intel.com, mhocko@suse.com, muchun.song@linux.dev, npache@redhat.com, nphamcs@gmail.com, pavel@kernel.org, peterx@redhat.com, peterz@infradead.org, pfalcato@suse.de, rafael@kernel.org, rakie.kim@sk.com, roman.gushchin@linux.dev, rppt@kernel.org, ryan.roberts@arm.com, shakeel.butt@linux.dev, shikemeng@huaweicloud.com, surenb@google.com, tglx@kernel.org, vbabka@suse.cz, weixugc@google.com, ying.huang@linux.alibaba.com, yosry.ahmed@linux.dev, yuanchu@google.com, zhengqi.arch@bytedance.com, ziy@nvidia.com, kernel-team@meta.com, riel@surriel.com, haowenchao22@gmail.com
Subject: [RFC PATCH 4/5] mm, swap: only charge physical swap entries
Date: Thu, 28 May 2026 14:29:28 -0700
Message-ID: <20260528212955.1912856-5-nphamcs@gmail.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260528212955.1912856-1-nphamcs@gmail.com>
References: <20260528212955.1912856-1-nphamcs@gmail.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Stop double-charging vswap entries against memcg->swap. Previously, the entry was charged once at vswap allocation (via mem_cgroup_try_charge_swap) and implicitly again when physical backing was allocated. Split the lifecycle into four operations: record the memcg private ID at vswap alloc without charging; charge memcg->swap only when physical backing is allocated via folio_realloc_swap; uncharge in vswap_release_backing (only nr_swapfile entries on v2, all nr on v1 memsw); and drop the ID ref at __swap_cluster_free_entries without uncharging. Direct-mapped physical swap charging is unchanged. Signed-off-by: Nhat Pham --- include/linux/swap.h | 57 +++++++++++++++++++++ mm/memcontrol.c | 118 +++++++++++++++++++++++++++++++++++++++++++ mm/swapfile.c | 109 ++++++++++++++++++++++++++++++++++++--- 3 files changed, 276 insertions(+), 8 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 3fb55485fc76..6f18ecdf0bb8 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -597,6 +597,43 @@ static inline int mem_cgroup_try_charge_swap(struct fo= lio *folio) return __mem_cgroup_try_charge_swap(folio); } =20 +extern void __mem_cgroup_record_swap(struct folio *folio); +static inline void mem_cgroup_record_swap(struct folio *folio) +{ + if (mem_cgroup_disabled()) + return; + __mem_cgroup_record_swap(folio); +} + +extern int __mem_cgroup_charge_backing_phys_swap(struct mem_cgroup *memcg, + unsigned int nr_pages); +static inline int mem_cgroup_charge_backing_phys_swap(struct mem_cgroup *m= emcg, + unsigned int nr_pages) +{ + if (mem_cgroup_disabled()) + return 0; + return __mem_cgroup_charge_backing_phys_swap(memcg, nr_pages); +} + +extern void __mem_cgroup_uncharge_backing_phys_swap(struct mem_cgroup *mem= cg, + unsigned int nr_pages); +static inline void 
mem_cgroup_uncharge_backing_phys_swap(struct mem_cgroup= *memcg, + unsigned int nr_pages) +{ + if (mem_cgroup_disabled()) + return; + __mem_cgroup_uncharge_backing_phys_swap(memcg, nr_pages); +} + +extern void __mem_cgroup_id_put_swap(unsigned short id, unsigned int nr_pa= ges); +static inline void mem_cgroup_id_put_swap(unsigned short id, + unsigned int nr_pages) +{ + if (mem_cgroup_disabled()) + return; + __mem_cgroup_id_put_swap(id, nr_pages); +} + extern void __mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_= pages); static inline void mem_cgroup_uncharge_swap(unsigned short id, unsigned in= t nr_pages) { @@ -613,6 +650,26 @@ static inline int mem_cgroup_try_charge_swap(struct fo= lio *folio) return 0; } =20 +static inline void mem_cgroup_record_swap(struct folio *folio) +{ +} + +static inline int mem_cgroup_charge_backing_phys_swap(struct mem_cgroup *m= emcg, + unsigned int nr_pages) +{ + return 0; +} + +static inline void mem_cgroup_uncharge_backing_phys_swap(struct mem_cgroup= *memcg, + unsigned int nr_pages) +{ +} + +static inline void mem_cgroup_id_put_swap(unsigned short id, + unsigned int nr_pages) +{ +} + static inline void mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7492879b3239..91618da7ec20 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5513,6 +5513,124 @@ int __mem_cgroup_try_charge_swap(struct folio *foli= o) return 0; } =20 +/** + * __mem_cgroup_record_swap - record memcg for swap without charging + * @folio: folio being added to swap + * + * Pin the memcg private ID ref and record it in the swap cgroup table, + * but do not charge memcg->swap. Used for vswap entries where the charge + * is deferred until physical backing is allocated. 
+ */ +void __mem_cgroup_record_swap(struct folio *folio) +{ + unsigned int nr_pages =3D folio_nr_pages(folio); + struct swap_cluster_info *ci; + struct mem_cgroup *memcg; + struct obj_cgroup *objcg; + + if (do_memsw_account()) + return; + + objcg =3D folio_objcg(folio); + if (!objcg) + return; + + rcu_read_lock(); + memcg =3D obj_cgroup_memcg(objcg); + if (!folio_test_swapcache(folio)) { + rcu_read_unlock(); + return; + } + + memcg =3D mem_cgroup_private_id_get_online(memcg, nr_pages); + rcu_read_unlock(); + + ci =3D swap_cluster_get_and_lock(folio); + __swap_cgroup_set(ci, swp_cluster_offset(folio->swap), nr_pages, + mem_cgroup_private_id(memcg)); + swap_cluster_unlock(ci); +} + +/** + * __mem_cgroup_charge_backing_phys_swap - charge memcg->swap counter only + * @memcg: the mem_cgroup to charge (may be NULL) + * @nr_pages: number of physical swap pages to charge + * + * Unlike __mem_cgroup_try_charge_swap(), this does NOT touch the memcg + * private ID refcount =E2=80=94 the ID ref was pinned earlier by + * __mem_cgroup_record_swap() at vswap allocation time and lives for the + * lifetime of the vswap entry. This helper only updates the swap counter + * when a vswap entry transitions to physical backing (folio_realloc_swap), + * so the counter and the ID ref can be managed independently. + * + * The caller resolves the memcg (typically via folio_memcg + ID + * comparison to avoid IDR lookups on the hot path). + * + * Returns 0 on success, -ENOMEM on failure. 
+ */ +int __mem_cgroup_charge_backing_phys_swap(struct mem_cgroup *memcg, + unsigned int nr_pages) +{ + struct page_counter *counter; + + if (do_memsw_account()) + return 0; + if (!memcg) + return 0; + + if (!mem_cgroup_is_root(memcg) && + !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) { + memcg_memory_event(memcg, MEMCG_SWAP_MAX); + memcg_memory_event(memcg, MEMCG_SWAP_FAIL); + return -ENOMEM; + } + mod_memcg_state(memcg, MEMCG_SWAP, nr_pages); + return 0; +} + +/** + * __mem_cgroup_uncharge_backing_phys_swap - uncharge memcg->swap counter = only + * @memcg: the mem_cgroup to uncharge (may be NULL) + * @nr_pages: number of physical swap pages to uncharge + * + * Unlike __mem_cgroup_uncharge_swap(), this does NOT drop the memcg + * private ID refcount =E2=80=94 that ref is dropped separately via + * __mem_cgroup_id_put_swap() when the vswap entry itself is freed. + * This helper only updates the swap counter when physical backing is + * released (vswap_release_backing), so the counter and ID ref can be + * managed independently. 
+ */ +void __mem_cgroup_uncharge_backing_phys_swap(struct mem_cgroup *memcg, + unsigned int nr_pages) +{ + if (!memcg) + return; + + if (!mem_cgroup_is_root(memcg)) { + if (do_memsw_account()) + page_counter_uncharge(&memcg->memsw, nr_pages); + else + page_counter_uncharge(&memcg->swap, nr_pages); + } + mod_memcg_state(memcg, MEMCG_SWAP, -nr_pages); +} + +/** + * __mem_cgroup_id_put_swap - drop memcg private ID ref without uncharging + * @id: cgroup private id + * @nr_pages: number of refs to drop + */ +void __mem_cgroup_id_put_swap(unsigned short id, unsigned int nr_pages) +{ + struct mem_cgroup *memcg; + + rcu_read_lock(); + memcg =3D mem_cgroup_from_private_id(id); + if (memcg) + mem_cgroup_private_id_put(memcg, nr_pages); + rcu_read_unlock(); +} + /** * __mem_cgroup_uncharge_swap - uncharge swap space * @id: cgroup id to uncharge diff --git a/mm/swapfile.c b/mm/swapfile.c index a0976be6a12b..be901fb741e5 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -33,6 +33,7 @@ #include #include #include +#include "memcontrol-v1.h" #include #include #include @@ -2043,8 +2044,15 @@ int folio_alloc_swap(struct folio *folio) goto again; } =20 - /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */ - if (unlikely(mem_cgroup_try_charge_swap(folio))) + /* + * Vswap entries: record memcg ID without charging =E2=80=94 the charge is + * deferred to folio_realloc_swap when physical backing is allocated. + * Direct-mapped physical swap entries: charge immediately as today. 
+ */ + if (folio_test_swapcache(folio) && + swap_is_vswap(__swap_entry_to_info(folio->swap))) + mem_cgroup_record_swap(folio); + else if (unlikely(mem_cgroup_try_charge_swap(folio))) swap_cache_del_folio(folio); =20 if (unlikely(!folio_test_swapcache(folio))) @@ -2096,6 +2104,26 @@ static void __swap_cluster_free_phys_backing(struct = swap_info_struct *psi, unsigned int ci_start, unsigned int nr_pages); =20 +static void vswap_uncharge_cgroup_batch(unsigned short memcg_id, + unsigned int batch_nr, + unsigned int batch_nr_swapfile) +{ + struct mem_cgroup *memcg; + unsigned int n; + + if (do_memsw_account()) + n =3D batch_nr; + else + n =3D batch_nr_swapfile; + if (!n) + return; + + rcu_read_lock(); + memcg =3D memcg_id ? mem_cgroup_from_private_id(memcg_id) : NULL; + rcu_read_unlock(); + mem_cgroup_uncharge_backing_phys_swap(memcg, n); +} + void vswap_release_backing(struct swap_cluster_info *ci, unsigned int ci_start, unsigned int nr) { @@ -2106,12 +2134,36 @@ void vswap_release_backing(struct swap_cluster_info= *ci, unsigned int ci_off; unsigned long vt; swp_entry_t phys; + /* + * Per-cgroup uncharge batching: a single vswap_release_backing + * call can span multiple cgroups (e.g. batched free across + * folios), so we cannot uncharge with the first slot's memcg + * for the whole range. + */ + unsigned short batch_id; + unsigned int batch_nr =3D 0, batch_nr_swapfile =3D 0; =20 lockdep_assert_held(&ci->lock); ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + batch_id =3D __swap_cgroup_get(ci, ci_start); =20 for (ci_off =3D ci_start; ci_off < ci_start + nr; ci_off++) { + unsigned short cur_id; + vt =3D __vtable_get(ci_dyn, ci_off); + cur_id =3D __swap_cgroup_get(ci, ci_off); + + /* + * Flush per-cgroup uncharge when crossing a cgroup boundary. 
+ */ + if (cur_id !=3D batch_id) { + vswap_uncharge_cgroup_batch(batch_id, batch_nr, + batch_nr_swapfile); + batch_id =3D cur_id; + batch_nr =3D 0; + batch_nr_swapfile =3D 0; + } + batch_nr++; =20 /* * Flush batched physical slots when the next entry @@ -2135,6 +2187,7 @@ void vswap_release_backing(struct swap_cluster_info *= ci, =20 switch (vtable_type(vt)) { case VSWAP_SWAPFILE: + batch_nr_swapfile++; if (!phys_start) { phys =3D vtable_to_phys(vt); phys_start =3D swp_offset(phys); @@ -2165,6 +2218,9 @@ void vswap_release_backing(struct swap_cluster_info *= ci, phys_start % SWAPFILE_CLUSTER, phys_end - phys_start); } + + /* Final cgroup-batch flush. */ + vswap_uncharge_cgroup_batch(batch_id, batch_nr, batch_nr_swapfile); } =20 void vswap_store_folio(swp_entry_t entry, struct folio *folio) @@ -2222,7 +2278,9 @@ swp_entry_t folio_realloc_swap(struct folio *folio) swp_entry_t vswap_entry =3D folio->swap; struct swap_cluster_info *ci; struct swap_cluster_info_dynamic *ci_dyn; + struct mem_cgroup *memcg; unsigned int voff; + unsigned short memcg_id; swp_entry_t phys_entry =3D {}; swp_entry_t pe; int i, nr =3D folio_nr_pages(folio); @@ -2245,9 +2303,33 @@ swp_entry_t folio_realloc_swap(struct folio *folio) return (swp_entry_t){}; =20 voff =3D swp_cluster_offset(vswap_entry); - ci =3D __swap_entry_to_cluster(vswap_entry); ci_dyn =3D container_of(ci, struct swap_cluster_info_dynamic, ci); + + /* + * Resolve the memcg for physical swap charging. Compare + * folio_memcg against the recorded swap memcg ID =E2=80=94 on match + * (common case), zero IDR lookups. Only fall back to IDR + * lookup on mismatch (task migrated cgroups). + */ + spin_lock(&ci->lock); + memcg_id =3D __swap_cgroup_get(ci, voff); + spin_unlock(&ci->lock); + + rcu_read_lock(); + memcg =3D folio_memcg(folio); + if (!memcg || mem_cgroup_private_id(memcg) !=3D memcg_id) + memcg =3D memcg_id ? 
mem_cgroup_from_private_id(memcg_id) : NULL; + rcu_read_unlock(); + + if (mem_cgroup_charge_backing_phys_swap(memcg, nr)) { + __swap_cluster_free_phys_backing( + __swap_entry_to_info(phys_entry), + __swap_entry_to_cluster(phys_entry), + swp_cluster_offset(phys_entry), nr); + return (swp_entry_t){}; + } + spin_lock(&ci->lock); /* * Install PHYS backing without freeing any prior contents of the @@ -2468,10 +2550,11 @@ void __swap_cluster_free_entries(struct swap_info_s= truct *si, unsigned short batch_id =3D 0, id_cur; unsigned int ci_off =3D ci_start, ci_end =3D ci_start + nr_pages; unsigned int batch_off =3D ci_off; + bool is_vswap =3D swap_is_vswap(si); =20 VM_WARN_ON(ci->count < nr_pages); =20 - if (swap_is_vswap(si)) + if (is_vswap) vswap_release_backing(ci, ci_start, nr_pages); =20 ci->count -=3D nr_pages; @@ -2491,18 +2574,28 @@ void __swap_cluster_free_entries(struct swap_info_s= truct *si, /* * Uncharge swap slots by memcg in batches. Consecutive * slots with the same cgroup id are uncharged together. + * For vswap, only drop the ID ref =E2=80=94 physical swap was + * already uncharged in vswap_release_backing above. 
*/ id_cur =3D __swap_cgroup_clear(ci, ci_off, 1); if (batch_id !=3D id_cur) { - if (batch_id) - mem_cgroup_uncharge_swap(batch_id, ci_off - batch_off); + if (batch_id) { + if (is_vswap) + mem_cgroup_id_put_swap(batch_id, ci_off - batch_off); + else + mem_cgroup_uncharge_swap(batch_id, ci_off - batch_off); + } batch_id =3D id_cur; batch_off =3D ci_off; } } while (++ci_off < ci_end); =20 - if (batch_id) - mem_cgroup_uncharge_swap(batch_id, ci_off - batch_off); + if (batch_id) { + if (is_vswap) + mem_cgroup_id_put_swap(batch_id, ci_off - batch_off); + else + mem_cgroup_uncharge_swap(batch_id, ci_off - batch_off); + } =20 __swap_cluster_finish_free(si, ci, ci_start, nr_pages); } --=20 2.53.0-Meta
From nobody Mon Jun 8 14:35:25 2026
From: Nhat Pham
To: kasong@tencent.com
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org, apopple@nvidia.com, axelrasmussen@google.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, bhe@redhat.com, byungchul@sk.com, cgroups@vger.kernel.org, chengming.zhou@linux.dev, chrisl@kernel.org, corbet@lwn.net, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jannh@google.com, joshua.hahnjy@gmail.com, lance.yang@linux.dev, lenb@kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-pm@vger.kernel.org, lorenzo.stoakes@oracle.com, matthew.brost@intel.com, mhocko@suse.com, muchun.song@linux.dev, npache@redhat.com, nphamcs@gmail.com, pavel@kernel.org, peterx@redhat.com, peterz@infradead.org, pfalcato@suse.de, rafael@kernel.org, rakie.kim@sk.com,
roman.gushchin@linux.dev, rppt@kernel.org, ryan.roberts@arm.com, shakeel.butt@linux.dev, shikemeng@huaweicloud.com, surenb@google.com, tglx@kernel.org, vbabka@suse.cz, weixugc@google.com, ying.huang@linux.alibaba.com, yosry.ahmed@linux.dev, yuanchu@google.com, zhengqi.arch@bytedance.com, ziy@nvidia.com, kernel-team@meta.com, riel@surriel.com, haowenchao22@gmail.com Subject: [RFC PATCH 5/5] mm, swap: add debugfs counters for vswap Date: Thu, 28 May 2026 14:29:29 -0700 Message-ID: <20260528212955.1912856-6-nphamcs@gmail.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260528212955.1912856-1-nphamcs@gmail.com> References: <20260528212955.1912856-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add /sys/kernel/debug/vswap/ with two counters: - used: number of virtual swap slots currently allocated - alloc_reject: cumulative count of failed vswap allocations Signed-off-by: Nhat Pham --- mm/swapfile.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index be901fb741e5..3740ab764405 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -7,6 +7,7 @@ */ =20 #include +#include #include #include #include @@ -132,6 +133,9 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percp= u_swap_cluster) =3D { .lock =3D INIT_LOCAL_LOCK(), }; =20 +static atomic_t __maybe_unused vswap_used =3D ATOMIC_INIT(0); +static atomic_t __maybe_unused vswap_alloc_reject =3D ATOMIC_INIT(0); + #ifdef CONFIG_VSWAP struct percpu_vswap_cluster { unsigned long offset[SWAP_NR_ORDERS]; @@ -1993,11 +1997,13 @@ static bool vswap_alloc(struct folio *folio) if (folio_test_swapcache(folio)) { /* alloc_swap_scan_cluster updated percpu offset already */ local_unlock(&percpu_vswap_cluster.lock); + atomic_add(folio_nr_pages(folio), &vswap_used); return true; } =20 
this_cpu_write(percpu_vswap_cluster.offset[order], SWAP_ENTRY_INVALID); local_unlock(&percpu_vswap_cluster.lock); + atomic_add(folio_nr_pages(folio), &vswap_alloc_reject); return false; } #endif @@ -2554,8 +2560,10 @@ void __swap_cluster_free_entries(struct swap_info_st= ruct *si, =20 VM_WARN_ON(ci->count < nr_pages); =20 - if (is_vswap) + if (is_vswap) { vswap_release_backing(ci, ci_start, nr_pages); + atomic_sub(nr_pages, &vswap_used); + } =20 ci->count -=3D nr_pages; do { @@ -4793,6 +4801,7 @@ struct swap_info_struct *vswap_si; static int __init vswap_init(void) { struct swap_info_struct *si; + struct dentry *root; unsigned long maxpages; int err; =20 @@ -4819,6 +4828,11 @@ static int __init vswap_init(void) mutex_unlock(&swapon_mutex); =20 vswap_si =3D si; + + root =3D debugfs_create_dir("vswap", NULL); + debugfs_create_atomic_t("used", 0444, root, &vswap_used); + debugfs_create_atomic_t("alloc_reject", 0444, root, &vswap_alloc_reject); + pr_info("vswap: created virtual swap device (%lu pages)\n", maxpages); return 0; =20 --=20 2.53.0-Meta
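
For anyone who wants to check the counter semantics of patch 5/5 without building a kernel, the accounting can be sketched in userspace. This is a model under stated assumptions, not the patch's code: C11 atomics stand in for atomic_t, and model_vswap_alloc() and model_vswap_free() are hypothetical names invented here. It mirrors the three update sites in the diff above (the two exits of vswap_alloc() and the is_vswap branch of __swap_cluster_free_entries()). One thing it makes visible is that both counters move by folio_nr_pages(), so alloc_reject appears to advance by the number of slots in the rejected folio rather than by one per failed attempt.

```c
/*
 * Userspace model of the two vswap debugfs counters. NOT kernel code:
 * C11 atomics stand in for atomic_t, and the model_* helpers are
 * hypothetical names used only for this illustration.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int vswap_used;         /* virtual slots currently allocated */
static atomic_int vswap_alloc_reject; /* slots whose allocation was rejected */

/*
 * Models the tail of vswap_alloc(): on success the folio's slots are
 * added to "used"; on failure the same number is added to "alloc_reject".
 * got_slots stands in for the folio_test_swapcache() check.
 */
static bool model_vswap_alloc(int nr_pages, bool got_slots)
{
	if (got_slots) {
		atomic_fetch_add(&vswap_used, nr_pages);
		return true;
	}
	atomic_fetch_add(&vswap_alloc_reject, nr_pages);
	return false;
}

/*
 * Models the is_vswap branch of __swap_cluster_free_entries(): freeing
 * slots lowers "used" and never touches "alloc_reject".
 */
static void model_vswap_free(int nr_pages)
{
	/* Sanity check for the model only; the patch has no such assert. */
	assert(atomic_load(&vswap_used) >= nr_pages);
	atomic_fetch_sub(&vswap_used, nr_pages);
}
```

On a kernel with this series applied and debugfs mounted, the real values should be readable from the used and alloc_reject files under /sys/kernel/debug/vswap/. used is a gauge that returns to zero once all virtual slots are freed, while alloc_reject only grows.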