From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E4EF33BADA5; Sun, 17 May 2026 15:39:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; b=ov2RvWY4gTH/NgIiGPHKZMR+n3+Cuxx6EbV4sgjhvVYyuDK0BQTG6yBMcQTPnNOc4RrYOivj04inpYBkEizi6m+ZXTqF3E/yCnxFgvXjEnCiLGPEDfwCWpMLhru3+1XatWrUitzCGw2A0vFkI/wy0SUgiYRh07QirQgsYm2JGBY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; bh=WrYDbfBFzQbKFwmL/LhRddlWFpC0MEg/9HOStVj3dpM=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=H3IYIudPDSkC+tkTaoLHVTKDs/oLAhy5GMyC2yi+0rhyJ0wAh8bTE15fkDiYVfBhk2azKYVphwvLJ23xnWEa022FmORgXM5zPiLdiw4tDE/A7fYdAMbh3wynDe9JLwJVoTPX1P6dLFRAMMetdA0/CSBMRYJ0QyOxFdR2/ZXyTdI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Xdqvmb7D; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Xdqvmb7D" Received: by smtp.kernel.org (Postfix) with ESMTPS id A8AAAC2BCF5; Sun, 17 May 2026 15:39:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032388; bh=WrYDbfBFzQbKFwmL/LhRddlWFpC0MEg/9HOStVj3dpM=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=Xdqvmb7DgeZ7DQ9l3l5MaA+zgAeGeBk64orzcD+UxMTfu6sJNzLc7sHg4sCocZ7zL ppXy3pk7gZjB/GnnTy3RmwSr58to/JsYVv0eQrM9Jkr1Yi64bBD8ZoGIw3LkQ7MF/W 0YRkq4mN1IO+lWA0F7kktuW1eriwFQ0Z06358PloMlfC5C41Zcw+7tUWWS3WKL1l2d 
l2Cn/ljedB8KAgf6XQ7/TZZnToXpHVLhMf0xPapHDEDvvEBCAFKZiY5u8CtPps8gFp mbSV39DRpDEynHMhxU6nsGI/0FmYSkNoOwkXksO2bMiYHzVJRMbtPgte5imDbscZBH dClAKw+HU+qZQ== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 97C4FCD4F21; Sun, 17 May 2026 15:39:48 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:40 +0800 Subject: [PATCH v5 01/12] mm, swap: simplify swap cache allocation helper Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-1-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , Baolin Wang , Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1779032385; l=14115; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=fan8Q7+GCcE8wHejlqVfY3DrWN1Psk2KK8MfUrluepA=; b=wDvENzWV3OyBcQTTlhAw3+Unh2izor1nL1OcsUuH1bKG2v29mHPfIy4dFBuOxHJUdSjtxgilg 9w+UOgDOo4sD+LuxjUP7Z5Wp4Z9LzDhQr5y2u6aUQEKSbYYoFTRhwTN X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: kasong@tencent.com From: Kairui Song Instead of trying to return the existing folio if the entry is already cached in swap_cache_alloc_folio, 
simply return an error pointer if the allocation fails, and drop the output argument that indicates what kind of folio is actually returned. Add a proper wrapper, swap_cache_read_folio, that decouples and handles the actual requirement: read in the folio, or return the folio that is already in the swap cache. This is what async swapin and readahead actually require. As for zswap swap out, the caller just needs to abort if the allocation fails because the entry is gone or already cached, so removing the output argument simplifies the interface and makes it cleaner. No feature change. Acked-by: Chris Li Signed-off-by: Kairui Song --- mm/swap.h | 3 +- mm/swap_state.c | 180 +++++++++++++++++++++++++++++-----------------------= ---- mm/zswap.c | 23 +++----- 3 files changed, 103 insertions(+), 103 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index a77016f2423b..ad8b17a93758 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -281,8 +281,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, - struct mempolicy *mpol, pgoff_t ilx, - bool *alloced); + struct mempolicy *mpol, pgoff_t ilx); /* Below helpers require the caller to lock and pass in the swap cluster. = */ void __swap_cache_add_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry); diff --git a/mm/swap_state.c b/mm/swap_state.c index 1415a5c54a43..3bba82f6dc79 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -459,54 +459,38 @@ void swap_update_readahead(struct folio *folio, struc= t vm_area_struct *vma, * All swap slots covered by the folio must have a non-zero swap count. * * Context: Caller must protect the swap device with reference count or lo= cks. - * Return: Returns the folio being added on success. Returns the existing = folio - * if @entry is already cached. Returns NULL if raced with swapin or swapo= ff. 
+ * Return: 0 if success, error code if failed. */ -static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry, - struct folio *folio, - gfp_t gfp, bool charged) +static int __swap_cache_prepare_and_add(swp_entry_t entry, + struct folio *folio, + gfp_t gfp, bool charged) { - struct folio *swapcache =3D NULL; void *shadow; int ret; =20 __folio_set_locked(folio); __folio_set_swapbacked(folio); =20 - if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) + if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) { + ret =3D -ENOMEM; goto failed; - - for (;;) { - ret =3D swap_cache_add_folio(folio, entry, &shadow); - if (!ret) - break; - - /* - * Large order allocation needs special handling on - * race: if a smaller folio exists in cache, swapin needs - * to fallback to order 0, and doing a swap cache lookup - * might return a folio that is irrelevant to the faulting - * entry because @entry is aligned down. Just return NULL. - */ - if (ret !=3D -EEXIST || folio_test_large(folio)) - goto failed; - - swapcache =3D swap_cache_get_folio(entry); - if (swapcache) - goto failed; } =20 + ret =3D swap_cache_add_folio(folio, entry, &shadow); + if (ret) + goto failed; + memcg1_swapin(entry, folio_nr_pages(folio)); if (shadow) workingset_refault(folio, shadow); =20 /* Caller will initiate read into locked folio */ folio_add_lru(folio); - return folio; + return 0; =20 failed: folio_unlock(folio); - return swapcache; + return ret; } =20 /** @@ -515,7 +499,6 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, * @gfp_mask: memory allocation flags * @mpol: NUMA memory allocation policy to be applied * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE - * @new_page_allocated: sets true if allocation happened, false otherwise * * Allocate a folio in the swap cache for one swap slot, typically before * doing IO (e.g. swap in or zswap writeback). 
The swap slot indicated by @@ -523,18 +506,40 @@ static struct folio *__swap_cache_prepare_and_add(swp= _entry_t entry, * Currently only supports order 0. * * Context: Caller must protect the swap device with reference count or lo= cks. - * Return: Returns the existing folio if @entry is cached already. Returns - * NULL if failed due to -ENOMEM or @entry have a swap count < 1. + * Return: Returns the folio if allocation succeeded and folio is added to + * swap cache. Returns error code if allocation failed due to race or OOM. */ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask, - struct mempolicy *mpol, pgoff_t ilx, - bool *new_page_allocated) + struct mempolicy *mpol, pgoff_t ilx) +{ + int err; + struct folio *folio; + + /* Allocate a new folio to be added into the swap cache. */ + folio =3D folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); + if (!folio) + return ERR_PTR(-ENOMEM); + + /* + * Try to add the new folio to the swap cache. It returns + * -EEXIST if the entry is already cached. + */ + err =3D __swap_cache_prepare_and_add(entry, folio, gfp_mask, false); + if (err) { + folio_put(folio); + return ERR_PTR(err); + } + + return folio; +} + +static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp, + struct mempolicy *mpol, pgoff_t ilx, + struct swap_iocb **plug, bool readahead) { struct swap_info_struct *si =3D __swap_entry_to_info(entry); struct folio *folio; - struct folio *result =3D NULL; =20 - *new_page_allocated =3D false; /* Check the swap cache again for readahead path. */ folio =3D swap_cache_get_folio(entry); if (folio) @@ -544,17 +549,24 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entr= y, gfp_t gfp_mask, if (!swap_entry_swapped(si, entry)) return NULL; =20 - /* Allocate a new folio to be added into the swap cache. 
*/ - folio =3D folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); - if (!folio) + do { + folio =3D swap_cache_get_folio(entry); + if (folio) + return folio; + + folio =3D swap_cache_alloc_folio(entry, gfp, mpol, ilx); + } while (PTR_ERR(folio) =3D=3D -EEXIST); + + if (IS_ERR_OR_NULL(folio)) return NULL; - /* Try add the new folio, returns existing folio or NULL on failure. */ - result =3D __swap_cache_prepare_and_add(entry, folio, gfp_mask, false); - if (result =3D=3D folio) - *new_page_allocated =3D true; - else - folio_put(folio); - return result; + + swap_read_folio(folio, plug); + if (readahead) { + folio_set_readahead(folio); + count_vm_event(SWAP_RA); + } + + return folio; } =20 /** @@ -573,15 +585,35 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entr= y, gfp_t gfp_mask, */ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio) { + int ret; struct folio *swapcache; pgoff_t offset =3D swp_offset(entry); unsigned long nr_pages =3D folio_nr_pages(folio); =20 entry =3D swp_entry(swp_type(entry), round_down(offset, nr_pages)); - swapcache =3D __swap_cache_prepare_and_add(entry, folio, 0, true); - if (swapcache =3D=3D folio) - swap_read_folio(folio, NULL); - return swapcache; + for (;;) { + ret =3D __swap_cache_prepare_and_add(entry, folio, 0, true); + if (!ret) { + swap_read_folio(folio, NULL); + break; + } + + /* + * Large order allocation needs special handling on + * race: if a smaller folio exists in cache, swapin needs + * to fall back to order 0, and doing a swap cache lookup + * might return a folio that is irrelevant to the faulting + * entry because @entry is aligned down. Just return NULL. 
+ */ + if (ret !=3D -EEXIST || nr_pages > 1) + return NULL; + + swapcache =3D swap_cache_get_folio(entry); + if (swapcache) + return swapcache; + } + + return folio; } =20 /* @@ -595,7 +627,6 @@ struct folio *read_swap_cache_async(swp_entry_t entry, = gfp_t gfp_mask, struct swap_iocb **plug) { struct swap_info_struct *si; - bool page_allocated; struct mempolicy *mpol; pgoff_t ilx; struct folio *folio; @@ -605,13 +636,9 @@ struct folio *read_swap_cache_async(swp_entry_t entry,= gfp_t gfp_mask, return NULL; =20 mpol =3D get_vma_policy(vma, addr, 0, &ilx); - folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated); + folio =3D swap_cache_read_folio(entry, gfp_mask, mpol, ilx, plug, false); mpol_cond_put(mpol); =20 - if (page_allocated) - swap_read_folio(folio, plug); - put_swap_device(si); return folio; } @@ -696,7 +723,7 @@ static unsigned long swapin_nr_pages(unsigned long offs= et) * are fairly likely to have been swapped out from the same node. */ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, - struct mempolicy *mpol, pgoff_t ilx) + struct mempolicy *mpol, pgoff_t ilx) { struct folio *folio; unsigned long entry_offset =3D swp_offset(entry); @@ -706,7 +733,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, struct swap_info_struct *si =3D __swap_entry_to_info(entry); struct blk_plug plug; struct swap_iocb *splug =3D NULL; - bool page_allocated; + swp_entry_t ra_entry; =20 mask =3D swapin_nr_pages(offset) - 1; if (!mask) @@ -723,18 +750,11 @@ struct folio *swap_cluster_readahead(swp_entry_t entr= y, gfp_t gfp_mask, blk_start_plug(&plug); for (offset =3D start_offset; offset <=3D end_offset ; offset++) { /* Ok, do the async read-ahead now */ - folio =3D swap_cache_alloc_folio( - swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx, - &page_allocated); + ra_entry =3D swp_entry(swp_type(entry), offset); + folio =3D swap_cache_read_folio(ra_entry, gfp_mask, mpol, ilx, + &splug, offset !=3D 
entry_offset); if (!folio) continue; - if (page_allocated) { - swap_read_folio(folio, &splug); - if (offset !=3D entry_offset) { - folio_set_readahead(folio); - count_vm_event(SWAP_RA); - } - } folio_put(folio); } blk_finish_plug(&plug); @@ -742,11 +762,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry= , gfp_t gfp_mask, lru_add_drain(); /* Push any new pages onto the LRU now */ skip: /* The page was likely read above, so no need for plugging here */ - folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated); - if (unlikely(page_allocated)) - swap_read_folio(folio, NULL); - return folio; + return swap_cache_read_folio(entry, gfp_mask, mpol, ilx, NULL, false); } =20 static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start, @@ -812,8 +828,7 @@ static struct folio *swap_vma_readahead(swp_entry_t tar= g_entry, gfp_t gfp_mask, pte_t *pte =3D NULL, pentry; int win; unsigned long start, end, addr; - pgoff_t ilx; - bool page_allocated; + pgoff_t ilx =3D targ_ilx; =20 win =3D swap_vma_ra_win(vmf, &start, &end); if (win =3D=3D 1) @@ -847,19 +862,12 @@ static struct folio *swap_vma_readahead(swp_entry_t t= arg_entry, gfp_t gfp_mask, if (!si) continue; } - folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated); + folio =3D swap_cache_read_folio(entry, gfp_mask, mpol, ilx, + &splug, addr !=3D vmf->address); if (si) put_swap_device(si); if (!folio) continue; - if (page_allocated) { - swap_read_folio(folio, &splug); - if (addr !=3D vmf->address) { - folio_set_readahead(folio); - count_vm_event(SWAP_RA); - } - } folio_put(folio); } if (pte) @@ -869,10 +877,8 @@ static struct folio *swap_vma_readahead(swp_entry_t ta= rg_entry, gfp_t gfp_mask, lru_add_drain(); skip: /* The folio was likely read above, so no need for plugging here */ - folio =3D swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx, - &page_allocated); - if (unlikely(page_allocated)) - swap_read_folio(folio, NULL); + folio =3D 
swap_cache_read_folio(targ_entry, gfp_mask, mpol, targ_ilx, + NULL, false); return folio; } =20 diff --git a/mm/zswap.c b/mm/zswap.c index 4b5149173b0e..e27f6e96f003 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -991,7 +991,6 @@ static int zswap_writeback_entry(struct zswap_entry *en= try, pgoff_t offset =3D swp_offset(swpentry); struct folio *folio; struct mempolicy *mpol; - bool folio_was_allocated; struct swap_info_struct *si; int ret =3D 0; =20 @@ -1002,22 +1001,18 @@ static int zswap_writeback_entry(struct zswap_entry= *entry, =20 mpol =3D get_task_policy(current); folio =3D swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol, - NO_INTERLEAVE_INDEX, &folio_was_allocated); + NO_INTERLEAVE_INDEX); put_swap_device(si); - if (!folio) - return -ENOMEM; =20 /* - * Found an existing folio, we raced with swapin or concurrent - * shrinker. We generally writeback cold folios from zswap, and - * swapin means the folio just became hot, so skip this folio. - * For unlikely concurrent shrinker case, it will be unlinked - * and freed when invalidated by the concurrent shrinker anyway. + * Swap cache allocation might fail due to OOM, or the entry + * may already be cached due to concurrent swapin or have been + * freed. If already cached, a concurrent swapin made the folio + * hot, so skip it. For the unlikely concurrent shrinker case, + * it will be unlinked and freed when invalidated anyway. 
*/ - if (!folio_was_allocated) { - ret =3D -EEXIST; - goto out; - } + if (IS_ERR(folio)) + return PTR_ERR(folio); =20 /* * folio is locked, and the swapcache is now secured against @@ -1057,7 +1052,7 @@ static int zswap_writeback_entry(struct zswap_entry *= entry, __swap_writepage(folio, NULL); =20 out: - if (ret && ret !=3D -EEXIST) { + if (ret) { swap_cache_del_folio(folio); folio_unlock(folio); } --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:41 +0800 Subject: [PATCH v5 02/12] mm, swap: move common swap cache operations into standalone helpers Message-Id: <20260517-swap-table-p4-v5-2-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> Reply-To: kasong@tencent.com From: Kairui Song Move a few swap cache checking, adding, and deletion operations into standalone helpers to be used later. And while at it, add proper kernel doc. No feature or behavior change. Acked-by: Chris Li Signed-off-by: Kairui Song --- mm/swap_state.c | 146 ++++++++++++++++++++++++++++++++++++++--------------= ---- 1 file changed, 100 insertions(+), 46 deletions(-) diff --git a/mm/swap_state.c b/mm/swap_state.c index 3bba82f6dc79..89fa19ec13f6 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -137,8 +137,47 @@ void *swap_cache_get_shadow(swp_entry_t entry) return NULL; } =20 -void __swap_cache_add_folio(struct swap_cluster_info *ci, - struct folio *folio, swp_entry_t entry) +/** + * __swap_cache_add_check - Check if a range is suitable for adding a foli= o. + * @ci: The locked swap cluster. + * @ci_off: Range start offset. + * @nr: Number of slots to check. + * @shadow: Returns the shadow value if one exists in the range. + * + * Check if all slots covered by given range have a swap count >=3D 1. + * Retrieves the shadow if there is one. + * + * Context: Caller must lock the cluster. + * Return: 0 if success, error code if failed. 
+ */ +static int __swap_cache_add_check(struct swap_cluster_info *ci, + unsigned int ci_off, unsigned int nr, + void **shadow) +{ + unsigned int ci_end =3D ci_off + nr; + unsigned long old_tb; + + lockdep_assert_held(&ci->lock); + if (WARN_ON_ONCE(ci_off >=3D SWAPFILE_CLUSTER)) + return -EINVAL; + + if (unlikely(!ci->table)) + return -ENOENT; + do { + old_tb =3D __swap_table_get(ci, ci_off); + if (unlikely(swp_tb_is_folio(old_tb))) + return -EEXIST; + if (unlikely(!__swp_tb_get_count(old_tb))) + return -ENOENT; + if (swp_tb_is_shadow(old_tb)) + *shadow =3D swp_tb_to_shadow(old_tb); + } while (++ci_off < ci_end); + + return 0; +} + +static void __swap_cache_do_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry) { unsigned int ci_off =3D swp_cluster_offset(entry), ci_end; unsigned long nr_pages =3D folio_nr_pages(folio); @@ -159,7 +198,28 @@ void __swap_cache_add_folio(struct swap_cluster_info *= ci, folio_ref_add(folio, nr_pages); folio_set_swapcache(folio); folio->swap =3D entry; +} + +/** + * __swap_cache_add_folio - Add a folio to the swap cache and update stats. + * @ci: The locked swap cluster. + * @folio: The folio to be added. + * @entry: The swap entry corresponding to the folio. + * + * Unconditionally add a folio to the swap cache. The caller must ensure + * all slots are usable and have no conflicts. This assigns entry to + * @folio->swap, increases folio refcount by the number of pages, and + * updates swap cache stats. + * + * Context: Caller must ensure the folio is locked and lock the cluster + * that holds the entries. 
+ */ +void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry) +{ + unsigned long nr_pages =3D folio_nr_pages(folio); =20 + __swap_cache_do_add_folio(ci, folio, entry); node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); } @@ -168,9 +228,11 @@ void __swap_cache_add_folio(struct swap_cluster_info *= ci, * swap_cache_add_folio - Add a folio into the swap cache. * @folio: The folio to be added. * @entry: The swap entry corresponding to the folio. - * @gfp: gfp_mask for XArray node allocation. * @shadowp: If a shadow is found, return the shadow. * + * Add a folio into the swap cache. Will return error if any slot is no + * longer a valid swapped out slot or already occupied by another folio. + * * Context: Caller must ensure @entry is valid and protect the swap device * with reference count or locks. */ @@ -179,60 +241,31 @@ static int swap_cache_add_folio(struct folio *folio, = swp_entry_t entry, { int err; void *shadow =3D NULL; - unsigned long old_tb; + unsigned int ci_off; struct swap_info_struct *si; struct swap_cluster_info *ci; - unsigned int ci_start, ci_off, ci_end; unsigned long nr_pages =3D folio_nr_pages(folio); =20 si =3D __swap_entry_to_info(entry); - ci_start =3D swp_cluster_offset(entry); - ci_end =3D ci_start + nr_pages; - ci_off =3D ci_start; ci =3D swap_cluster_lock(si, swp_offset(entry)); - if (unlikely(!ci->table)) { - err =3D -ENOENT; - goto failed; + ci_off =3D swp_cluster_offset(entry); + err =3D __swap_cache_add_check(ci, ci_off, nr_pages, &shadow); + if (err) { + swap_cluster_unlock(ci); + return err; } - do { - old_tb =3D __swap_table_get(ci, ci_off); - if (unlikely(swp_tb_is_folio(old_tb))) { - err =3D -EEXIST; - goto failed; - } - if (unlikely(!__swp_tb_get_count(old_tb))) { - err =3D -ENOENT; - goto failed; - } - if (swp_tb_is_shadow(old_tb)) - shadow =3D swp_tb_to_shadow(old_tb); - } while (++ci_off < ci_end); + __swap_cache_add_folio(ci, 
folio, entry); swap_cluster_unlock(ci); if (shadowp) *shadowp =3D shadow; - return 0; =20 -failed: - swap_cluster_unlock(ci); - return err; + return 0; } =20 -/** - * __swap_cache_del_folio - Removes a folio from the swap cache. - * @ci: The locked swap cluster. - * @folio: The folio. - * @entry: The first swap entry that the folio corresponds to. - * @shadow: shadow value to be filled in the swap cache. - * - * Removes a folio from the swap cache and fills a shadow in place. - * This won't put the folio's refcount. The caller has to do that. - * - * Context: Caller must ensure the folio is locked and in the swap cache - * using the index of @entry, and lock the cluster that holds the entries. - */ -void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *fo= lio, - swp_entry_t entry, void *shadow) +static void __swap_cache_do_del_folio(struct swap_cluster_info *ci, + struct folio *folio, + swp_entry_t entry, void *shadow) { int count; unsigned long old_tb; @@ -259,14 +292,12 @@ void __swap_cache_del_folio(struct swap_cluster_info = *ci, struct folio *folio, folio_swapped =3D true; else need_free =3D true; - /* If shadow is NULL, we sets an empty shadow. */ + /* If shadow is NULL, we set an empty shadow. */ __swap_table_set(ci, ci_off, shadow_to_swp_tb(shadow, count)); } while (++ci_off < ci_end); =20 folio->swap.val =3D 0; folio_clear_swapcache(folio); - node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages); - lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); =20 if (!folio_swapped) { __swap_cluster_free_entries(si, ci, ci_start, nr_pages); @@ -279,6 +310,29 @@ void __swap_cache_del_folio(struct swap_cluster_info *= ci, struct folio *folio, } } =20 +/** + * __swap_cache_del_folio - Removes a folio from the swap cache. + * @ci: The locked swap cluster. + * @folio: The folio. + * @entry: The first swap entry that the folio corresponds to. + * @shadow: shadow value to be filled in the swap cache. 
+ * + * Removes a folio from the swap cache and fills a shadow in place. + * This won't put the folio's refcount. The caller has to do that. + * + * Context: Caller must ensure the folio is locked and in the swap cache + * using the index of @entry, and lock the cluster that holds the entries. + */ +void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *fo= lio, + swp_entry_t entry, void *shadow) +{ + unsigned long nr_pages =3D folio_nr_pages(folio); + + __swap_cache_do_del_folio(ci, folio, entry, shadow); + node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages); + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); +} + /** * swap_cache_del_folio - Removes a folio from the swap cache. * @folio: The folio. --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:42 +0800 Subject: [PATCH v5 03/12] mm/huge_memory: move THP gfp limit helper into header Message-Id: <20260517-swap-table-p4-v5-3-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> Reply-To: kasong@tencent.com From: Kairui Song Shmem has some special requirements for THP GFP and has to limit it in certain zones or provide a more lenient fallback. We'll use this helper for generic swap THP allocation, which needs to support shmem. For a typical GFP_HIGHUSER_MOVABLE swap-in, this helper is basically a no-op. But it's necessary for certain shmem users, mostly drivers. No feature change. Acked-by: Chris Li Reviewed-by: Zi Yan Signed-off-by: Kairui Song --- include/linux/huge_mm.h | 30 ++++++++++++++++++++++++++++++ mm/shmem.c | 30 +++--------------------------- 2 files changed, 33 insertions(+), 27 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 127f9e1e7604..edece3e26985 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -242,6 +242,31 @@ static inline bool thp_vma_suitable_order(struct vm_ar= ea_struct *vma, return true; } =20 +/* + * Make sure huge_gfp is always more limited than limit_gfp. + * Some shmem users want THP allocation to be done less aggressively + * and only in certain zone. 
+ */ +static inline gfp_t thp_shmem_limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_g= fp) +{ + gfp_t allowflags =3D __GFP_IO | __GFP_FS | __GFP_RECLAIM; + gfp_t denyflags =3D __GFP_NOWARN | __GFP_NORETRY; + gfp_t zoneflags =3D limit_gfp & GFP_ZONEMASK; + gfp_t result =3D huge_gfp & ~(allowflags | GFP_ZONEMASK); + + /* Allow allocations only from the originally specified zones. */ + result |=3D zoneflags; + + /* + * Minimize the result gfp by taking the union with the deny flags, + * and the intersection of the allow flags. + */ + result |=3D (limit_gfp & denyflags); + result |=3D (huge_gfp & limit_gfp) & allowflags; + + return result; +} + /* * Filter the bitfield of input orders to the ones suitable for use in the= vma. * See thp_vma_suitable_order(). @@ -565,6 +590,11 @@ static inline bool thp_vma_suitable_order(struct vm_ar= ea_struct *vma, return false; } =20 +static inline gfp_t thp_shmem_limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_g= fp) +{ + return huge_gfp; +} + static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct = *vma, unsigned long addr, unsigned long orders) { diff --git a/mm/shmem.c b/mm/shmem.c index bab3529af23c..6edb23b41bac 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1791,30 +1791,6 @@ static struct folio *shmem_swapin_cluster(swp_entry_= t swap, gfp_t gfp, return folio; } =20 -/* - * Make sure huge_gfp is always more limited than limit_gfp. - * Some of the flags set permissions, while others set limitations. - */ -static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) -{ - gfp_t allowflags =3D __GFP_IO | __GFP_FS | __GFP_RECLAIM; - gfp_t denyflags =3D __GFP_NOWARN | __GFP_NORETRY; - gfp_t zoneflags =3D limit_gfp & GFP_ZONEMASK; - gfp_t result =3D huge_gfp & ~(allowflags | GFP_ZONEMASK); - - /* Allow allocations only from the originally specified zones. */ - result |=3D zoneflags; - - /* - * Minimize the result gfp by taking the union with the deny flags, - * and the intersection of the allow flags. 
- */ - result |=3D (limit_gfp & denyflags); - result |=3D (huge_gfp & limit_gfp) & allowflags; - - return result; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE bool shmem_hpage_pmd_enabled(void) { @@ -2065,7 +2041,7 @@ static struct folio *shmem_swap_alloc_folio(struct in= ode *inode, non_swapcache_batch(entry, nr_pages) !=3D nr_pages) goto fallback; =20 - alloc_gfp =3D limit_gfp_mask(vma_thp_gfp_mask(vma), gfp); + alloc_gfp =3D thp_shmem_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp); } retry: new =3D shmem_alloc_folio(alloc_gfp, order, info, index); @@ -2141,7 +2117,7 @@ static int shmem_replace_folio(struct folio **foliop,= gfp_t gfp, if (nr_pages > 1) { gfp_t huge_gfp =3D vma_thp_gfp_mask(vma); =20 - gfp =3D limit_gfp_mask(huge_gfp, gfp); + gfp =3D thp_shmem_limit_gfp_mask(huge_gfp, gfp); } #endif =20 @@ -2548,7 +2524,7 @@ static int shmem_get_folio_gfp(struct inode *inode, p= goff_t index, gfp_t huge_gfp; =20 huge_gfp =3D vma_thp_gfp_mask(vma); - huge_gfp =3D limit_gfp_mask(huge_gfp, gfp); + huge_gfp =3D thp_shmem_limit_gfp_mask(huge_gfp, gfp); folio =3D shmem_alloc_and_add_folio(vmf, huge_gfp, inode, index, fault_mm, orders); if (!IS_ERR(folio)) { --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2F5C73BBA1A; Sun, 17 May 2026 15:39:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; b=jSCHAjwsm3ifhsYjTP3M7BIxuktgIwuf9BuibXo4sbziFGlbiR3kVCq5DMptvmFus20AFkOLXmSo+cuYZNG64S6YnXcS8fbI399jcsV/67vkYQiXPsCxf3p5zpM7SxEV8uSOWw+V1LHidOtw8hiHR0ZrnzjAKZn31xfusL3fPYk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; 
bh=qHup8/H3gLFR/K0BuY9xLabpcbkq6dZx7UHyYFi0q7w=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=qZAufEI2bC3P08+etEqkyDrhjv5mxjkSqXIsv9QWlmQCDfKCosMyuTjzaN0U3va6givZ1khuJeBeQBBV6pI+ybfu5Dt6esp/u/vZny9pwuOchhZtRLp7Dk4vi0UXeXcembT0LKr5eLdESO5JGg56DUO7/C6nXSE/JDTbrF28ksY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=ZAdJA+AF; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="ZAdJA+AF" Received: by smtp.kernel.org (Postfix) with ESMTPS id D98EBC2BCFA; Sun, 17 May 2026 15:39:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032388; bh=qHup8/H3gLFR/K0BuY9xLabpcbkq6dZx7UHyYFi0q7w=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=ZAdJA+AFqZMkjLjg8+D5Zy3PBxXBpjSB84NesVFGV5lTD9zpeWV8f67av1NnWTppB 7jMHBeblHMxcqw57kp0Vu6ZbtekE/PAl8flW6m6SlqTy5sFvLkUqneLJYEvWijGAAa OQ4cs1ZMSWVgSpDr7gGN+gohzVwGvdUdN+Qfg/yi+6lz2gI+R/yiv7e32C8e6Z5bSf 0nFIfp8wVEg2/zY1LjTi5NmkqIWfhWAYctakavOaFpGewEY1bKcFMVPu4QJx0USls5 Bji8EiqCR7Eu0Fj7axoTvrCFhj93fcyRTiBcO2jD79yPiJuy0cw2Kwh0QknSAoh269 Dr25NPLpcErXg== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id CEF37CD4F3D; Sun, 17 May 2026 15:39:48 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:43 +0800 Subject: [PATCH v5 04/12] mm, swap: add support for stable large allocation in swap cache directly Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-4-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> 
In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , Baolin Wang , Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1779032385; l=13317; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=+ROKUM3ujlgzPiOEtu/gqJTIAINsw+BID9FIhQxsBcs=; b=2x9qwXMjT3F8ICOFKmjO1zK9qb6NkYKxBpjS55TFiSQ8TYc3KaIel9PqxuD8vpss+uO7jDbQ0 TzlBXm6E36lDpL0V4fM4fpvSAYJwzgqTr9fY2nB7YGCjvwSbCDmyTQ9 X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: kasong@tencent.com From: Kairui Song To make it possible to allocate large folios directly in swap cache, provide a new infrastructure helper to handle the swap cache status check, allocation, and order fallback in the swap cache layer. The new helper replaces the existing swap_cache_alloc_folio. Based on this, all the separate swap folio allocations previously done by anon / shmem are converted to use this helper directly, unifying folio allocation for anon, shmem, and readahead. This slightly consolidates how allocation is synchronized, making it more stable and less prone to errors. The slot-count and cache-conflict check is now always performed with the cluster lock held before allocation, and repeated under the same lock right before cache insertion. 
This double check produces a stable result compared to the previous anon and shmem mTHP allocation implementation. It avoids the false-negative conflict checks that the lockless path can return, so large allocations no longer have to be unwound because the range turned out to be occupied. It also aborts early for already-freed slots, which helps ordinary swapin and especially readahead. The cost is only a marginal increase in cluster-lock contention (the lock is very lightly contended and stays local in the first place). Hence, callers of swap_cache_alloc_folio() no longer need to check the swap slot count or swap cache status themselves. And now whoever first successfully allocates a folio in the swap cache will be the one who charges it and performs the swap-in. The race window of swapping is also reduced since the loop is much more compact. Signed-off-by: Kairui Song --- mm/swap.h | 3 +- mm/swap_state.c | 236 +++++++++++++++++++++++++++++++++++++++-------------= ---- mm/zswap.c | 2 +- 3 files changed, 170 insertions(+), 71 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index ad8b17a93758..6774af10a943 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -280,7 +280,8 @@ bool swap_cache_has_folio(swp_entry_t entry); struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); void swap_cache_del_folio(struct folio *folio); -struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, +struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_m= ask, + unsigned long orders, struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx); /* Below helpers require the caller to lock and pass in the swap cluster. 
= */ void __swap_cache_add_folio(struct swap_cluster_info *ci, diff --git a/mm/swap_state.c b/mm/swap_state.c index 89fa19ec13f6..0adb0565bbb1 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -139,10 +139,10 @@ void *swap_cache_get_shadow(swp_entry_t entry) =20 /** * __swap_cache_add_check - Check if a range is suitable for adding a foli= o. - * @ci: The locked swap cluster. - * @ci_off: Range start offset. - * @nr: Number of slots to check. - * @shadow: Returns the shadow value if one exists in the range. + * @ci: The locked swap cluster + * @targ_entry: The target swap entry to check, will be rounded down by @nr + * @nr: Number of slots to check, must be a power of 2 + * @shadowp: Returns the shadow value if one exists in the range. * * Check if all slots covered by given range have a swap count >=3D 1. * Retrieves the shadow if there is one. @@ -151,26 +151,40 @@ void *swap_cache_get_shadow(swp_entry_t entry) * Return: 0 if success, error code if failed. */ static int __swap_cache_add_check(struct swap_cluster_info *ci, - unsigned int ci_off, unsigned int nr, - void **shadow) + swp_entry_t targ_entry, + unsigned long nr, void **shadowp) { - unsigned int ci_end =3D ci_off + nr; + unsigned int ci_off, ci_end; unsigned long old_tb; =20 lockdep_assert_held(&ci->lock); - if (WARN_ON_ONCE(ci_off >=3D SWAPFILE_CLUSTER)) - return -EINVAL; =20 + /* + * If the target slot is not swapped out or already cached, return + * -ENOENT or -EEXIST. If the batch is not suitable, could be a + * race with concurrent free or cache add, return -EBUSY. 
+ */ if (unlikely(!ci->table)) return -ENOENT; + ci_off =3D swp_cluster_offset(targ_entry); + old_tb =3D __swap_table_get(ci, ci_off); + if (swp_tb_is_folio(old_tb)) + return -EEXIST; + if (!__swp_tb_get_count(old_tb)) + return -ENOENT; + if (swp_tb_is_shadow(old_tb) && shadowp) + *shadowp =3D swp_tb_to_shadow(old_tb); + + if (nr =3D=3D 1) + return 0; + + ci_off =3D round_down(ci_off, nr); + ci_end =3D ci_off + nr; do { old_tb =3D __swap_table_get(ci, ci_off); - if (unlikely(swp_tb_is_folio(old_tb))) - return -EEXIST; - if (unlikely(!__swp_tb_get_count(old_tb))) - return -ENOENT; - if (swp_tb_is_shadow(old_tb)) - *shadow =3D swp_tb_to_shadow(old_tb); + if (unlikely(swp_tb_is_folio(old_tb) || + !__swp_tb_get_count(old_tb))) + return -EBUSY; } while (++ci_off < ci_end); =20 return 0; @@ -241,15 +255,13 @@ static int swap_cache_add_folio(struct folio *folio, = swp_entry_t entry, { int err; void *shadow =3D NULL; - unsigned int ci_off; struct swap_info_struct *si; struct swap_cluster_info *ci; unsigned long nr_pages =3D folio_nr_pages(folio); =20 si =3D __swap_entry_to_info(entry); ci =3D swap_cluster_lock(si, swp_offset(entry)); - ci_off =3D swp_cluster_offset(entry); - err =3D __swap_cache_add_check(ci, ci_off, nr_pages, &shadow); + err =3D __swap_cache_add_check(ci, entry, nr_pages, &shadow); if (err) { swap_cluster_unlock(ci); return err; @@ -404,6 +416,142 @@ void __swap_cache_replace_folio(struct swap_cluster_i= nfo *ci, } } =20 +/* + * Try to allocate a folio of given order in the swap cache. + * + * This helper resolves the potential races of swap allocation + * and prepares a folio to be used for swap IO. 
May return following + * value: + * + * -ENOMEM / -EBUSY: Order is too large or in conflict with sub slot, + * caller should shrink the order and retry + * -ENOENT / -EEXIST: Target swap entry is unavailable or cached, the call= er + * should abort or try to use the cached folio instead + */ +static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci, + swp_entry_t targ_entry, gfp_t gfp, + unsigned int order, struct vm_fault *vmf, + struct mempolicy *mpol, pgoff_t ilx) +{ + int err; + swp_entry_t entry; + struct folio *folio; + void *shadow =3D NULL; + unsigned long address, nr_pages =3D 1UL << order; + struct vm_area_struct *vma =3D vmf ? vmf->vma : NULL; + + VM_WARN_ON_ONCE(nr_pages > SWAPFILE_CLUSTER); + entry.val =3D round_down(targ_entry.val, nr_pages); + + /* Check if the slot and range are available, skip allocation if not */ + spin_lock(&ci->lock); + err =3D __swap_cache_add_check(ci, targ_entry, nr_pages, NULL); + spin_unlock(&ci->lock); + if (unlikely(err)) + return ERR_PTR(err); + + /* + * Limit THP gfp. The limitation is a no-op for typical + * GFP_HIGHUSER_MOVABLE but matters for shmem. + */ + if (order) + gfp =3D thp_shmem_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp); + + if (mpol || !vmf) { + folio =3D folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id()); + } else { + address =3D round_down(vmf->address, PAGE_SIZE << order); + folio =3D vma_alloc_folio(gfp, order, vmf->vma, address); + } + if (unlikely(!folio)) + return ERR_PTR(-ENOMEM); + + /* Double check the range is still not in conflict */ + spin_lock(&ci->lock); + err =3D __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow); + if (unlikely(err)) { + spin_unlock(&ci->lock); + folio_put(folio); + return ERR_PTR(err); + } + + __folio_set_locked(folio); + __folio_set_swapbacked(folio); + __swap_cache_do_add_folio(ci, folio, entry); + spin_unlock(&ci->lock); + + if (mem_cgroup_swapin_charge_folio(folio, vmf ? 
vmf->vma->vm_mm : NULL, + gfp, entry)) { + spin_lock(&ci->lock); + __swap_cache_do_del_folio(ci, folio, entry, shadow); + spin_unlock(&ci->lock); + folio_unlock(folio); + /* nr_pages refs from swap cache, 1 from allocation */ + folio_put_refs(folio, nr_pages + 1); + count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE); + return ERR_PTR(-ENOMEM); + } + + /* For memsw accounting, swap is uncharged when folio is added to swap ca= che */ + memcg1_swapin(entry, 1 << order); + if (shadow) + workingset_refault(folio, shadow); + + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); + + /* Caller will initiate read into locked new_folio */ + folio_add_lru(folio); + return folio; +} + +/** + * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap ca= che. + * @targ_entry: swap entry indicating the target slot + * @gfp: memory allocation flags + * @orders: allocation orders, must be non zero + * @vmf: fault information + * @mpol: NUMA memory allocation policy to be applied + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE + * + * Allocate a folio in the swap cache for one swap slot, typically before + * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by + * @targ_entry must have a non-zero swap count (swapped out). + * + * Context: Caller must protect the swap device with reference count or lo= cks. + * Return: Returns the folio if allocation succeeded and folio is in the s= wap + * cache. Returns error code if failed due to race, OOM or invalid argumen= ts. + */ +struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp, + unsigned long orders, struct vm_fault *vmf, + struct mempolicy *mpol, pgoff_t ilx) +{ + int order, err; + struct folio *ret; + struct swap_cluster_info *ci; + + ci =3D __swap_entry_to_cluster(targ_entry); + order =3D highest_order(orders); + + /* orders must be non-zero, and must not exceed cluster size. 
*/ + if (WARN_ON_ONCE(!orders || (1UL << order) > SWAPFILE_CLUSTER)) + return ERR_PTR(-EINVAL); + + do { + ret =3D __swap_cache_alloc(ci, targ_entry, gfp, order, + vmf, mpol, ilx); + if (!IS_ERR(ret)) + break; + err =3D PTR_ERR(ret); + if (!order || (err && err !=3D -EBUSY && err !=3D -ENOMEM)) + break; + count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK); + order =3D next_order(&orders, order); + } while (orders); + + return ret; +} + /* * If we are the only user, then try to free up the swap cache. * @@ -547,68 +695,18 @@ static int __swap_cache_prepare_and_add(swp_entry_t e= ntry, return ret; } =20 -/** - * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap ca= che. - * @entry: the swapped out swap entry to be binded to the folio. - * @gfp_mask: memory allocation flags - * @mpol: NUMA memory allocation policy to be applied - * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE - * - * Allocate a folio in the swap cache for one swap slot, typically before - * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by - * @entry must have a non-zero swap count (swapped out). - * Currently only supports order 0. - * - * Context: Caller must protect the swap device with reference count or lo= cks. - * Return: Returns the folio if allocation succeeded and folio is added to - * swap cache. Returns error code if allocation failed due to race or OOM. - */ -struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask, - struct mempolicy *mpol, pgoff_t ilx) -{ - int err; - struct folio *folio; - - /* Allocate a new folio to be added into the swap cache. */ - folio =3D folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); - if (!folio) - return ERR_PTR(-ENOMEM); - - /* - * Try to add the new folio to the swap cache. It returns - * -EEXIST if the entry is already cached. 
- */ - err =3D __swap_cache_prepare_and_add(entry, folio, gfp_mask, false); - if (err) { - folio_put(folio); - return ERR_PTR(err); - } - - return folio; -} - static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp, struct mempolicy *mpol, pgoff_t ilx, struct swap_iocb **plug, bool readahead) { - struct swap_info_struct *si =3D __swap_entry_to_info(entry); struct folio *folio; =20 - /* Check the swap cache again for readahead path. */ - folio =3D swap_cache_get_folio(entry); - if (folio) - return folio; - - /* Skip allocation for unused and bad swap slot for readahead. */ - if (!swap_entry_swapped(si, entry)) - return NULL; - do { folio =3D swap_cache_get_folio(entry); if (folio) return folio; =20 - folio =3D swap_cache_alloc_folio(entry, gfp, mpol, ilx); + folio =3D swap_cache_alloc_folio(entry, gfp, BIT(0), NULL, mpol, ilx); } while (PTR_ERR(folio) =3D=3D -EEXIST); =20 if (IS_ERR_OR_NULL(folio)) diff --git a/mm/zswap.c b/mm/zswap.c index e27f6e96f003..761cd699e0a3 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1000,7 +1000,7 @@ static int zswap_writeback_entry(struct zswap_entry *= entry, return -EEXIST; =20 mpol =3D get_task_policy(current); - folio =3D swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol, + folio =3D swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol, NO_INTERLEAVE_INDEX); put_swap_device(si); =20 --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2F7283BC669; Sun, 17 May 2026 15:39:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; 
b=rv8NtKRts+a1q7YG+AZzLCuH7N3MHKPv7G2t9Q5ykUqF7HNjwflE5ounX5jEEnAqSuB0STR1FtNDFCe+wtUXI3WyB2KNWQo2qvqdEzc51ce/9iRZFUYasXaQjUGcZNWimSmSj7KCccl9ER9tt45DgSqclZ5rFEljYvDFMPWnz/0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; bh=mOw4KFzfxUHWUyGfuj/p8RfUV/fSZEnDA0Q9EKIRo5o=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=jinFx7kMuTQGi8wvcy4R2Pypc7HlXiQt1xPAIn7J5SiQb8F+W4LAoaG+maiG5FWK6ijoUrJ+N/A4PfMFYTWeUuf4fEubQuUjG1R6Ct16yYxVnWQJyCcBC/gvtR06Qkl6smykUCklZ/KCMtdoyvh60bQPM115SrfCSRkwr8F+GR4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=oukHjeUp; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="oukHjeUp" Received: by smtp.kernel.org (Postfix) with ESMTPS id EE0E4C2BD01; Sun, 17 May 2026 15:39:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032389; bh=mOw4KFzfxUHWUyGfuj/p8RfUV/fSZEnDA0Q9EKIRo5o=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=oukHjeUpKe3pSDHFeLQDMlMpP7kPZw6AJvO3zD5P3pDNZKr159UjfFM/M7d65R0b9 FUDJT7s3015qf93qY2UBr8eijJO0uy1932eto4ejk0DiyOR6HcHBxnxXAMsDyq7h4a mSHi6PHqmgOigvwYoMtke3KNP7YUtmJl4ar4SR8QkP76x9eTq7kAlDjdqSQQDwS4dv CISzETlDznAelEA0p2XjOrQuKP9469Kvq1C6cjT37WzWgoubj2v70R4cy3+rqTw8cw +cC+WcgR3Yh0FMLlIKv5Gl4ds829j0n6NTzo8YYMuFnLX4QZ3mqbYrtQ+1bHPw/tst E9R74M5PSmNIA== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id E2231CD4F4D; Sun, 17 May 2026 15:39:48 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:44 +0800 Subject: [PATCH v5 05/12] mm, swap: unify large folio allocation Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: 
List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-5-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , Baolin Wang , Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1779032386; l=19387; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=Ly9lgjUrp+SrzpzuAxH3h8JdaIdG/IvtELS2basb0ss=; b=VGeN7fJjoBLwaEjbI3AggzHCcSHLv18i4S0u4Y36cmeBgdxSQ6zJLbjfWg0/iZOyn/PnHLKN7 AaP7cYhwoOzCckjeD9REntUbvFwX/NId5SRgK+3t6TrK+PK+HMino8s X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: kasong@tencent.com From: Kairui Song Now that direct large order allocation is supported in the swap cache, both anon and shmem can use it instead of implementing their own methods. This unifies the fallback and swap cache check, which also reduces the TOCTOU race window of swap cache state: previously, high order swapin required checking swap cache states first, then allocating and falling back separately. Now all these steps happen in the same compact loop. Order fallback and statistics are also unified, callers just need to check and pass the acceptable order bitmask. There is basically no behavior change. This only makes things more unified and prepares for later commits. 
Cgroup and zero map checks can also be moved into the compact loop, further reducing race windows and redundancy. Signed-off-by: Kairui Song --- mm/memory.c | 80 +++++++------------------------ mm/shmem.c | 102 +++++++++++++--------------------------- mm/swap.h | 30 ++---------- mm/swap_state.c | 143 ++++++++++------------------------------------------= ---- mm/swapfile.c | 3 +- 5 files changed, 79 insertions(+), 279 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 0c9d9c2cbf0e..da891bcce59c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4609,26 +4609,6 @@ static vm_fault_t handle_pte_marker(struct vm_fault = *vmf) return VM_FAULT_SIGBUS; } =20 -static struct folio *__alloc_swap_folio(struct vm_fault *vmf) -{ - struct vm_area_struct *vma =3D vmf->vma; - struct folio *folio; - softleaf_t entry; - - folio =3D vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address); - if (!folio) - return NULL; - - entry =3D softleaf_from_pte(vmf->orig_pte); - if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, - GFP_KERNEL, entry)) { - folio_put(folio); - return NULL; - } - - return folio; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* * Check if the PTEs within a range are contiguous swap entries @@ -4658,8 +4638,6 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_= t *ptep, int nr_pages) */ if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) !=3D nr_pages)) return false; - if (unlikely(non_swapcache_batch(entry, nr_pages) !=3D nr_pages)) - return false; =20 return true; } @@ -4687,16 +4665,14 @@ static inline unsigned long thp_swap_suitable_order= s(pgoff_t swp_offset, return orders; } =20 -static struct folio *alloc_swap_folio(struct vm_fault *vmf) +static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf) { struct vm_area_struct *vma =3D vmf->vma; unsigned long orders; - struct folio *folio; unsigned long addr; softleaf_t entry; spinlock_t *ptl; pte_t *pte; - gfp_t gfp; int order; =20 /* @@ -4704,7 +4680,7 @@ static struct folio 
*alloc_swap_folio(struct vm_fault= *vmf) * maintain the uffd semantics. */ if (unlikely(userfaultfd_armed(vma))) - goto fallback; + return 0; =20 /* * A large swapped out folio could be partially or fully in zswap. We @@ -4712,7 +4688,7 @@ static struct folio *alloc_swap_folio(struct vm_fault= *vmf) * folio. */ if (!zswap_never_enabled()) - goto fallback; + return 0; =20 entry =3D softleaf_from_pte(vmf->orig_pte); /* @@ -4726,12 +4702,12 @@ static struct folio *alloc_swap_folio(struct vm_fau= lt *vmf) vmf->address, orders); =20 if (!orders) - goto fallback; + return 0; =20 pte =3D pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); if (unlikely(!pte)) - goto fallback; + return 0; =20 /* * For do_swap_page, find the highest order where the aligned range is @@ -4747,29 +4723,12 @@ static struct folio *alloc_swap_folio(struct vm_fau= lt *vmf) =20 pte_unmap_unlock(pte, ptl); =20 - /* Try allocating the highest of the remaining orders. */ - gfp =3D vma_thp_gfp_mask(vma); - while (orders) { - addr =3D ALIGN_DOWN(vmf->address, PAGE_SIZE << order); - folio =3D vma_alloc_folio(gfp, order, vma, addr); - if (folio) { - if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, - gfp, entry)) - return folio; - count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE); - folio_put(folio); - } - count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK); - order =3D next_order(&orders, order); - } - -fallback: - return __alloc_swap_folio(vmf); + return orders; } #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ -static struct folio *alloc_swap_folio(struct vm_fault *vmf) +static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf) { - return __alloc_swap_folio(vmf); + return 0; } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ =20 @@ -4875,23 +4834,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (folio) swap_update_readahead(folio, vma, vmf->address); if (!folio) { - if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { - folio =3D alloc_swap_folio(vmf); - if (folio) { - 
/* - * folio is charged, so swapin can only fail due - * to raced swapin and return NULL. - */ - swapcache =3D swapin_folio(entry, folio); - if (swapcache !=3D folio) - folio_put(folio); - folio =3D swapcache; - } - } else { + /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */ + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) + folio =3D swapin_sync(entry, GFP_HIGHUSER_MOVABLE, + thp_swapin_suitable_orders(vmf) | BIT(0), + vmf, NULL, 0); + else folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); - } =20 - if (!folio) { + if (IS_ERR_OR_NULL(folio)) { /* * Back out if somebody else faulted in this pte * while we released the pte lock. @@ -4901,6 +4852,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) ret =3D VM_FAULT_OOM; + folio =3D NULL; goto unlock; } =20 diff --git a/mm/shmem.c b/mm/shmem.c index 6edb23b41bac..77a3e28e5160 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -159,7 +159,7 @@ static unsigned long shmem_default_max_inodes(void) =20 static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp, gfp_t gfp, - struct vm_area_struct *vma, vm_fault_t *fault_type); + struct vm_fault *vmf, vm_fault_t *fault_type); =20 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) { @@ -2017,68 +2017,32 @@ static struct folio *shmem_alloc_and_add_folio(stru= ct vm_fault *vmf, } =20 static struct folio *shmem_swap_alloc_folio(struct inode *inode, - struct vm_area_struct *vma, pgoff_t index, + struct vm_fault *vmf, pgoff_t index, swp_entry_t entry, int order, gfp_t gfp) { + pgoff_t ilx; + struct folio *folio; + struct mempolicy *mpol; struct shmem_inode_info *info =3D SHMEM_I(inode); - struct folio *new, *swapcache; - int nr_pages =3D 1 << order; - gfp_t alloc_gfp =3D gfp; - - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { - if (WARN_ON_ONCE(order)) - return ERR_PTR(-EINVAL); - } else if (order) { - /* - * If uffd is active for 
the vma, we need per-page fault - * fidelity to maintain the uffd semantics, then fallback - * to swapin order-0 folio, as well as for zswap case. - * Any existing sub folio in the swap cache also blocks - * mTHP swapin. - */ - if ((vma && unlikely(userfaultfd_armed(vma))) || - !zswap_never_enabled() || - non_swapcache_batch(entry, nr_pages) !=3D nr_pages) - goto fallback; =20 - alloc_gfp =3D thp_shmem_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp); - } -retry: - new =3D shmem_alloc_folio(alloc_gfp, order, info, index); - if (!new) { - new =3D ERR_PTR(-ENOMEM); - goto fallback; - } + if ((vmf && unlikely(userfaultfd_armed(vmf->vma))) || + !zswap_never_enabled()) + order =3D 0; =20 - if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL, - alloc_gfp, entry)) { - folio_put(new); - new =3D ERR_PTR(-ENOMEM); - goto fallback; - } +again: + mpol =3D shmem_get_pgoff_policy(info, index, order, &ilx); + folio =3D swapin_sync(entry, gfp, BIT(order), vmf, mpol, ilx); + mpol_cond_put(mpol); =20 - swapcache =3D swapin_folio(entry, new); - if (swapcache !=3D new) { - folio_put(new); - if (!swapcache) { - /* - * The new folio is charged already, swapin can - * only fail due to another raced swapin. 
- */ - new =3D ERR_PTR(-EEXIST); - goto fallback; - } + if (!IS_ERR(folio)) + return folio; + + if (order) { + order =3D 0; + goto again; } - return swapcache; -fallback: - /* Order 0 swapin failed, nothing to fallback to, abort */ - if (!order) - return new; - entry.val +=3D index - round_down(index, nr_pages); - alloc_gfp =3D gfp; - nr_pages =3D 1; - order =3D 0; - goto retry; + + return folio; } =20 /* @@ -2265,11 +2229,12 @@ static int shmem_split_large_entry(struct inode *in= ode, pgoff_t index, */ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp, - gfp_t gfp, struct vm_area_struct *vma, + gfp_t gfp, struct vm_fault *vmf, vm_fault_t *fault_type) { struct address_space *mapping =3D inode->i_mapping; - struct mm_struct *fault_mm =3D vma ? vma->vm_mm : NULL; + struct vm_area_struct *vma =3D vmf ? vmf->vma : NULL; + struct mm_struct *fault_mm =3D vmf ? vmf->vma->vm_mm : NULL; struct shmem_inode_info *info =3D SHMEM_I(inode); swp_entry_t swap; softleaf_t index_entry; @@ -2310,20 +2275,19 @@ static int shmem_swapin_folio(struct inode *inode, = pgoff_t index, if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { /* Direct swapin skipping swap cache & readahead */ - folio =3D shmem_swap_alloc_folio(inode, vma, index, - index_entry, order, gfp); - if (IS_ERR(folio)) { - error =3D PTR_ERR(folio); - folio =3D NULL; - goto failed; - } + folio =3D shmem_swap_alloc_folio(inode, vmf, index, + swap, order, gfp); } else { /* Cached swapin only supports order 0 folio */ folio =3D shmem_swapin_cluster(swap, gfp, info, index); - if (!folio) { + } + if (IS_ERR_OR_NULL(folio)) { + if (IS_ERR(folio)) + error =3D PTR_ERR(folio); + else error =3D -ENOMEM; - goto failed; - } + folio =3D NULL; + goto failed; } if (fault_type) { *fault_type |=3D VM_FAULT_MAJOR; @@ -2471,7 +2435,7 @@ static int shmem_get_folio_gfp(struct inode *inode, p= goff_t index, =20 if (xa_is_value(folio)) { error =3D shmem_swapin_folio(inode, index, 
&folio, - sgp, gfp, vma, fault_type); + sgp, gfp, vmf, fault_type); if (error =3D=3D -EEXIST) goto repeat; =20 diff --git a/mm/swap.h b/mm/swap.h index 6774af10a943..8e57e9431624 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -300,7 +300,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t flag, struct mempolicy *mpol, pgoff_t ilx); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf); -struct folio *swapin_folio(swp_entry_t entry, struct folio *folio); +struct folio *swapin_sync(swp_entry_t entry, gfp_t flag, unsigned long ord= ers, + struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx); void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr); =20 @@ -334,24 +335,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry= , int max_nr, return find_next_bit(sis->zeromap, end, start) - start; } =20 -static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) -{ - int i; - - /* - * While allocating a large folio and doing mTHP swapin, we need to - * ensure all entries are not cached, otherwise, the mTHP folio will - * be in conflict with the folio in swap cache. 
- */ - for (i =3D 0; i < max_nr; i++) { - if (swap_cache_has_folio(entry)) - return i; - entry.val++; - } - - return i; -} - #else /* CONFIG_SWAP */ struct swap_iocb; static inline struct swap_cluster_info *swap_cluster_lock( @@ -433,7 +416,9 @@ static inline struct folio *swapin_readahead(swp_entry_= t swp, gfp_t gfp_mask, return NULL; } =20 -static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *= folio) +static inline struct folio *swapin_sync( + swp_entry_t entry, gfp_t flag, unsigned long orders, + struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx) { return NULL; } @@ -493,10 +478,5 @@ static inline int swap_zeromap_batch(swp_entry_t entry= , int max_nr, { return 0; } - -static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) -{ - return 0; -} #endif /* CONFIG_SWAP */ #endif /* _MM_SWAP_H */ diff --git a/mm/swap_state.c b/mm/swap_state.c index 0adb0565bbb1..98c8691826fb 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -238,43 +238,6 @@ void __swap_cache_add_folio(struct swap_cluster_info *= ci, lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); } =20 -/** - * swap_cache_add_folio - Add a folio into the swap cache. - * @folio: The folio to be added. - * @entry: The swap entry corresponding to the folio. - * @shadowp: If a shadow is found, return the shadow. - * - * Add a folio into the swap cache. Will return error if any slot is no - * longer a valid swapped out slot or already occupied by another folio. - * - * Context: Caller must ensure @entry is valid and protect the swap device - * with reference count or locks. 
- */ -static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, - void **shadowp) -{ - int err; - void *shadow =3D NULL; - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long nr_pages =3D folio_nr_pages(folio); - - si =3D __swap_entry_to_info(entry); - ci =3D swap_cluster_lock(si, swp_offset(entry)); - err =3D __swap_cache_add_check(ci, entry, nr_pages, &shadow); - if (err) { - swap_cluster_unlock(ci); - return err; - } - - __swap_cache_add_folio(ci, folio, entry); - swap_cluster_unlock(ci); - if (shadowp) - *shadowp =3D shadow; - - return 0; -} - static void __swap_cache_do_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow) @@ -650,51 +613,6 @@ void swap_update_readahead(struct folio *folio, struct= vm_area_struct *vma, } } =20 -/** - * __swap_cache_prepare_and_add - Prepare the folio and add it to swap cac= he. - * @entry: swap entry to be bound to the folio. - * @folio: folio to be added. - * @gfp: memory allocation flags for charge, can be 0 if @charged if true. - * @charged: if the folio is already charged. - * - * Update the swap_map and add folio as swap cache, typically before swapi= n. - * All swap slots covered by the folio must have a non-zero swap count. - * - * Context: Caller must protect the swap device with reference count or lo= cks. - * Return: 0 if success, error code if failed. 
- */ -static int __swap_cache_prepare_and_add(swp_entry_t entry, - struct folio *folio, - gfp_t gfp, bool charged) -{ - void *shadow; - int ret; - - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - - if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) { - ret =3D -ENOMEM; - goto failed; - } - - ret =3D swap_cache_add_folio(folio, entry, &shadow); - if (ret) - goto failed; - - memcg1_swapin(entry, folio_nr_pages(folio)); - if (shadow) - workingset_refault(folio, shadow); - - /* Caller will initiate read into locked folio */ - folio_add_lru(folio); - return 0; - -failed: - folio_unlock(folio); - return ret; -} - static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp, struct mempolicy *mpol, pgoff_t ilx, struct swap_iocb **plug, bool readahead) @@ -705,7 +623,6 @@ static struct folio *swap_cache_read_folio(swp_entry_t = entry, gfp_t gfp, folio =3D swap_cache_get_folio(entry); if (folio) return folio; - folio =3D swap_cache_alloc_folio(entry, gfp, BIT(0), NULL, mpol, ilx); } while (PTR_ERR(folio) =3D=3D -EEXIST); =20 @@ -722,49 +639,37 @@ static struct folio *swap_cache_read_folio(swp_entry_= t entry, gfp_t gfp, } =20 /** - * swapin_folio - swap-in one or multiple entries skipping readahead. - * @entry: starting swap entry to swap in - * @folio: a new allocated and charged folio + * swapin_sync - swap-in one or multiple entries skipping readahead. + * @entry: swap entry indicating the target slot + * @gfp: memory allocation flags + * @orders: allocation orders + * @vmf: fault information + * @mpol: NUMA memory allocation policy to be applied + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE * - * Reads @entry into @folio, @folio will be added to the swap cache. - * If @folio is a large folio, the @entry will be rounded down to align - * with the folio size. + * This allocates a folio suitable for given @orders, or returns the + * existing folio in the swap cache for @entry. 
This initiates the IO, too, + * if needed. @entry is rounded down if @orders allow large allocation. * - * Return: returns pointer to @folio on success. If folio is a large folio - * and this raced with another swapin, NULL will be returned to allow fall= back - * to order 0. Else, if another folio was already added to the swap cache, - * return that swap cache folio instead. + * Context: Caller must ensure @entry is valid and pin the swap device wit= h refcount. + * Return: Returns the folio on success, error code if failed. */ -struct folio *swapin_folio(swp_entry_t entry, struct folio *folio) +struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orde= rs, + struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx) { - int ret; - struct folio *swapcache; - pgoff_t offset =3D swp_offset(entry); - unsigned long nr_pages =3D folio_nr_pages(folio); - - entry =3D swp_entry(swp_type(entry), round_down(offset, nr_pages)); - for (;;) { - ret =3D __swap_cache_prepare_and_add(entry, folio, 0, true); - if (!ret) { - swap_read_folio(folio, NULL); - break; - } + struct folio *folio; =20 - /* - * Large order allocation needs special handling on - * race: if a smaller folio exists in cache, swapin needs - * to fall back to order 0, and doing a swap cache lookup - * might return a folio that is irrelevant to the faulting - * entry because @entry is aligned down. Just return NULL. 
- */ - if (ret !=3D -EEXIST || nr_pages > 1) - return NULL; + do { + folio =3D swap_cache_get_folio(entry); + if (folio) + return folio; + folio =3D swap_cache_alloc_folio(entry, gfp, orders, vmf, mpol, ilx); + } while (PTR_ERR(folio) =3D=3D -EEXIST); =20 - swapcache =3D swap_cache_get_folio(entry); - if (swapcache) - return swapcache; - } + if (IS_ERR(folio)) + return folio; =20 + swap_read_folio(folio, NULL); return folio; } =20 diff --git a/mm/swapfile.c b/mm/swapfile.c index 08309c1dafa3..4e5a54769e81 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1827,8 +1827,7 @@ void folio_put_swap(struct folio *folio, struct page = *subpage) * do_swap_page() * ... swapoff+swapon * swap_cache_alloc_folio() - * swap_cache_add_folio() - * // check swap_map + * // check swap_map * // verify PTE not changed * * In __swap_duplicate(), the swap_map need to be checked before --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3BFC13BC69E; Sun, 17 May 2026 15:39:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; b=odNFL5TUl6t6w+7vs6PdLYVJqp630CZ+FNU1KgYPZ8Aa7emLTKH7aW5Wp/SFuATeEnbS6l9hfIOSUqY8tt8JQRBN2G/Jio1B56lHv3LBsyVdm4MQaaGrFgo8ThwO4qOigl1dM9EkdepF6GYlMDURI+AJFBfZXU1jTKEjXxcV+Nw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; bh=ywsc30h0pTm+0AkzKZFg/ZB6sIicQ6hYTTCBSgZ6DLU=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; 
b=TBDTuhGcIloqQvlEc0YqFv+QzhdGODT7NYxJZqlblZ1Rzs2DKYINl0SpQariKZR4tAl/+iDV3zAKBCzV5xUu9E34Em14q2VgwI7y768U+hHjz1NAdVH+PuRz5o/T+cRfl4J44+SsGjEMZ20/5H2e9+UrfbDsK8DI/Z/pFbXDMq4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=OrOM3DO2; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="OrOM3DO2" Received: by smtp.kernel.org (Postfix) with ESMTPS id 0D97AC2BCFD; Sun, 17 May 2026 15:39:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032389; bh=ywsc30h0pTm+0AkzKZFg/ZB6sIicQ6hYTTCBSgZ6DLU=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=OrOM3DO2RP6CNRzGi0gTw5Meh8zh9zGkh0Dugc17AyavRrzo1KJBuYsVUkHaPHQrd ICNW8cdqMeCsZ4nNq0OX/CxGLpNULIqbl8Do+pYSaSevlp/AvEa6SEZZcDQMoAxZrn iOCAnu6uWZXm0D4p3thDf2MVn76GJtInzLK+B1o53xPzFFaWTcvpFxbOnvK7+YIMua 9I2+M/HG7e0sxMevLoCxaiewcc6YW3KRcATZlUjUcgcHpxTVBNJEevMotNAh9YhVhn qCClJ9DAvkukF/8PZ5QVxGbbd9SGO5yrEV4yKRtZCo0nXsE7W9Zif9PB1LzehpH2nK 7qJ11Lg9vWh3w== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 04059CD4F47; Sun, 17 May 2026 15:39:49 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:45 +0800 Subject: [PATCH v5 06/12] mm/memcg, swap: tidy up cgroup v1 memsw swap helpers Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-6-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , Baolin Wang , 
Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1779032386; l=9795; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=gCgUUE/Etfz8T9X0Mywmw2qi0UZBkmFa6lWPz7BcjGo=; b=zSehdC7R81v31OVC9Y5cQntcQByyo/27QgkaU/0kJ+Ss/mYlUb4eBRTiM0MtTVnh5TTV5GRbp h9jcqFma61UDq9qXqQV5/Jm2JBtOrvWK7radt5LhwjGxhF+dsYeJkqS X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: kasong@tencent.com From: Kairui Song The cgroup v1 swap helpers always operate on swap cache folios whose swap entry is stable: the folio is locked and in the swap cache. There is no need to pass the swap entry or page count as separate parameters when they can be derived from the folio itself. Simplify the redundant parameters and add sanity checks to document the required preconditions. Also rename memcg1_swapout to __memcg1_swapout to indicate it requires special calling context: the folio must be isolated and dying, and the call must be made with interrupts disabled. No functional change. 
Acked-by: Chris Li Signed-off-by: Kairui Song --- include/linux/memcontrol.h | 8 ++++---- include/linux/swap.h | 10 ++++------ mm/huge_memory.c | 2 +- mm/memcontrol-v1.c | 33 ++++++++++++++++++++------------- mm/memcontrol.c | 9 ++++----- mm/swap_state.c | 4 ++-- mm/swapfile.c | 2 +- mm/vmscan.c | 2 +- 8 files changed, 37 insertions(+), 33 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index dc3fa687759b..7d08128de1fd 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1899,8 +1899,8 @@ static inline void mem_cgroup_exit_user_fault(void) current->in_user_fault =3D 0; } =20 -void memcg1_swapout(struct folio *folio, swp_entry_t entry); -void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages); +void __memcg1_swapout(struct folio *folio); +void memcg1_swapin(struct folio *folio); =20 #else /* CONFIG_MEMCG_V1 */ static inline @@ -1929,11 +1929,11 @@ static inline void mem_cgroup_exit_user_fault(void) { } =20 -static inline void memcg1_swapout(struct folio *folio, swp_entry_t entry) +static inline void __memcg1_swapout(struct folio *folio) { } =20 -static inline void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages) +static inline void memcg1_swapin(struct folio *folio) { } =20 diff --git a/include/linux/swap.h b/include/linux/swap.h index aa89e1d30a77..6b3acdf9bdd4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -576,13 +576,12 @@ static inline void folio_throttle_swaprate(struct fol= io *folio, gfp_t gfp) #endif =20 #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP) -int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry); -static inline int mem_cgroup_try_charge_swap(struct folio *folio, - swp_entry_t entry) +int __mem_cgroup_try_charge_swap(struct folio *folio); +static inline int mem_cgroup_try_charge_swap(struct folio *folio) { if (mem_cgroup_disabled()) return 0; - return __mem_cgroup_try_charge_swap(folio, entry); + return __mem_cgroup_try_charge_swap(folio); } 
=20 extern void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_= pages); @@ -596,8 +595,7 @@ static inline void mem_cgroup_uncharge_swap(swp_entry_t= entry, unsigned int nr_p extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg); extern bool mem_cgroup_swap_full(struct folio *folio); #else -static inline int mem_cgroup_try_charge_swap(struct folio *folio, - swp_entry_t entry) +static inline int mem_cgroup_try_charge_swap(struct folio *folio) { return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c565b2a651e0..42b86e8ab7c0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -4430,7 +4430,7 @@ void deferred_split_folio(struct folio *folio, bool p= artially_mapped) =20 /* * Exclude swapcache: originally to avoid a corrupt deferred split - * queue. Nowadays that is fully prevented by memcg1_swapout(); + * queue. Nowadays that is fully prevented by __memcg1_swapout(); * but if page reclaim is already handling the same folio, it is * unnecessary to handle it again in the shrinker, so excluding * swapcache here may still be a useful optimization. diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 433bba9dfe71..36c507d81dc5 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -604,18 +604,23 @@ void memcg1_commit_charge(struct folio *folio, struct= mem_cgroup *memcg) } =20 /** - * memcg1_swapout - transfer a memsw charge to swap + * __memcg1_swapout - transfer a memsw charge to swap * @folio: folio whose memsw charge to transfer - * @entry: swap entry to move the charge to * - * Transfer the memsw charge of @folio to @entry. + * Transfer the memsw charge of @folio to the swap entry stored in + * folio->swap. + * + * Context: folio must be isolated, unmapped, locked and is just about + * to be freed, and caller must disable IRQs. 
*/ -void memcg1_swapout(struct folio *folio, swp_entry_t entry) +void __memcg1_swapout(struct folio *folio) { struct mem_cgroup *memcg, *swap_memcg; struct obj_cgroup *objcg; unsigned int nr_entries; =20 + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); =20 @@ -641,7 +646,7 @@ void memcg1_swapout(struct folio *folio, swp_entry_t en= try) swap_memcg =3D mem_cgroup_private_id_get_online(memcg, nr_entries); mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries); =20 - swap_cgroup_record(folio, mem_cgroup_private_id(swap_memcg), entry); + swap_cgroup_record(folio, mem_cgroup_private_id(swap_memcg), folio->swap); =20 folio_unqueue_deferred_split(folio); folio->memcg_data =3D 0; @@ -671,18 +676,20 @@ void memcg1_swapout(struct folio *folio, swp_entry_t = entry) obj_cgroup_put(objcg); } =20 -/* +/** * memcg1_swapin - uncharge swap slot - * @entry: the first swap entry for which the pages are charged - * @nr_pages: number of pages which will be uncharged + * @folio: folio being swapped in * - * Call this function after successfully adding the charged page to swapca= che. + * Call this function after successfully adding the charged + * folio to swapcache. * - * Note: This function assumes the page for which swap slot is being uncha= rged - * is order 0 page. + * Context: The folio has to be in swap cache and locked. */ -void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages) +void memcg1_swapin(struct folio *folio) { + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + /* * Cgroup1's unified memory+swap counter has been charged with the * new swapcache page, finish the transfer by uncharging the swap @@ -701,7 +708,7 @@ void memcg1_swapin(swp_entry_t entry, unsigned int nr_p= ages) * let's not wait for it. 
The page already received a * memory+swap charge, drop the swap entry duplicate. */ - mem_cgroup_uncharge_swap(entry, nr_pages); + mem_cgroup_uncharge_swap(folio->swap, folio_nr_pages(folio)); } } =20 diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d978e18b9b2d..a28a68eed7ba 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5464,13 +5464,12 @@ int __init mem_cgroup_init(void) /** * __mem_cgroup_try_charge_swap - try charging swap space for a folio * @folio: folio being added to swap - * @entry: swap entry to charge * - * Try to charge @folio's memcg for the swap space at @entry. + * Try to charge @folio's memcg for the swap space at folio->swap. * * Returns 0 on success, -ENOMEM on failure. */ -int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry) +int __mem_cgroup_try_charge_swap(struct folio *folio) { unsigned int nr_pages =3D folio_nr_pages(folio); struct page_counter *counter; @@ -5487,7 +5486,7 @@ int __mem_cgroup_try_charge_swap(struct folio *folio,= swp_entry_t entry) =20 rcu_read_lock(); memcg =3D obj_cgroup_memcg(objcg); - if (!entry.val) { + if (!folio_test_swapcache(folio)) { memcg_memory_event(memcg, MEMCG_SWAP_FAIL); rcu_read_unlock(); return 0; @@ -5506,7 +5505,7 @@ int __mem_cgroup_try_charge_swap(struct folio *folio,= swp_entry_t entry) } mod_memcg_state(memcg, MEMCG_SWAP, nr_pages); =20 - swap_cgroup_record(folio, mem_cgroup_private_id(memcg), entry); + swap_cgroup_record(folio, mem_cgroup_private_id(memcg), folio->swap); =20 return 0; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 98c8691826fb..7a80494fa37f 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -455,8 +455,8 @@ static struct folio *__swap_cache_alloc(struct swap_clu= ster_info *ci, return ERR_PTR(-ENOMEM); } =20 - /* For memsw accounting, swap is uncharged when folio is added to swap ca= che */ - memcg1_swapin(entry, 1 << order); + /* memsw uncharges swap when folio is added to swap cache */ + memcg1_swapin(folio); if (shadow) 
workingset_refault(folio, shadow); =20 diff --git a/mm/swapfile.c b/mm/swapfile.c index 4e5a54769e81..5c8bb15719bf 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1731,7 +1731,7 @@ int folio_alloc_swap(struct folio *folio) } =20 /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */ - if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap))) + if (unlikely(mem_cgroup_try_charge_swap(folio))) swap_cache_del_folio(folio); =20 if (unlikely(!folio_test_swapcache(folio))) diff --git a/mm/vmscan.c b/mm/vmscan.c index b3e555561417..924c84326551 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -737,7 +737,7 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, =20 if (reclaimed && !mapping_exiting(mapping)) shadow =3D workingset_eviction(folio, target_memcg); - memcg1_swapout(folio, swap); + __memcg1_swapout(folio); __swap_cache_del_folio(ci, folio, swap, shadow); swap_cluster_unlock_irq(ci); } else { --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4AEA43BE178; Sun, 17 May 2026 15:39:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; b=ceuQFBk5aQuoQJRPq0YjAg7Qncj0fc8o7qzFsGpnBpxV5Fx08Cpcrgk7jU4ADoGtnYIwUvAnbSUmM8yn77MxEbRTARr4RgTxztNg2/4Ov1Vdl8hyNhc+Enk59AbEXrMYVvOYQ7ePb7c2O1HVRGZ0YOv676bkPB4vy7me1Chwq+w= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; bh=IA+n1x4TGGrmvV2VnYh711ckQlZ8TYsnmX7nmSi5qbk=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; 
b=lFRUGXHFYZD3GxdTBSmUKn4ypLQHJS2YvunEqPB52szcxV+91xXsb3V0Wphelu6f7AZRHyJoQE9IZ5Bu0X8WY8n4Pug+TDlqpAc83sNohRTv6Np65VB35seeMqFHQK5OsKyNxxLi2yeCi8JCUlQXCkR/ARcIOGOtOu94GedOB44= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Hx4xeej5; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Hx4xeej5" Received: by smtp.kernel.org (Postfix) with ESMTPS id 221E8C2BCFB; Sun, 17 May 2026 15:39:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032389; bh=IA+n1x4TGGrmvV2VnYh711ckQlZ8TYsnmX7nmSi5qbk=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=Hx4xeej5OriGbAD9fV3Nc/EvcAJEU39k/zjYtwDaXN2NIONHfSdw+g+czq5Zo0AyN Om0cYSHb//cEVOP7A+tXQ20YPnj7KFHotcw0hfS/9vMrMClBZQOLHi3FlxPya/Yiqp 4Ab7XGCqRV/QwhxJyRbpkFhQA6HxEbOpurKl2YWZPp8Zo546tSPigec/aHxQt+xxbR RsipPcGcmc02PYzo08dyL1LUfghXO/61tMf7KaoYdL3XFnjAiEHrPMcHzKC2zrKVEt LXjTlpkkUXqHypvJuXk2Vvea2/5xxe2PxzqdSkplqmZaPIR2qbZqEwX1qgyo8Dc7ZX CzyVTZq6ANLIQ== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 16ADCCD4F4A; Sun, 17 May 2026 15:39:49 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:46 +0800 Subject: [PATCH v5 07/12] mm, swap: support flexible batch freeing of slots in different memcgs Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-7-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , 
Baolin Wang , Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1779032386; l=2512; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=ZD2081+ctfRfaBeb4mp+5wkRHhM7k1bXJ2YEQVy37aA=; b=/gV2rB59+sfFgcX/Dv/Tg5we0Z1EcHk9+EDjavgbQFtp9YpTqfm8oJ7yHIq0BedOQfVQ6HnSG GNhYM36cxJHBEEoKcNtjUh75tuEkjOcoLjMNFHzWMJEEtS6uiMtjPPG X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: kasong@tencent.com From: Kairui Song Instead of requiring the caller to ensure all slots are in the same memcg, make the function handle different memcgs at once. This is both a micro optimization and required for removing the memcg lookup in the page table layer, so it can be unified at the swap layer. We are not removing the memcg lookup in the page table in this commit. It has to be done after the memcg lookup is deferred to the swap layer. 
Acked-by: Chris Li Signed-off-by: Kairui Song --- mm/swapfile.c | 33 +++++++++++++++++++++++++++++---- 1 file changed, 29 insertions(+), 4 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 5c8bb15719bf..c9c80ba9252b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1873,21 +1873,46 @@ void __swap_cluster_free_entries(struct swap_info_s= truct *si, unsigned int ci_start, unsigned int nr_pages) { unsigned long old_tb; + unsigned int type =3D si->type; + unsigned short batch_id =3D 0, id_cur; unsigned int ci_off =3D ci_start, ci_end =3D ci_start + nr_pages; - unsigned long offset =3D cluster_offset(si, ci) + ci_start; + unsigned long ci_head =3D cluster_offset(si, ci); + unsigned int batch_off =3D ci_off; + swp_entry_t entry; =20 VM_WARN_ON(ci->count < nr_pages); =20 ci->count -=3D nr_pages; do { old_tb =3D __swap_table_get(ci, ci_off); - /* Release the last ref, or after swap cache is dropped */ + /* + * Freeing is done after release of the last swap count + * ref, or after swap cache is dropped + */ VM_WARN_ON(!swp_tb_is_shadow(old_tb) || __swp_tb_get_count(old_tb) > 1); __swap_table_set(ci, ci_off, null_to_swp_tb()); + + /* + * Uncharge swap slots by memcg in batches. Consecutive + * slots with the same cgroup id are uncharged together. 
+ */ + entry =3D swp_entry(type, ci_head + ci_off); + id_cur =3D lookup_swap_cgroup_id(entry); + if (batch_id !=3D id_cur) { + if (batch_id) + mem_cgroup_uncharge_swap(swp_entry(type, ci_head + batch_off), + ci_off - batch_off); + batch_id =3D id_cur; + batch_off =3D ci_off; + } } while (++ci_off < ci_end); =20 - mem_cgroup_uncharge_swap(swp_entry(si->type, offset), nr_pages); - swap_range_free(si, offset, nr_pages); + if (batch_id) { + mem_cgroup_uncharge_swap(swp_entry(type, ci_head + batch_off), + ci_off - batch_off); + } + + swap_range_free(si, ci_head + ci_start, nr_pages); swap_cluster_assert_empty(ci, ci_start, nr_pages, false); =20 if (!ci->count) --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6EB7A3BED5C; Sun, 17 May 2026 15:39:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; b=iCDRbu9re0DPRWAdKfnrCHPIEST7U9XcanYF517KeaamvBolB72DKJak3PDKU1XEVsS0a6guztYnSTjPsxDHh5Vps8C00atoCHGN/CMfAGp1OGHSW/jYVhMOBFX3N6VDMICAFYW+5u438dFofSd8oHbpF3TRKe/T6AKndm9SQnc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; bh=6JfP+tnQd6rB8o/FJvWGSHr6xIh391ALgAvc5yXD7nA=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=k+ElfJexbP1I3jaqdSLdtrhakcEzNzdH3+zoXEzZmMkLECbvfRoUBpv2hoOTzZ1431V9E1I6Jk9ZbahX6wveLAyEfi0tY/69ls9Z6INgA6/ez3nNI1r6pCWwraNDOZhhWuTRxVGL1KBSt0feTJRwj5qrW6gRmZclna1rCYNlJMg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=i2yeJtCI; arc=none smtp.client-ip=10.30.226.201 
Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="i2yeJtCI" Received: by smtp.kernel.org (Postfix) with ESMTPS id 3532AC2BCB0; Sun, 17 May 2026 15:39:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032389; bh=6JfP+tnQd6rB8o/FJvWGSHr6xIh391ALgAvc5yXD7nA=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=i2yeJtCIqsbrmFdGEcIoioNvhuZRJhmrYuJjptES4ad4z2Y93E4TLQ/q/qlqcW7yw KC1rz/cvLg+Hd3DHWJtaO37Esz5IZotcF2MgHQXITi7lorHHQTh3LGsfMBjkFOfAVC ZW0hZaZvkGQi8lsWPnPrhuFujuv1HQe2hoWu/udaRrWke//dD+fvh7abjEz8psWvbD mrKsO7DC4T1+797JLlD/CU2CKQpBHE7RgLcLo/XRnpNYfTSSzlB5HlCato11HyTXdU 3hdcxR0I74j1lcvYX5JlOLG2VChzPx/tHye9cuJ/P5cpDmRdU3J3HZgCHXs5qhDrGU FQYDPE0VUgOmw== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2964ECD4F3D; Sun, 17 May 2026 15:39:49 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:47 +0800 Subject: [PATCH v5 08/12] mm, swap: delay and unify memcg lookup and charging for swapin Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-8-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , Baolin Wang , Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; 
a=ed25519-sha256; t=1779032386; l=7826; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=X60Ww9iDXTNSFaztz8zRIlTgTyJ5DmPDpgWaFQWaCho=; b=FpkTnUvkHSeT5V9yi8WrIyeNzt70+djoTZHfnvcwMlUtDZk6DUZ7kQIhKqvoUC3ftTwIxD2G6 SSmCkoG1hxQAb02TKt3uSUW10vTzE/Q96wMp7q1jWFOBiAZ3yKzo9ih X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: kasong@tencent.com From: Kairui Song Instead of checking the cgroup private ID during page table walk in swap_pte_batch(), move the memcg lookup into __swap_cache_add_check() under the cluster lock. The first pre-alloc check is speculative and skips the memcg check since the post-alloc stable check ensures all slots covered by the folio belong to the same memcg. It is very rare for contiguous and aligned entries across a contiguous region of a page table of the same process or shmem mapping to belong to different memcgs. This also prepares for recording the memcg info in the cluster's table. Also make the order check and fallback more compact. There should be no user-observable behavior change. 
Acked-by: Chris Li Signed-off-by: Kairui Song --- include/linux/memcontrol.h | 6 +++--- mm/internal.h | 10 +--------- mm/memcontrol.c | 10 ++++------ mm/swap_state.c | 28 +++++++++++++++++++--------- 4 files changed, 27 insertions(+), 27 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7d08128de1fd..a013f37f24aa 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -646,8 +646,8 @@ static inline int mem_cgroup_charge(struct folio *folio= , struct mm_struct *mm, =20 int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp); =20 -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *= mm, - gfp_t gfp, swp_entry_t entry); +int mem_cgroup_swapin_charge_folio(struct folio *folio, unsigned short id, + struct mm_struct *mm, gfp_t gfp); =20 void __mem_cgroup_uncharge(struct folio *folio); =20 @@ -1137,7 +1137,7 @@ static inline int mem_cgroup_charge_hugetlb(struct fo= lio* folio, gfp_t gfp) } =20 static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, - struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) + unsigned short id, struct mm_struct *mm, gfp_t gfp) { return 0; } diff --git a/mm/internal.h b/mm/internal.h index 5a2ddcf68e0b..9d2fec696bd6 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -451,24 +451,16 @@ static inline int swap_pte_batch(pte_t *start_ptep, i= nt max_nr, pte_t pte) { pte_t expected_pte =3D pte_next_swp_offset(pte); const pte_t *end_ptep =3D start_ptep + max_nr; - const softleaf_t entry =3D softleaf_from_pte(pte); pte_t *ptep =3D start_ptep + 1; - unsigned short cgroup_id; =20 VM_WARN_ON(max_nr < 1); - VM_WARN_ON(!softleaf_is_swap(entry)); + VM_WARN_ON(!softleaf_is_swap(softleaf_from_pte(pte))); =20 - cgroup_id =3D lookup_swap_cgroup_id(entry); while (ptep < end_ptep) { - softleaf_t entry; - pte =3D ptep_get(ptep); =20 if (!pte_same(pte, expected_pte)) break; - entry =3D softleaf_from_pte(pte); - if (lookup_swap_cgroup_id(entry) !=3D cgroup_id) - break; 
expected_pte =3D pte_next_swp_offset(expected_pte); ptep++; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a28a68eed7ba..4f940cf22ffe 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5070,27 +5070,25 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, = gfp_t gfp) =20 /** * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swa= pin. - * @folio: folio to charge. + * @folio: the folio to charge + * @id: memory cgroup id * @mm: mm context of the victim * @gfp: reclaim mode - * @entry: swap entry for which the folio is allocated * * This function charges a folio allocated for swapin. Please call this be= fore * adding the folio to the swapcache. * * Returns 0 on success. Otherwise, an error code is returned. */ -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *= mm, - gfp_t gfp, swp_entry_t entry) +int mem_cgroup_swapin_charge_folio(struct folio *folio, unsigned short id, + struct mm_struct *mm, gfp_t gfp) { struct mem_cgroup *memcg; - unsigned short id; int ret; =20 if (mem_cgroup_disabled()) return 0; =20 - id =3D lookup_swap_cgroup_id(entry); rcu_read_lock(); memcg =3D mem_cgroup_from_private_id(id); if (!memcg || !css_tryget_online(&memcg->css)) diff --git a/mm/swap_state.c b/mm/swap_state.c index 7a80494fa37f..bdd949ae0044 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -142,17 +142,21 @@ void *swap_cache_get_shadow(swp_entry_t entry) * @ci: The locked swap cluster * @targ_entry: The target swap entry to check, will be rounded down by @nr * @nr: Number of slots to check, must be a power of 2 - * @shadowp: Returns the shadow value if one exists in the range. + * @shadowp: Returns the shadow value if one exists in the range + * @memcg_id: Returns the memory cgroup id, NULL to ignore cgroup check * * Check if all slots covered by given range have a swap count >=3D 1. - * Retrieves the shadow if there is one. + * Retrieves the shadow if there is one. 
If @memcg_id is not NULL, also + * checks if all slots belong to the same cgroup and return the cgroup + * private id. * * Context: Caller must lock the cluster. * Return: 0 if success, error code if failed. */ static int __swap_cache_add_check(struct swap_cluster_info *ci, swp_entry_t targ_entry, - unsigned long nr, void **shadowp) + unsigned long nr, void **shadowp, + unsigned short *memcg_id) { unsigned int ci_off, ci_end; unsigned long old_tb; @@ -172,19 +176,24 @@ static int __swap_cache_add_check(struct swap_cluster= _info *ci, return -EEXIST; if (!__swp_tb_get_count(old_tb)) return -ENOENT; - if (swp_tb_is_shadow(old_tb) && shadowp) + if (shadowp && swp_tb_is_shadow(old_tb)) *shadowp =3D swp_tb_to_shadow(old_tb); + if (memcg_id) + *memcg_id =3D lookup_swap_cgroup_id(targ_entry); =20 if (nr =3D=3D 1) return 0; =20 + targ_entry.val =3D round_down(targ_entry.val, nr); ci_off =3D round_down(ci_off, nr); ci_end =3D ci_off + nr; do { old_tb =3D __swap_table_get(ci, ci_off); if (unlikely(swp_tb_is_folio(old_tb) || - !__swp_tb_get_count(old_tb))) + !__swp_tb_get_count(old_tb) || + (memcg_id && *memcg_id !=3D lookup_swap_cgroup_id(targ_entry)))) return -EBUSY; + targ_entry.val++; } while (++ci_off < ci_end); =20 return 0; @@ -400,6 +409,7 @@ static struct folio *__swap_cache_alloc(struct swap_clu= ster_info *ci, swp_entry_t entry; struct folio *folio; void *shadow =3D NULL; + unsigned short memcg_id; unsigned long address, nr_pages =3D 1UL << order; struct vm_area_struct *vma =3D vmf ? 
vmf->vma : NULL; =20 @@ -408,7 +418,7 @@ static struct folio *__swap_cache_alloc(struct swap_clu= ster_info *ci, =20 /* Check if the slot and range are available, skip allocation if not */ spin_lock(&ci->lock); - err =3D __swap_cache_add_check(ci, targ_entry, nr_pages, NULL); + err =3D __swap_cache_add_check(ci, targ_entry, nr_pages, NULL, NULL); spin_unlock(&ci->lock); if (unlikely(err)) return ERR_PTR(err); @@ -431,7 +441,7 @@ static struct folio *__swap_cache_alloc(struct swap_clu= ster_info *ci, =20 /* Double check the range is still not in conflict */ spin_lock(&ci->lock); - err =3D __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow); + err =3D __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow, &memcg_= id); if (unlikely(err)) { spin_unlock(&ci->lock); folio_put(folio); @@ -443,8 +453,8 @@ static struct folio *__swap_cache_alloc(struct swap_clu= ster_info *ci, __swap_cache_do_add_folio(ci, folio, entry); spin_unlock(&ci->lock); =20 - if (mem_cgroup_swapin_charge_folio(folio, vmf ? vmf->vma->vm_mm : NULL, - gfp, entry)) { + if (mem_cgroup_swapin_charge_folio(folio, memcg_id, + vmf ? 
vmf->vma->vm_mm : NULL, gfp)) { spin_lock(&ci->lock); __swap_cache_do_del_folio(ci, folio, entry, shadow); spin_unlock(&ci->lock); --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7276F3BED76; Sun, 17 May 2026 15:39:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; b=SY2FvDpTjNmSFLQknZjPs63NfjNqvD1Tt7zXsyDjQNq8GbgzGF91naGeIHUvJVbfmOTR74ziAM4jkMezTNwj5nSeS/WEAuZYOFUr5LY/TeNPh+EPRFosuO7zOreDVq5LuzcYfalbG8t1bNSdcW6/Vgq6HR1Tc13zLzdt3rHyet4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; bh=ufU1ySsOqiRZPnCHz7wte/G4BO/nmQmUUaeDl7sofnQ=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=XL4QwcC71yrYd1sLLq2ZJmIfoYl7e7Lc5l6T/0niAovOHq9wXCWca/ac8js4s1VB8JFGaTHW1yj8TWyWGeVv6YoL114XEjkWvPWwD7YStmXl58KHKezLmQSQfW6uHO8jtxdidUy93490CvK73mHW/BJz1xZRQRT0Ti846R3U+Qg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=i4d9ocNS; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="i4d9ocNS" Received: by smtp.kernel.org (Postfix) with ESMTPS id 4B0B2C2BCB8; Sun, 17 May 2026 15:39:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032389; bh=ufU1ySsOqiRZPnCHz7wte/G4BO/nmQmUUaeDl7sofnQ=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=i4d9ocNSV+J2qEOZBzsNaaBcr1rrZV78Pzms/udoUFzqNM/dHWNtoGEKRooDbhjKn 
9bdwZ26RQ6qDr5Nfgaao/o4aM/0/X68HzhqcUqWgbAg2SOCotz84w/+hdHu1smTq2N LCA3xMEqd4iLsqQdMwQzsR0C5gQFei1Eu9qGGM5p+/tWUYmazvVmGV9c1OGK2mChyv 5rsohQR+L/aKNQAOeIcYoVdvu3eJRfBFe5bFdOUUmmW/TB0XKrIUeBD9/wViZ7r0W3 rLBNA88zt17t4LKVSfSrNSxwLZRwcGod906+px5oErBZbs3yl/ItfAvoE4NLdno70X NMlKjCKxj5xHQ== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3ED5CCD4F4B; Sun, 17 May 2026 15:39:49 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:48 +0800 Subject: [PATCH v5 09/12] mm, swap: consolidate cluster allocation helpers Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-9-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , Baolin Wang , Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1779032386; l=7205; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=c3oYSj0Wv0VB37K97Td0A16y0BTkJDC6ZCRW5Z0N+vc=; b=zqRQIF0gJ1NS70Y3n4NafR8BRy6VcNbWVCBT3V3wQlQir61v4EF1nqvCQihtd/BI6Arz34fXN /X7F3BMdUmBCuMh2dNHAb2GzYmlMa2rgpiG/Xai0f4fA0JlMYKPDPis X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: 
kasong@tencent.com From: Kairui Song Swap cluster table management is spread across several narrow helpers. As a result, the allocation and fallback sequences are open-coded in multiple places. A few more per-cluster tables will be added soon, so avoid duplicating these sequences per table type. Fold the existing pairs into cluster-oriented helpers, and rename for consistency. No functional change, only a few sanity checks are slightly adjusted. Acked-by: Chris Li Signed-off-by: Kairui Song --- mm/swapfile.c | 110 ++++++++++++++++++++++++++----------------------------= ---- 1 file changed, 49 insertions(+), 61 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index c9c80ba9252b..7740ba99f87e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -411,20 +411,7 @@ static inline unsigned int cluster_offset(struct swap_= info_struct *si, return cluster_index(si, ci) * SWAPFILE_CLUSTER; } =20 -static struct swap_table *swap_table_alloc(gfp_t gfp) -{ - struct folio *folio; - - if (!SWP_TABLE_USE_PAGE) - return kmem_cache_zalloc(swap_table_cachep, gfp); - - folio =3D folio_alloc(gfp | __GFP_ZERO, 0); - if (folio) - return folio_address(folio); - return NULL; -} - -static void swap_table_free_folio_rcu_cb(struct rcu_head *head) +static void swap_cluster_free_table_folio_rcu_cb(struct rcu_head *head) { struct folio *folio; =20 @@ -432,15 +419,46 @@ static void swap_table_free_folio_rcu_cb(struct rcu_h= ead *head) folio_put(folio); } =20 -static void swap_table_free(struct swap_table *table) +static void swap_cluster_free_table(struct swap_cluster_info *ci) { + struct swap_table *table; + + table =3D (struct swap_table *)rcu_dereference_protected(ci->table, true); + if (!table) + return; + + rcu_assign_pointer(ci->table, NULL); if (!SWP_TABLE_USE_PAGE) { kmem_cache_free(swap_table_cachep, table); return; } =20 call_rcu(&(folio_page(virt_to_folio(table), 0)->rcu_head), - swap_table_free_folio_rcu_cb); + swap_cluster_free_table_folio_rcu_cb); +} + +static int 
swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gf= p) +{ + struct swap_table *table =3D NULL; + struct folio *folio; + + /* The cluster must be empty and not on any list during allocation. */ + VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); + if (rcu_access_pointer(ci->table)) + return 0; + + if (SWP_TABLE_USE_PAGE) { + folio =3D folio_alloc(gfp | __GFP_ZERO, 0); + if (folio) + table =3D folio_address(folio); + } else { + table =3D kmem_cache_zalloc(swap_table_cachep, gfp); + } + if (!table) + return -ENOMEM; + + rcu_assign_pointer(ci->table, table); + return 0; } =20 /* @@ -471,27 +489,15 @@ static void swap_cluster_assert_empty(struct swap_clu= ster_info *ci, WARN_ON_ONCE(nr =3D=3D SWAPFILE_CLUSTER && ci->extend_table); } =20 -static void swap_cluster_free_table(struct swap_cluster_info *ci) -{ - struct swap_table *table; - - /* Only empty cluster's table is allow to be freed */ - lockdep_assert_held(&ci->lock); - table =3D (void *)rcu_dereference_protected(ci->table, true); - rcu_assign_pointer(ci->table, NULL); - - swap_table_free(table); -} - /* * Allocate swap table for one cluster. Attempt an atomic allocation first, * then fallback to sleeping allocation. */ static struct swap_cluster_info * -swap_cluster_alloc_table(struct swap_info_struct *si, +swap_cluster_populate(struct swap_info_struct *si, struct swap_cluster_info *ci) { - struct swap_table *table; + int ret; =20 /* * Only cluster isolation from the allocator does table allocation. @@ -502,14 +508,9 @@ swap_cluster_alloc_table(struct swap_info_struct *si, lockdep_assert_held(&si->global_cluster_lock); lockdep_assert_held(&ci->lock); =20 - /* The cluster must be free and was just isolated from the free list. 
*/ - VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); - - table =3D swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); - if (table) { - rcu_assign_pointer(ci->table, table); + if (!swap_cluster_alloc_table(ci, __GFP_HIGH | __GFP_NOMEMALLOC | + __GFP_NOWARN)) return ci; - } =20 /* * Try a sleep allocation. Each isolated free cluster may cause @@ -521,7 +522,8 @@ swap_cluster_alloc_table(struct swap_info_struct *si, spin_unlock(&si->global_cluster_lock); local_unlock(&percpu_swap_cluster.lock); =20 - table =3D swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL); + ret =3D swap_cluster_alloc_table(ci, __GFP_HIGH | __GFP_NOMEMALLOC | + GFP_KERNEL); =20 /* * Back to atomic context. We might have migrated to a new CPU with a @@ -536,20 +538,11 @@ swap_cluster_alloc_table(struct swap_info_struct *si, spin_lock(&si->global_cluster_lock); spin_lock(&ci->lock); =20 - /* Nothing except this helper should touch a dangling empty cluster. */ - if (WARN_ON_ONCE(cluster_table_is_alloced(ci))) { - if (table) - swap_table_free(table); - return ci; - } - - if (!table) { + if (ret) { move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE); spin_unlock(&ci->lock); return NULL; } - - rcu_assign_pointer(ci->table, table); return ci; } =20 @@ -621,12 +614,11 @@ static struct swap_cluster_info *isolate_lock_cluster( } spin_unlock(&si->lock); =20 - if (found && !cluster_table_is_alloced(found)) { - /* Only an empty free cluster's swap table can be freed. */ - VM_WARN_ON_ONCE(flags !=3D CLUSTER_FLAG_FREE); + /* Cluster's table is freed when and only when it's on the free list. 
*/ + if (found && flags =3D=3D CLUSTER_FLAG_FREE) { VM_WARN_ON_ONCE(list !=3D &si->free_clusters); - VM_WARN_ON_ONCE(!cluster_is_empty(found)); - return swap_cluster_alloc_table(si, found); + VM_WARN_ON_ONCE(cluster_table_is_alloced(found)); + return swap_cluster_populate(si, found); } =20 return found; @@ -769,7 +761,6 @@ static int swap_cluster_setup_bad_slot(struct swap_info= _struct *si, unsigned int ci_off =3D offset % SWAPFILE_CLUSTER; unsigned long idx =3D offset / SWAPFILE_CLUSTER; struct swap_cluster_info *ci; - struct swap_table *table; int ret =3D 0; =20 /* si->max may got shrunk by swap swap_activate() */ @@ -790,12 +781,9 @@ static int swap_cluster_setup_bad_slot(struct swap_inf= o_struct *si, } =20 ci =3D cluster_info + idx; - if (!ci->table) { - table =3D swap_table_alloc(GFP_KERNEL); - if (!table) - return -ENOMEM; - rcu_assign_pointer(ci->table, table); - } + /* Need to allocate swap table first for initial bad slot marking. */ + if (!ci->count && swap_cluster_alloc_table(ci, GFP_KERNEL)) + return -ENOMEM; spin_lock(&ci->lock); /* Check for duplicated bad swap slots. 
*/ if (__swap_table_xchg(ci, ci_off, SWP_TB_BAD) !=3D SWP_TB_NULL) { @@ -3054,7 +3042,7 @@ static void free_swap_cluster_info(struct swap_cluste= r_info *cluster_info, ci =3D cluster_info + i; /* Cluster with bad marks count will have a remaining table */ spin_lock(&ci->lock); - if (rcu_dereference_protected(ci->table, true)) { + if (cluster_table_is_alloced(ci)) { swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, true); swap_cluster_free_table(ci); } --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7F14A3BF69C; Sun, 17 May 2026 15:39:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; b=AnkZdEIOgZDATpb1vgIDFz3tn+zkvRfmSTB2N52zwL6kQ0m1BLyAOuMuC8IWhzCvh1XdkmT8mTemSbD6y4CHeNDUfD6MMYl13GPYHbfPunoGCtllkNQrkmAh1qC9+C055PKJ5ccCHFRZc6NXaFW1QEBjQIcPmpqV+ZzLCyKOc7g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; bh=lhIWzlgO0iPvpJ0krVLf6BdjLRUh5La5OGxJd6WrUXQ=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=AnzB8rJgxYbKB1IIqNXlnNYkkyQGlhFm4Qmg5mp3pkPagtYSPFfJ5zUKDgUqAO06IntHYvJ3oaqJh+2lmSiEARDTw9HhtRmaxJFST7YvYZaMKVZXwz9hiS2Up351MlxUXP1crgcF607oKdKPxtR/yqJHkDGiRv9x6aRDcJObSus= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=s+zLAs87; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="s+zLAs87" Received: by smtp.kernel.org (Postfix) with ESMTPS id 5C4B8C2BCFA; Sun, 17 May 2026 15:39:49 
+0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032389; bh=lhIWzlgO0iPvpJ0krVLf6BdjLRUh5La5OGxJd6WrUXQ=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=s+zLAs87NcMDmuQ/5gP0z7QrmStxRQT30D3gBVE/2vt/GEW7Ych1q65eJbLPfD7vh 16YmiAnNzweAIjQcU2x4cw6tqfqdSRTTLFPoagaD4jKxF0a+vBM0YbSpS5i0CTPBEl 0ByTKsHgKCFncBm1QGSvHr8IxAGZi+J323BF5JmTDyWO8VMjrcH8df6aJ8jqbsdf/g vJD0e0JSRhEl9aZ5u6QUn6CtjerQ/vOAQgL10WJxVtksd6aGPCCf+4p2ZANc35h/fA MASwyErBMHlLwz8QzDz8Knqvh54c53prfAiCXv7PDFpyEUzKPNCGnafZBjcuImHc1T +JM4Yd9hDW76Q== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 53E87CD4F47; Sun, 17 May 2026 15:39:49 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:49 +0800 Subject: [PATCH v5 10/12] mm/memcg, swap: store cgroup id in cluster table directly Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-10-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , Baolin Wang , Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1779032386; l=15426; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=xJcSmosb42r4Bg21YfxGHIrP+RMnEGQrTQryxhKS1qI=; b=al/zHgvGHr8jn+tJsgDWfWpr+1zcaDtZs3FVJmmqHvFXBvkTVNipAFp0keDAV00JGczwEi7tW 
E6Y/Voa4o3wABwHgvNm8R61cSCgMrc014uCaVw+LPif+9gLiIbepvYh X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: kasong@tencent.com From: Kairui Song Drop the usage of the swap_cgroup_ctrl, and use the dynamic cluster table instead. The per-cluster memcg table is 1024 / 512 bytes on most archs, and does not need RCU protection: the cgroup data is only read and written under the cluster lock. That keeps things simple, lets the allocation use plain kmalloc with immediate kfree (no deferred free), and keeps fragmentation acceptable. Acked-by: Chris Li Signed-off-by: Kairui Song --- include/linux/memcontrol.h | 6 +++-- include/linux/swap.h | 8 +++--- mm/memcontrol-v1.c | 42 +++++++++++++++++++----------- mm/memcontrol.c | 13 ++++++---- mm/swap.h | 4 +++ mm/swap_state.c | 6 ++--- mm/swap_table.h | 64 ++++++++++++++++++++++++++++++++++++++++++= ++++ mm/swapfile.c | 37 ++++++++++++++++++--------- mm/vmscan.c | 2 +- 9 files changed, 139 insertions(+), 43 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index a013f37f24aa..bf1a6e131eca 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -29,6 +29,7 @@ struct obj_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct swap_cluster_info; =20 /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -1899,7 +1900,7 @@ static inline void mem_cgroup_exit_user_fault(void) current->in_user_fault =3D 0; } =20 -void __memcg1_swapout(struct folio *folio); +void __memcg1_swapout(struct folio *folio, struct swap_cluster_info *ci); void memcg1_swapin(struct folio *folio); =20 #else /* CONFIG_MEMCG_V1 */ @@ -1929,7 +1930,8 @@ static inline void mem_cgroup_exit_user_fault(void) { } =20 -static inline void __memcg1_swapout(struct folio *folio) +static inline void 
__memcg1_swapout(struct folio *folio, + struct swap_cluster_info *ci) { } =20 diff --git a/include/linux/swap.h b/include/linux/swap.h index 6b3acdf9bdd4..203bbe23ba1f 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -584,12 +584,12 @@ static inline int mem_cgroup_try_charge_swap(struct f= olio *folio) return __mem_cgroup_try_charge_swap(folio); } =20 -extern void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_= pages); -static inline void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned in= t nr_pages) +extern void __mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_= pages); +static inline void mem_cgroup_uncharge_swap(unsigned short id, unsigned in= t nr_pages) { if (mem_cgroup_disabled()) return; - __mem_cgroup_uncharge_swap(entry, nr_pages); + __mem_cgroup_uncharge_swap(id, nr_pages); } =20 extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg); @@ -600,7 +600,7 @@ static inline int mem_cgroup_try_charge_swap(struct fol= io *folio) return 0; } =20 -static inline void mem_cgroup_uncharge_swap(swp_entry_t entry, +static inline void mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages) { } diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 36c507d81dc5..494e7b9adc60 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -14,6 +14,7 @@ =20 #include "internal.h" #include "swap.h" +#include "swap_table.h" #include "memcontrol-v1.h" =20 /* @@ -606,14 +607,15 @@ void memcg1_commit_charge(struct folio *folio, struct= mem_cgroup *memcg) /** * __memcg1_swapout - transfer a memsw charge to swap * @folio: folio whose memsw charge to transfer + * @ci: the locked swap cluster holding the swap entries * * Transfer the memsw charge of @folio to the swap entry stored in * folio->swap. * - * Context: folio must be isolated, unmapped, locked and is just about - * to be freed, and caller must disable IRQs. 
+ * Context: folio must be isolated, unmapped, locked and is just about to + * be freed, and caller must disable IRQs and hold the swap cluster lock. */ -void __memcg1_swapout(struct folio *folio) +void __memcg1_swapout(struct folio *folio, struct swap_cluster_info *ci) { struct mem_cgroup *memcg, *swap_memcg; struct obj_cgroup *objcg; @@ -646,7 +648,8 @@ void __memcg1_swapout(struct folio *folio) swap_memcg =3D mem_cgroup_private_id_get_online(memcg, nr_entries); mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries); =20 - swap_cgroup_record(folio, mem_cgroup_private_id(swap_memcg), folio->swap); + __swap_cgroup_set(ci, swp_cluster_offset(folio->swap), nr_entries, + mem_cgroup_private_id(swap_memcg)); =20 folio_unqueue_deferred_split(folio); folio->memcg_data =3D 0; @@ -661,8 +664,7 @@ void __memcg1_swapout(struct folio *folio) } =20 /* - * Interrupts should be disabled here because the caller holds the - * i_pages lock which is taken with interrupts-off. It is + * The caller must hold the swap cluster lock with IRQ off. It is * important here to have the interrupts disabled because it is the * only synchronisation we have for updating the per-CPU variables. */ @@ -677,7 +679,7 @@ void __memcg1_swapout(struct folio *folio) } =20 /** - * memcg1_swapin - uncharge swap slot + * memcg1_swapin - uncharge swap slot on swapin * @folio: folio being swapped in * * Call this function after successfully adding the charged @@ -687,6 +689,10 @@ void __memcg1_swapout(struct folio *folio) */ void memcg1_swapin(struct folio *folio) { + struct swap_cluster_info *ci; + unsigned long nr_pages; + unsigned short id; + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); =20 @@ -702,14 +708,20 @@ void memcg1_swapin(struct folio *folio) * correspond 1:1 to page and swap slot lifetimes: we charge the * page to memory here, and uncharge swap when the slot is freed. 
*/ - if (do_memsw_account()) { - /* - * The swap entry might not get freed for a long time, - * let's not wait for it. The page already received a - * memory+swap charge, drop the swap entry duplicate. - */ - mem_cgroup_uncharge_swap(folio->swap, folio_nr_pages(folio)); - } + if (!do_memsw_account()) + return; + + /* + * The swap entry might not get freed for a long time, + * let's not wait for it. The page already received a + * memory+swap charge, drop the swap entry duplicate. + */ + nr_pages =3D folio_nr_pages(folio); + ci =3D swap_cluster_get_and_lock(folio); + id =3D __swap_cgroup_clear(ci, swp_cluster_offset(folio->swap), + nr_pages); + swap_cluster_unlock(ci); + mem_cgroup_uncharge_swap(id, nr_pages); } =20 void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4f940cf22ffe..b5c267a061a9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -64,6 +64,7 @@ #include #include #include "internal.h" +#include "swap_table.h" #include #include #include "slab.h" @@ -5470,6 +5471,7 @@ int __init mem_cgroup_init(void) int __mem_cgroup_try_charge_swap(struct folio *folio) { unsigned int nr_pages =3D folio_nr_pages(folio); + struct swap_cluster_info *ci; struct page_counter *counter; struct mem_cgroup *memcg; struct obj_cgroup *objcg; @@ -5503,22 +5505,23 @@ int __mem_cgroup_try_charge_swap(struct folio *foli= o) } mod_memcg_state(memcg, MEMCG_SWAP, nr_pages); =20 - swap_cgroup_record(folio, mem_cgroup_private_id(memcg), folio->swap); + ci =3D swap_cluster_get_and_lock(folio); + __swap_cgroup_set(ci, swp_cluster_offset(folio->swap), nr_pages, + mem_cgroup_private_id(memcg)); + swap_cluster_unlock(ci); =20 return 0; } =20 /** * __mem_cgroup_uncharge_swap - uncharge swap space - * @entry: swap entry to uncharge + * @id: cgroup id to uncharge * @nr_pages: the amount of swap space to uncharge */ -void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) +void 
__mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages) { struct mem_cgroup *memcg; - unsigned short id; =20 - id =3D swap_cgroup_clear(entry, nr_pages); rcu_read_lock(); memcg =3D mem_cgroup_from_private_id(id); if (memcg) { diff --git a/mm/swap.h b/mm/swap.h index 8e57e9431624..5b2f095fff6e 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -5,6 +5,7 @@ #include /* for atomic_long_t */ struct mempolicy; struct swap_iocb; +struct swap_memcg_table; =20 extern int page_cluster; =20 @@ -38,6 +39,9 @@ struct swap_cluster_info { u8 order; atomic_long_t __rcu *table; /* Swap table entries, see mm/swap_table.h */ unsigned int *extend_table; /* For large swap count, protected by ci->loc= k */ +#ifdef CONFIG_MEMCG + struct swap_memcg_table *memcg_table; /* Swap table entries' cgroup recor= d */ +#endif struct list_head list; }; =20 diff --git a/mm/swap_state.c b/mm/swap_state.c index bdd949ae0044..873cb3f26337 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -179,21 +179,19 @@ static int __swap_cache_add_check(struct swap_cluster= _info *ci, if (shadowp && swp_tb_is_shadow(old_tb)) *shadowp =3D swp_tb_to_shadow(old_tb); if (memcg_id) - *memcg_id =3D lookup_swap_cgroup_id(targ_entry); + *memcg_id =3D __swap_cgroup_get(ci, ci_off); =20 if (nr =3D=3D 1) return 0; =20 - targ_entry.val =3D round_down(targ_entry.val, nr); ci_off =3D round_down(ci_off, nr); ci_end =3D ci_off + nr; do { old_tb =3D __swap_table_get(ci, ci_off); if (unlikely(swp_tb_is_folio(old_tb) || !__swp_tb_get_count(old_tb) || - (memcg_id && *memcg_id !=3D lookup_swap_cgroup_id(targ_entry)))) + (memcg_id && *memcg_id !=3D __swap_cgroup_get(ci, ci_off)))) return -EBUSY; - targ_entry.val++; } while (++ci_off < ci_end); =20 return 0; diff --git a/mm/swap_table.h b/mm/swap_table.h index 8415ffbe2b9c..b4e1100f8296 100644 --- a/mm/swap_table.h +++ b/mm/swap_table.h @@ -11,6 +11,11 @@ struct swap_table { atomic_long_t entries[SWAPFILE_CLUSTER]; }; =20 +/* For storing memcg private id */ +struct 
swap_memcg_table { + unsigned short id[SWAPFILE_CLUSTER]; +}; + #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) =3D=3D PAGE_SIZE) =20 /* @@ -247,4 +252,63 @@ static inline unsigned long swap_table_get(struct swap= _cluster_info *ci, =20 return swp_tb; } + +#ifdef CONFIG_MEMCG +static inline void __swap_cgroup_set(struct swap_cluster_info *ci, + unsigned int ci_off, unsigned long nr, unsigned short id) +{ + lockdep_assert_held(&ci->lock); + VM_WARN_ON_ONCE(ci_off >=3D SWAPFILE_CLUSTER); + if (WARN_ON_ONCE(!ci->memcg_table)) + return; + do { + ci->memcg_table->id[ci_off++] =3D id; + } while (--nr); +} + +static inline unsigned short __swap_cgroup_get(struct swap_cluster_info *c= i, + unsigned int ci_off) +{ + lockdep_assert_held(&ci->lock); + VM_WARN_ON_ONCE(ci_off >=3D SWAPFILE_CLUSTER); + if (unlikely(!ci->memcg_table)) + return 0; + return ci->memcg_table->id[ci_off]; +} + +static inline unsigned short __swap_cgroup_clear(struct swap_cluster_info = *ci, + unsigned int ci_off, + unsigned long nr) +{ + unsigned short old =3D __swap_cgroup_get(ci, ci_off); + + if (!old) + return 0; + do { + VM_WARN_ON_ONCE(ci->memcg_table->id[ci_off] !=3D old); + ci->memcg_table->id[ci_off++] =3D 0; + } while (--nr); + + return old; +} +#else +static inline void __swap_cgroup_set(struct swap_cluster_info *ci, + unsigned int ci_off, unsigned long nr, unsigned short id) +{ +} + +static inline unsigned short __swap_cgroup_get(struct swap_cluster_info *c= i, + unsigned int ci_off) +{ + return 0; +} + +static inline unsigned short __swap_cgroup_clear(struct swap_cluster_info = *ci, + unsigned int ci_off, + unsigned long nr) +{ + return 0; +} +#endif + #endif diff --git a/mm/swapfile.c b/mm/swapfile.c index 7740ba99f87e..ae14d4049e4b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -423,7 +423,12 @@ static void swap_cluster_free_table(struct swap_cluste= r_info *ci) { struct swap_table *table; =20 - table =3D (struct swap_table *)rcu_dereference_protected(ci->table, true); +#ifdef 
CONFIG_MEMCG + kfree(ci->memcg_table); + ci->memcg_table =3D NULL; +#endif + + table =3D (struct swap_table *)rcu_access_pointer(ci->table); if (!table) return; =20 @@ -441,6 +446,7 @@ static int swap_cluster_alloc_table(struct swap_cluster= _info *ci, gfp_t gfp) { struct swap_table *table =3D NULL; struct folio *folio; + int ret =3D 0; =20 /* The cluster must be empty and not on any list during allocation. */ VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); @@ -458,7 +464,19 @@ static int swap_cluster_alloc_table(struct swap_cluste= r_info *ci, gfp_t gfp) return -ENOMEM; =20 rcu_assign_pointer(ci->table, table); - return 0; + +#ifdef CONFIG_MEMCG + if (!mem_cgroup_disabled()) { + VM_WARN_ON_ONCE(ci->memcg_table); + ci->memcg_table =3D kzalloc_obj(*ci->memcg_table, gfp); + if (!ci->memcg_table) + ret =3D -ENOMEM; + } +#endif + if (ret) + swap_cluster_free_table(ci); + + return ret; } =20 /* @@ -483,6 +501,7 @@ static void swap_cluster_assert_empty(struct swap_clust= er_info *ci, bad_slots++; else WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); + WARN_ON_ONCE(__swap_cgroup_get(ci, ci_off)); } while (++ci_off < ci_end); =20 WARN_ON_ONCE(bad_slots !=3D (swapoff ? ci->count : 0)); @@ -1861,12 +1880,10 @@ void __swap_cluster_free_entries(struct swap_info_s= truct *si, unsigned int ci_start, unsigned int nr_pages) { unsigned long old_tb; - unsigned int type =3D si->type; unsigned short batch_id =3D 0, id_cur; unsigned int ci_off =3D ci_start, ci_end =3D ci_start + nr_pages; unsigned long ci_head =3D cluster_offset(si, ci); unsigned int batch_off =3D ci_off; - swp_entry_t entry; =20 VM_WARN_ON(ci->count < nr_pages); =20 @@ -1884,21 +1901,17 @@ void __swap_cluster_free_entries(struct swap_info_s= truct *si, * Uncharge swap slots by memcg in batches. Consecutive * slots with the same cgroup id are uncharged together. 
*/ - entry =3D swp_entry(type, ci_head + ci_off); - id_cur =3D lookup_swap_cgroup_id(entry); + id_cur =3D __swap_cgroup_clear(ci, ci_off, 1); if (batch_id !=3D id_cur) { if (batch_id) - mem_cgroup_uncharge_swap(swp_entry(type, ci_head + batch_off), - ci_off - batch_off); + mem_cgroup_uncharge_swap(batch_id, ci_off - batch_off); batch_id =3D id_cur; batch_off =3D ci_off; } } while (++ci_off < ci_end); =20 - if (batch_id) { - mem_cgroup_uncharge_swap(swp_entry(type, ci_head + batch_off), - ci_off - batch_off); - } + if (batch_id) + mem_cgroup_uncharge_swap(batch_id, ci_off - batch_off); =20 swap_range_free(si, ci_head + ci_start, nr_pages); swap_cluster_assert_empty(ci, ci_start, nr_pages, false); diff --git a/mm/vmscan.c b/mm/vmscan.c index 924c84326551..ca4533eba701 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -737,7 +737,7 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, =20 if (reclaimed && !mapping_exiting(mapping)) shadow =3D workingset_eviction(folio, target_memcg); - __memcg1_swapout(folio); + __memcg1_swapout(folio, ci); __swap_cache_del_folio(ci, folio, swap, shadow); swap_cluster_unlock_irq(ci); } else { --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A5E353B995D; Sun, 17 May 2026 15:39:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; b=LfL9i1Z2uPPJxuzTnC4LypksNTbKigUmgnCqJR4HT9QMtnCmHUCha+1zyC+3CwF8UiwdJyvhtgUWG0OWUwR4/09YpNQGCBwI5MFncmxO6JGlephStdBbGme+SqoNTY7WGxxgsGx846WkyjFc25Ja+sSUyNO7x4RCix9BK4WtPnI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; 
bh=yGvnZGsm49ieaEy1P65cpRxDGwB0e/972Jb+fgbhqgY=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=lsMgUX7M1tNldVLyTOVmAGUj4VpIbSyMboBBmMAfnH+8FCyAVaz41nuLNXpVRU+zt0ofKvXbtYuOqz2EurDzQm3ukwgjRhTb6ZIwi5kf13mtpJ+WCHMU+ttOp6jw6cDY942+dpBtQ3XjbA5j6AiNkSGdlDRCVxoefdIF50UD+1M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=fU4e6tDK; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="fU4e6tDK" Received: by smtp.kernel.org (Postfix) with ESMTPS id 71038C2BCC6; Sun, 17 May 2026 15:39:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032389; bh=yGvnZGsm49ieaEy1P65cpRxDGwB0e/972Jb+fgbhqgY=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=fU4e6tDKyiigJ2weNWJHsMDM5xcbsqrWXxlqm4hz0LebMNzwRyrwCvXAxK3yOppsk 1CjtzRMM8BguXqAEgqvYUgCUvUdmENy8z2Vj7kDlL92fUij6b7tyxGIbP7qPaj1lVp CRhHa7d0MYZ9GFHuHTIMFNgXmOBiakiz/oXCrhht1+EvDoFI+VOMwQneWT3vGK+Kv5 wg82ylsdJmbJFp/5ks4Wo/YuqatG5TzI7FcHcvcW6nJom0D3ZkuT9aInM8ckeXQ/yo wa1/5a82cd8gxCbTDpsFMpkNIJQVVeZi9cfniSi3jVv2CO0RG6JF4ilDKEJREDVhgh TCinGhPp+F6Tw== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6700ECD4F4A; Sun, 17 May 2026 15:39:49 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:50 +0800 Subject: [PATCH v5 11/12] mm/memcg: remove no longer used swap cgroup array Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-11-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: 
<20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , Baolin Wang , Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1779032386; l=10199; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=ta79Avu/thElyBUdmQaqOrzxqO0x3dOx+PK/95QYg2I=; b=QjCLrMxdbgPbyx3P6mHcPL3X71ABuhjk8mPMc0wAGaat2yXpmLdgqZE4yIvhAMZbgTuUR1KM9 ZDzo+FxGTNaCzxk2Zvf0BMk2ECZ9IIEDsFv7dqp+26cvmg/OrIPPQHx X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: kasong@tencent.com From: Kairui Song

Now that all swap cgroup records are stored in the swap cluster directly, the static array is no longer needed.
Acked-by: Chris Li Signed-off-by: Kairui Song --- MAINTAINERS | 1 - include/linux/swap_cgroup.h | 47 ------------ mm/Makefile | 3 - mm/internal.h | 1 - mm/memcontrol-v1.c | 1 - mm/memcontrol.c | 1 - mm/swap_cgroup.c | 174 ----------------------------------------= ---- mm/swapfile.c | 8 -- 8 files changed, 236 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 0116eb99b708..9be179722d42 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6564,7 +6564,6 @@ F: mm/memcontrol.c F: mm/memcontrol-v1.c F: mm/memcontrol-v1.h F: mm/page_counter.c -F: mm/swap_cgroup.c F: samples/cgroup/* F: tools/testing/selftests/cgroup/memcg_protection.m F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h deleted file mode 100644 index 91cdf12190a0..000000000000 --- a/include/linux/swap_cgroup.h +++ /dev/null @@ -1,47 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef __LINUX_SWAP_CGROUP_H -#define __LINUX_SWAP_CGROUP_H - -#include - -#if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP) - -extern void swap_cgroup_record(struct folio *folio, unsigned short id, swp= _entry_t ent); -extern unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_e= nts); -extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); -extern int swap_cgroup_swapon(int type, unsigned long max_pages); -extern void swap_cgroup_swapoff(int type); - -#else - -static inline -void swap_cgroup_record(struct folio *folio, unsigned short id, swp_entry_= t ent) -{ -} - -static inline -unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents) -{ - return 0; -} - -static inline -unsigned short lookup_swap_cgroup_id(swp_entry_t ent) -{ - return 0; -} - -static inline int -swap_cgroup_swapon(int type, unsigned long max_pages) -{ - return 0; -} - -static inline void swap_cgroup_swapoff(int type) -{ - return; -} - -#endif - -#endif /* __LINUX_SWAP_CGROUP_H */ diff --git a/mm/Makefile b/mm/Makefile index 
8ad2ab08244e..eff9f9e7e061 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -103,9 +103,6 @@ obj-$(CONFIG_PAGE_COUNTER) +=3D page_counter.o obj-$(CONFIG_LIVEUPDATE_MEMFD) +=3D memfd_luo.o obj-$(CONFIG_MEMCG_V1) +=3D memcontrol-v1.o obj-$(CONFIG_MEMCG) +=3D memcontrol.o vmpressure.o -ifdef CONFIG_SWAP -obj-$(CONFIG_MEMCG) +=3D swap_cgroup.o -endif ifdef CONFIG_BPF_SYSCALL obj-$(CONFIG_MEMCG) +=3D bpf_memcontrol.o endif diff --git a/mm/internal.h b/mm/internal.h index 9d2fec696bd6..7646ecb9d621 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -17,7 +17,6 @@ #include #include #include -#include #include =20 /* Internal core VMA manipulation functions. */ diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 494e7b9adc60..08be1a752c2e 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -5,7 +5,6 @@ #include #include #include -#include #include #include #include diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b5c267a061a9..039e9bc8971c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -54,7 +54,6 @@ #include #include #include -#include #include #include #include diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c deleted file mode 100644 index 95c38e54dd58..000000000000 --- a/mm/swap_cgroup.c +++ /dev/null @@ -1,174 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -#include -#include -#include - -#include /* depends on mm.h include */ - -static DEFINE_MUTEX(swap_cgroup_mutex); - -/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t)= */ -#define ID_PER_SC (sizeof(struct swap_cgroup) / sizeof(unsigned short)) -#define ID_SHIFT (BITS_PER_TYPE(unsigned short)) -#define ID_MASK (BIT(ID_SHIFT) - 1) -struct swap_cgroup { - atomic_t ids; -}; - -struct swap_cgroup_ctrl { - struct swap_cgroup *map; -}; - -static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES]; - -static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map, - pgoff_t offset) -{ - unsigned int shift =3D (offset % ID_PER_SC) * ID_SHIFT; - unsigned int old_ids =3D 
atomic_read(&map[offset / ID_PER_SC].ids); - - BUILD_BUG_ON(!is_power_of_2(ID_PER_SC)); - BUILD_BUG_ON(sizeof(struct swap_cgroup) !=3D sizeof(atomic_t)); - - return (old_ids >> shift) & ID_MASK; -} - -static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map, - pgoff_t offset, - unsigned short new_id) -{ - unsigned short old_id; - struct swap_cgroup *sc =3D &map[offset / ID_PER_SC]; - unsigned int shift =3D (offset % ID_PER_SC) * ID_SHIFT; - unsigned int new_ids, old_ids =3D atomic_read(&sc->ids); - - do { - old_id =3D (old_ids >> shift) & ID_MASK; - new_ids =3D (old_ids & ~(ID_MASK << shift)); - new_ids |=3D ((unsigned int)new_id) << shift; - } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids)); - - return old_id; -} - -/** - * swap_cgroup_record - record mem_cgroup for a set of swap entries. - * These entries must belong to one single folio, and that folio - * must be being charged for swap space (swap out), and these - * entries must not have been charged - * - * @folio: the folio that the swap entry belongs to - * @id: mem_cgroup ID to be recorded - * @ent: the first swap entry to be recorded - */ -void swap_cgroup_record(struct folio *folio, unsigned short id, - swp_entry_t ent) -{ - unsigned int nr_ents =3D folio_nr_pages(folio); - struct swap_cgroup *map; - pgoff_t offset, end; - unsigned short old; - - offset =3D swp_offset(ent); - end =3D offset + nr_ents; - map =3D swap_cgroup_ctrl[swp_type(ent)].map; - - do { - old =3D __swap_cgroup_id_xchg(map, offset, id); - VM_BUG_ON(old); - } while (++offset !=3D end); -} - -/** - * swap_cgroup_clear - clear mem_cgroup for a set of swap entries. - * These entries must be being uncharged from swap. They either - * belongs to one single folio in the swap cache (swap in for - * cgroup v1), or no longer have any users (slot freeing). - * - * @ent: the first swap entry to be recorded into - * @nr_ents: number of swap entries to be recorded - * - * Returns the existing old value. 
- */ -unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents) -{ - pgoff_t offset, end; - struct swap_cgroup *map; - unsigned short old, iter =3D 0; - - offset =3D swp_offset(ent); - end =3D offset + nr_ents; - map =3D swap_cgroup_ctrl[swp_type(ent)].map; - - do { - old =3D __swap_cgroup_id_xchg(map, offset, 0); - if (!iter) - iter =3D old; - VM_BUG_ON(iter !=3D old); - } while (++offset !=3D end); - - return old; -} - -/** - * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry - * @ent: swap entry to be looked up. - * - * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID) - */ -unsigned short lookup_swap_cgroup_id(swp_entry_t ent) -{ - struct swap_cgroup_ctrl *ctrl; - - if (mem_cgroup_disabled()) - return 0; - - ctrl =3D &swap_cgroup_ctrl[swp_type(ent)]; - if (unlikely(!ctrl->map)) - return 0; - return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent)); -} - -int swap_cgroup_swapon(int type, unsigned long max_pages) -{ - struct swap_cgroup *map; - struct swap_cgroup_ctrl *ctrl; - - if (mem_cgroup_disabled()) - return 0; - - BUILD_BUG_ON(sizeof(unsigned short) * ID_PER_SC !=3D - sizeof(struct swap_cgroup)); - map =3D vzalloc(DIV_ROUND_UP(max_pages, ID_PER_SC) * - sizeof(struct swap_cgroup)); - if (!map) - goto nomem; - - ctrl =3D &swap_cgroup_ctrl[type]; - mutex_lock(&swap_cgroup_mutex); - ctrl->map =3D map; - mutex_unlock(&swap_cgroup_mutex); - - return 0; -nomem: - pr_info("couldn't allocate enough memory for swap_cgroup\n"); - pr_info("swap_cgroup can be disabled by swapaccount=3D0 boot option\n"); - return -ENOMEM; -} - -void swap_cgroup_swapoff(int type) -{ - struct swap_cgroup *map; - struct swap_cgroup_ctrl *ctrl; - - if (mem_cgroup_disabled()) - return; - - mutex_lock(&swap_cgroup_mutex); - ctrl =3D &swap_cgroup_ctrl[type]; - map =3D ctrl->map; - ctrl->map =3D NULL; - mutex_unlock(&swap_cgroup_mutex); - - vfree(map); -} diff --git a/mm/swapfile.c b/mm/swapfile.c index ae14d4049e4b..095e9c953e49 100644 --- 
a/mm/swapfile.c +++ b/mm/swapfile.c @@ -45,7 +45,6 @@ =20 #include #include -#include #include "swap_table.h" #include "internal.h" #include "swap.h" @@ -3200,8 +3199,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special= file) p->global_cluster =3D NULL; kvfree(zeromap); free_swap_cluster_info(cluster_info, maxpages); - /* Destroy swap account information */ - swap_cgroup_swapoff(p->type); =20 inode =3D mapping->host; =20 @@ -3732,10 +3729,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, special= file, int, swap_flags) if (error) goto bad_swap_unlock_inode; =20 - error =3D swap_cgroup_swapon(si->type, maxpages); - if (error) - goto bad_swap_unlock_inode; - /* * Use kvmalloc_array instead of bitmap_zalloc as the allocation order mi= ght * be above MAX_PAGE_ORDER incase of a large swap file. @@ -3846,7 +3839,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialf= ile, int, swap_flags) si->global_cluster =3D NULL; inode =3D NULL; destroy_swap_extents(si, swap_file); - swap_cgroup_swapoff(si->type); free_swap_cluster_info(si->cluster_info, si->max); si->cluster_info =3D NULL; kvfree(si->zeromap); --=20 2.54.0 From nobody Mon May 25 06:42:36 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B184A3C0601; Sun, 17 May 2026 15:39:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; cv=none; b=YKLryBBqYQmzQuq6pXmZdvSfMEaKtm9lS38G1H+4f0AyNtYyQcyp1yQn+cv4KodZQNhXdNpW9ITnMjbkUPrW6xi2yd6RvO08meNK3IsV2QpOc4ebBKlJoS/uGIDA3chRc5kBUR8nN6VHfdmOiZpOD4Mo77cEl9O72Q9/knDwdt8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779032389; c=relaxed/simple; bh=FTZF+GCZrsIHcfofs6t9OX/YECec6/F/i6KkyQtFLsg=; 
h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=hdlCTCnNzj2MlogxUtqjZUnV03KJqx3EuIdpuVKXuKrqeRjiXChyRSKJlp7lTMynJS6w+RCOTij/pU65H9bIpAlP+1TPh2Epvjvu+BUXbokB3xdltv1sVucWt5OguXE/4gRv2R4/UGa4uys2qbtFVZ6+OlxCL/ypk5I2hR92ciM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=figebiAZ; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="figebiAZ" Received: by smtp.kernel.org (Postfix) with ESMTPS id 87487C2BCF5; Sun, 17 May 2026 15:39:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1779032389; bh=FTZF+GCZrsIHcfofs6t9OX/YECec6/F/i6KkyQtFLsg=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=figebiAZjhV+uTfYnOQX4+ZSkG1TWC1bYfpjCfVmHAfUod2meQILXG7xsq0/c+KZR 75qA38OWFQwAiiT2G7JNy8wV17VHvmCMUOGAizVW55PiVCjB/j3POGETIp99Ww29rt qkZaGk0i2l1vz46TixisedpHv2U5bmpNGq+VKt2belKLDhpnhs9DCMIvPodjWK86MS BBWL14oR95FSMKTatEkFDFRrrfSewFbPyepl/Sr+twTdAtkiPS7477dVzhrtzvtK0d gS850OvVUKM4momS5mZZmIBwecLnXwH5lfK4F+NcOWdQibo6utH3SJ8ILhBqSs9/jU yPqgl9Y2uACbQ== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7DBBBCD4F4B; Sun, 17 May 2026 15:39:49 +0000 (UTC) From: Kairui Song via B4 Relay Date: Sun, 17 May 2026 23:39:51 +0800 Subject: [PATCH v5 12/12] mm, swap: merge zeromap into swap table Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260517-swap-table-p4-v5-12-88ae43e064c7@tencent.com> References: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> In-Reply-To: <20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com> To: 
linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Zi Yan , Baolin Wang , Barry Song , Hugh Dickins , Chris Li , Kemeng Shi , Nhat Pham , Baoquan He , Johannes Weiner , Youngjun Park , Chengming Zhou , Roman Gushchin , Shakeel Butt , Muchun Song , Usama Arif , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Kairui Song , Lorenzo Stoakes , Yosry Ahmed , Qi Zheng X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1779032386; l=25679; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=gttVtNi/7R4eUHbXAfn8Ru41nAabUu7c8GCjLiuwTrA=; b=7H4CNCyOHcGJpeqK0fJEfCdsiOq3F6jvaclr2oRPX6acDIdFYIXyXZori//qPMy4JFkv196ij MbAxFYc7qvnAJMV8sXk1KXcrwsXXJZEs5GgVlBNjNoEfW+TKcmjmeos X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= X-Endpoint-Received: by B4 Relay for kasong@tencent.com/kasong-sign-tencent with auth_id=562 X-Original-From: Kairui Song Reply-To: kasong@tencent.com From: Kairui Song

By allocating one additional bit in the swap table entry's flags field alongside the count, we can store the zeromap inline.

On 64-bit systems, the zero mark is stored in the swap table itself, so no separate zeromap has to be allocated, which reduces memory usage. That is the happy path.

On certain 32-bit archs, there might not be enough bits in a swap table entry to hold both the PFN and the flags. For those, each cluster conditionally gets a zeromap field at build time and uses that instead. An empty cluster does not allocate a zeromap, so memory is still saved whenever the swap file's clusters are not all in use. In the worst case, when all clusters are fully populated, memory usage is similar to the previous zeromap implementation.

A few macros were moved to different headers so that they are available for the build-time struct definition.
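For illustration only (not part of this patch): a standalone sketch of the inline flag layout described above, assuming a 64-bit long, 4K pages and 52 physical address bits. The EX_-prefixed names are hypothetical stand-ins for the SWAP_CACHE_* / SWP_TB_* macros in the diff, and the ex_*() helpers are made up for this example; they are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed example parameters (hypothetical 64-bit config, not from the patch). */
#define EX_BITS_PER_LONG   64
#define EX_PAGE_SHIFT      12
#define EX_PHYSMEM_BITS    52

/* Stand-ins for the patch's macros, renamed with an EX_ prefix. */
#define EX_PFN_BITS        (EX_PHYSMEM_BITS - EX_PAGE_SHIFT)
#define EX_PFN_MARK_BITS   2
#define EX_PFN_MARK        UINT64_C(0x2)
#define EX_COUNT_MIN_BITS  2
/* Inline zero flag only if more than the minimum count bits are spare. */
#define EX_HAS_ZEROFLAG \
	((EX_BITS_PER_LONG - EX_PFN_MARK_BITS - EX_PFN_BITS) > EX_COUNT_MIN_BITS)
#define EX_MIN(a, b)       ((a) < (b) ? (a) : (b))
#define EX_FLAGS_BITS \
	EX_MIN(5, EX_BITS_PER_LONG - (EX_PFN_BITS + EX_PFN_MARK_BITS))
#define EX_COUNT_BITS      (EX_FLAGS_BITS - EX_HAS_ZEROFLAG)
#define EX_FLAGS_SHIFT     (EX_BITS_PER_LONG - EX_FLAGS_BITS)
#define EX_COUNT_SHIFT     (EX_BITS_PER_LONG - EX_COUNT_BITS)
#define EX_FLAGS_MASK      (~(~UINT64_C(0) >> EX_FLAGS_BITS))
/* The zero flag is the lowest bit of the flags field. */
#define EX_ZERO_FLAG       (UINT64_C(1) << EX_FLAGS_SHIFT)

/* This sketch only models the inline case, so insist on it at build time. */
_Static_assert(EX_HAS_ZEROFLAG, "sketch assumes the inline zero flag fits");
_Static_assert(EX_COUNT_BITS >= EX_COUNT_MIN_BITS, "count field too narrow");

/* Build a PFN-type entry: | COUNT | Z | PFN | 10 | */
static inline uint64_t ex_pfn_entry(uint64_t pfn, unsigned int count, bool zero)
{
	uint64_t e = (pfn << EX_PFN_MARK_BITS) | EX_PFN_MARK;

	e |= (uint64_t)count << EX_COUNT_SHIFT;
	if (zero)
		e |= EX_ZERO_FLAG;
	return e;
}

static inline unsigned int ex_entry_count(uint64_t e)
{
	return (unsigned int)(e >> EX_COUNT_SHIFT);
}

static inline bool ex_entry_is_zero(uint64_t e)
{
	return (e & EX_ZERO_FLAG) != 0;
}

static inline uint64_t ex_entry_pfn(uint64_t e)
{
	/* Drop the flags field on top, then the type marker at the bottom. */
	return (e & ~EX_FLAGS_MASK) >> EX_PFN_MARK_BITS;
}
```

With these assumed values the flags field is 5 bits wide: 4 count bits at the top and the zero bit directly below them. On a config with fewer than 3 spare bits, the patch falls back to a per-cluster bitmap instead, which this sketch does not model.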
Acked-by: Chris Li Reviewed-by: Youngjun Park Signed-off-by: Kairui Song --- include/linux/swap.h | 1 - mm/memory.c | 11 +---- mm/page_io.c | 61 +++++++++++++++++++++++---- mm/swap.h | 51 +++++++++-------------- mm/swap_state.c | 14 ++++--- mm/swap_table.h | 115 +++++++++++++++++++++++++++++++++++++----------= ---- mm/swapfile.c | 54 +++++++++++------------- 7 files changed, 191 insertions(+), 116 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 203bbe23ba1f..6d72778e6cc3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -253,7 +253,6 @@ struct swap_info_struct { struct plist_node list; /* entry in swap_active_head */ signed char type; /* strange name for an index */ unsigned int max; /* size of this swap device */ - unsigned long *zeromap; /* kvmalloc'ed bitmap to track zero pages */ struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ struct list_head free_clusters; /* free clusters list */ struct list_head full_clusters; /* full clusters list */ diff --git a/mm/memory.c b/mm/memory.c index da891bcce59c..7c020995eafc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4611,13 +4611,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault= *vmf) =20 #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* - * Check if the PTEs within a range are contiguous swap entries - * and have consistent swapcache, zeromap. + * Check if the PTEs within a range are contiguous swap entries. */ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) { unsigned long addr; - softleaf_t entry; int idx; pte_t pte; =20 @@ -4627,18 +4625,13 @@ static bool can_swapin_thp(struct vm_fault *vmf, pt= e_t *ptep, int nr_pages) =20 if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) return false; - entry =3D softleaf_from_pte(pte); - if (swap_pte_batch(ptep, nr_pages, pte) !=3D nr_pages) - return false; - /* * swap_read_folio() can't handle the case a large folio is hybridly * from different backends. 
And they are likely corner cases. Similar * things might be added once zswap support large folios. */ - if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) !=3D nr_pages)) + if (swap_pte_batch(ptep, nr_pages, pte) !=3D nr_pages) return false; - return true; } =20 diff --git a/mm/page_io.c b/mm/page_io.c index 7ed76592e20d..f2d8fe7fd057 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -26,6 +26,7 @@ #include #include #include "swap.h" +#include "swap_table.h" =20 static void __end_swap_bio_write(struct bio *bio) { @@ -204,15 +205,20 @@ static bool is_folio_zero_filled(struct folio *folio) static void swap_zeromap_folio_set(struct folio *folio) { struct obj_cgroup *objcg =3D get_obj_cgroup_from_folio(folio); - struct swap_info_struct *sis =3D __swap_entry_to_info(folio->swap); int nr_pages =3D folio_nr_pages(folio); + struct swap_cluster_info *ci; swp_entry_t entry; unsigned int i; =20 + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + + ci =3D swap_cluster_get_and_lock(folio); for (i =3D 0; i < folio_nr_pages(folio); i++) { entry =3D page_swap_entry(folio_page(folio, i)); - set_bit(swp_offset(entry), sis->zeromap); + __swap_table_set_zero(ci, swp_cluster_offset(entry)); } + swap_cluster_unlock(ci); =20 count_vm_events(SWPOUT_ZERO, nr_pages); if (objcg) { @@ -223,14 +229,19 @@ static void swap_zeromap_folio_set(struct folio *foli= o) =20 static void swap_zeromap_folio_clear(struct folio *folio) { - struct swap_info_struct *sis =3D __swap_entry_to_info(folio->swap); + struct swap_cluster_info *ci; swp_entry_t entry; unsigned int i; =20 + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + + ci =3D swap_cluster_get_and_lock(folio); for (i =3D 0; i < folio_nr_pages(folio); i++) { entry =3D page_swap_entry(folio_page(folio, i)); - clear_bit(swp_offset(entry), sis->zeromap); + __swap_table_clear_zero(ci, swp_cluster_offset(entry)); 
} + swap_cluster_unlock(ci); } =20 /* @@ -255,10 +266,9 @@ int swap_writeout(struct folio *folio, struct swap_ioc= b **swap_plug) } =20 /* - * Use a bitmap (zeromap) to avoid doing IO for zero-filled pages. - * The bits in zeromap are protected by the locked swapcache folio - * and atomic updates are used to protect against read-modify-write - * corruption due to other zero swap entries seeing concurrent updates. + * Use the swap table zero mark to avoid doing IO for zero-filled + * pages. The zero mark is protected by the cluster lock, which is + * acquired internally by swap_zeromap_folio_set/clear. */ if (is_folio_zero_filled(folio)) { swap_zeromap_folio_set(folio); @@ -509,19 +519,52 @@ static void sio_read_complete(struct kiocb *iocb, lon= g ret) mempool_free(sio, sio_pool); } =20 +/* + * Return the count of contiguous swap entries that share the same + * zeromap status as the starting entry. If is_zerop is not NULL, + * it will return the zeromap status of the starting entry. + * + * Context: Caller must ensure the cluster containing the entries + * that will be checked won't be freed. + */ +static int swap_zeromap_batch(swp_entry_t entry, int max_nr, + bool *is_zerop) +{ + int i; + bool is_zero; + unsigned int ci_start =3D swp_cluster_offset(entry); + struct swap_cluster_info *ci =3D __swap_entry_to_cluster(entry); + + VM_WARN_ON_ONCE(ci_start + max_nr > SWAPFILE_CLUSTER); + + rcu_read_lock(); + is_zero =3D __swap_table_test_zero(ci, ci_start); + for (i =3D 1; i < max_nr; i++) + if (is_zero !=3D __swap_table_test_zero(ci, ci_start + i)) + break; + rcu_read_unlock(); + if (is_zerop) + *is_zerop =3D is_zero; + + return i; +} + static bool swap_read_folio_zeromap(struct folio *folio) { int nr_pages =3D folio_nr_pages(folio); struct obj_cgroup *objcg; bool is_zeromap; =20 + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + /* * Swapping in a large folio that is partially in the zeromap is not * currently handled. 
Return true without marking the folio uptodate so * that an IO error is emitted (e.g. do_swap_page() will sigbus). + * Folio lock stabilizes the cluster and map, so the check is safe. */ if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages, - &is_zeromap) !=3D nr_pages)) + &is_zeromap) !=3D nr_pages)) return true; =20 if (!is_zeromap) diff --git a/mm/swap.h b/mm/swap.h index 5b2f095fff6e..81c06aae7ccd 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -3,12 +3,29 @@ #define _MM_SWAP_H =20 #include /* for atomic_long_t */ +#include /* for PAGE_SHIFT */ struct mempolicy; struct swap_iocb; struct swap_memcg_table; =20 extern int page_cluster; =20 +#if defined(MAX_POSSIBLE_PHYSMEM_BITS) +#define SWAP_CACHE_PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) +#elif defined(MAX_PHYSMEM_BITS) +#define SWAP_CACHE_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) +#else +#define SWAP_CACHE_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT) +#endif + +/* Swap table marker, 0x1 means shadow, 0x2 means PFN (SWP_TB_PFN_MARK) */ +#define SWAP_CACHE_PFN_MARK_BITS 2 +/* At least 2 bits are needed to distinguish SWP_TB_COUNT_MAX, 1 and 0 */ +#define SWAP_COUNT_MIN_BITS 2 +/* If there are enough bits besides PFN and marker, store zero flag inline= */ +#define SWAP_TABLE_HAS_ZEROFLAG ((BITS_PER_LONG - SWAP_CACHE_PFN_MARK_BIT= S - \ + SWAP_CACHE_PFN_BITS) > SWAP_COUNT_MIN_BITS) + #ifdef CONFIG_THP_SWAP #define SWAPFILE_CLUSTER HPAGE_PMD_NR #define swap_entry_order(order) (order) @@ -41,6 +58,9 @@ struct swap_cluster_info { unsigned int *extend_table; /* For large swap count, protected by ci->loc= k */ #ifdef CONFIG_MEMCG struct swap_memcg_table *memcg_table; /* Swap table entries' cgroup recor= d */ +#endif +#if !SWAP_TABLE_HAS_ZEROFLAG + unsigned long *zero_bitmap; #endif struct list_head list; }; @@ -314,31 +334,6 @@ static inline unsigned int folio_swap_flags(struct fol= io *folio) return __swap_entry_to_info(folio->swap)->flags; } =20 -/* - * Return the count of contiguous swap entries that share the same - 
* zeromap status as the starting entry. If is_zeromap is not NULL, - * it will return the zeromap status of the starting entry. - */ -static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, - bool *is_zeromap) -{ - struct swap_info_struct *sis =3D __swap_entry_to_info(entry); - unsigned long start =3D swp_offset(entry); - unsigned long end =3D start + max_nr; - bool first_bit; - - first_bit =3D test_bit(start, sis->zeromap); - if (is_zeromap) - *is_zeromap =3D first_bit; - - if (max_nr <=3D 1) - return max_nr; - if (first_bit) - return find_next_zero_bit(sis->zeromap, end, start) - start; - else - return find_next_bit(sis->zeromap, end, start) - start; -} - #else /* CONFIG_SWAP */ struct swap_iocb; static inline struct swap_cluster_info *swap_cluster_lock( @@ -476,11 +471,5 @@ static inline unsigned int folio_swap_flags(struct fol= io *folio) { return 0; } - -static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, - bool *has_zeromap) -{ - return 0; -} #endif /* CONFIG_SWAP */ #endif /* _MM_SWAP_H */ diff --git a/mm/swap_state.c b/mm/swap_state.c index 873cb3f26337..04f5ce992401 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -160,6 +160,7 @@ static int __swap_cache_add_check(struct swap_cluster_i= nfo *ci, { unsigned int ci_off, ci_end; unsigned long old_tb; + bool is_zero; =20 lockdep_assert_held(&ci->lock); =20 @@ -184,12 +185,14 @@ static int __swap_cache_add_check(struct swap_cluster= _info *ci, if (nr =3D=3D 1) return 0; =20 + is_zero =3D __swap_table_test_zero(ci, ci_off); ci_off =3D round_down(ci_off, nr); ci_end =3D ci_off + nr; do { old_tb =3D __swap_table_get(ci, ci_off); if (unlikely(swp_tb_is_folio(old_tb) || !__swp_tb_get_count(old_tb) || + is_zero !=3D __swap_table_test_zero(ci, ci_off) || (memcg_id && *memcg_id !=3D __swap_cgroup_get(ci, ci_off)))) return -EBUSY; } while (++ci_off < ci_end); @@ -213,7 +216,7 @@ static void __swap_cache_do_add_folio(struct swap_clust= er_info *ci, do { old_tb =3D __swap_table_get(ci, 
ci_off);
 		VM_WARN_ON_ONCE(swp_tb_is_folio(old_tb));
-		__swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb)));
+		__swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_flags(old_tb)));
 	} while (++ci_off < ci_end);
 
 	folio_ref_add(folio, nr_pages);
@@ -249,7 +252,6 @@ static void __swap_cache_do_del_folio(struct swap_cluster_info *ci,
 				      struct folio *folio,
 				      swp_entry_t entry, void *shadow)
 {
-	int count;
 	unsigned long old_tb;
 	struct swap_info_struct *si;
 	unsigned int ci_start, ci_off, ci_end;
@@ -269,13 +271,13 @@ static void __swap_cache_do_del_folio(struct swap_cluster_info *ci,
 		old_tb = __swap_table_get(ci, ci_off);
 		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
 			     swp_tb_to_folio(old_tb) != folio);
-		count = __swp_tb_get_count(old_tb);
-		if (count)
+		if (__swp_tb_get_count(old_tb))
 			folio_swapped = true;
 		else
 			need_free = true;
 		/* If shadow is NULL, we set an empty shadow. */
-		__swap_table_set(ci, ci_off, shadow_to_swp_tb(shadow, count));
+		__swap_table_set(ci, ci_off, shadow_to_swp_tb(shadow,
+				 __swp_tb_get_flags(old_tb)));
 	} while (++ci_off < ci_end);
 
 	folio->swap.val = 0;
@@ -369,7 +371,7 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
 	do {
 		old_tb = __swap_table_get(ci, ci_off);
 		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
-		__swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb)));
+		__swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_flags(old_tb)));
 	} while (++ci_off < ci_end);
 
 	/*
diff --git a/mm/swap_table.h b/mm/swap_table.h
index b4e1100f8296..e6613e62f8d0 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -26,12 +26,14 @@ struct swap_memcg_table {
  * Swap table entry type and bits layouts:
  *
  * NULL:     |---------------- 0 ---------------| - Free slot
- * Shadow:   | SWAP_COUNT |---- SHADOW_VAL ---|1| - Swapped out slot
- * PFN:      | SWAP_COUNT |------ PFN -------|10| - Cached slot
+ * Shadow:   |SWAP_COUNT|Z|---- SHADOW_VAL ---|1| - Swapped out slot
+ * PFN:      |SWAP_COUNT|Z|------ PFN -------|10| - Cached slot
  * Pointer:  |----------- Pointer ----------|100| - (Unused)
  * Bad:      |------------- 1 -------------|1000| - Bad slot
  *
- * SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long.
+ * COUNT is `SWP_TB_COUNT_BITS` long, Z is the `SWP_TB_ZERO_FLAG` bit,
+ * and together they form the `SWP_TB_FLAGS_BITS` wide flags field.
+ * Each entry is an atomic long.
  *
  * Usages:
  *
@@ -54,14 +56,6 @@ struct swap_memcg_table {
  * - Bad: Swap slot is reserved, protects swap header or holes on swap devices.
  */
 
-#if defined(MAX_POSSIBLE_PHYSMEM_BITS)
-#define SWAP_CACHE_PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
-#elif defined(MAX_PHYSMEM_BITS)
-#define SWAP_CACHE_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
-#else
-#define SWAP_CACHE_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT)
-#endif
-
 /* NULL Entry, all 0 */
 #define SWP_TB_NULL		0UL
 
@@ -69,22 +63,26 @@ struct swap_memcg_table {
 #define SWP_TB_SHADOW_MARK	0b1UL
 
 /* Cached: PFN */
-#define SWP_TB_PFN_BITS		(SWAP_CACHE_PFN_BITS + SWP_TB_PFN_MARK_BITS)
+#define SWP_TB_PFN_BITS		(SWAP_CACHE_PFN_BITS + SWAP_CACHE_PFN_MARK_BITS)
 #define SWP_TB_PFN_MARK		0b10UL
-#define SWP_TB_PFN_MARK_BITS	2
-#define SWP_TB_PFN_MARK_MASK	(BIT(SWP_TB_PFN_MARK_BITS) - 1)
+#define SWP_TB_PFN_MARK_MASK	(BIT(SWAP_CACHE_PFN_MARK_BITS) - 1)
 
-/* SWAP_COUNT part for PFN or shadow, the width can be shrunk or extended */
-#define SWP_TB_COUNT_BITS	min(4, BITS_PER_LONG - SWP_TB_PFN_BITS)
+/* Flags: For PFN or shadow, contains SWAP_COUNT, width changes */
+#define SWP_TB_FLAGS_BITS	min(5, BITS_PER_LONG - SWP_TB_PFN_BITS)
+#define SWP_TB_COUNT_BITS	(SWP_TB_FLAGS_BITS - SWAP_TABLE_HAS_ZEROFLAG)
+#define SWP_TB_FLAGS_MASK	(~((~0UL) >> SWP_TB_FLAGS_BITS))
 #define SWP_TB_COUNT_MASK	(~((~0UL) >> SWP_TB_COUNT_BITS))
+#define SWP_TB_FLAGS_SHIFT	(BITS_PER_LONG - SWP_TB_FLAGS_BITS)
 #define SWP_TB_COUNT_SHIFT	(BITS_PER_LONG - SWP_TB_COUNT_BITS)
 #define SWP_TB_COUNT_MAX	((1 << SWP_TB_COUNT_BITS) - 1)
+/* The first flag is zero bit (SWAP_TABLE_HAS_ZEROFLAG) */
+#define SWP_TB_ZERO_FLAG	BIT(BITS_PER_LONG - SWP_TB_FLAGS_BITS)
 
 /* Bad slot: ends with 0b1000 and rests of bits are all 1 */
 #define SWP_TB_BAD		((~0UL) << 3)
 
 /* Macro for shadow offset calculation */
-#define SWAP_COUNT_SHIFT	SWP_TB_COUNT_BITS
+#define SWAP_COUNT_SHIFT	SWP_TB_FLAGS_BITS
 
 /*
  * Helpers for casting one type of info into a swap table entry.
@@ -102,40 +100,47 @@ static inline unsigned long __count_to_swp_tb(unsigned char count)
 	 * used (count > 0 && count < SWP_TB_COUNT_MAX), and
 	 * overflow (count == SWP_TB_COUNT_MAX).
 	 */
-	BUILD_BUG_ON(SWP_TB_COUNT_MAX < 2 || SWP_TB_COUNT_BITS < 2);
+	BUILD_BUG_ON(SWP_TB_COUNT_BITS < SWAP_COUNT_MIN_BITS);
 	VM_WARN_ON(count > SWP_TB_COUNT_MAX);
 	return ((unsigned long)count) << SWP_TB_COUNT_SHIFT;
 }
 
-static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned int count)
+static inline unsigned long __flags_to_swp_tb(unsigned char flags)
+{
+	BUILD_BUG_ON(SWP_TB_FLAGS_BITS > BITS_PER_BYTE);
+	VM_WARN_ON(flags >> SWP_TB_FLAGS_BITS);
+	return ((unsigned long)flags) << SWP_TB_FLAGS_SHIFT;
+}
+
+static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned char flags)
 {
 	unsigned long swp_tb;
 
 	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
 	BUILD_BUG_ON(SWAP_CACHE_PFN_BITS >
-		     (BITS_PER_LONG - SWP_TB_PFN_MARK_BITS - SWP_TB_COUNT_BITS));
+		     (BITS_PER_LONG - SWAP_CACHE_PFN_MARK_BITS - SWP_TB_FLAGS_BITS));
 
-	swp_tb = (pfn << SWP_TB_PFN_MARK_BITS) | SWP_TB_PFN_MARK;
-	VM_WARN_ON_ONCE(swp_tb & SWP_TB_COUNT_MASK);
+	swp_tb = (pfn << SWAP_CACHE_PFN_MARK_BITS) | SWP_TB_PFN_MARK;
+	VM_WARN_ON_ONCE(swp_tb & SWP_TB_FLAGS_MASK);
 
-	return swp_tb | __count_to_swp_tb(count);
+	return swp_tb | __flags_to_swp_tb(flags);
 }
 
-static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned int count)
+static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned char flags)
 {
-	return pfn_to_swp_tb(folio_pfn(folio), count);
+	return pfn_to_swp_tb(folio_pfn(folio), flags);
 }
 
-static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned int count)
+static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned char flags)
 {
 	BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
 		     BITS_PER_BYTE * sizeof(unsigned long));
 	BUILD_BUG_ON((unsigned long)xa_mk_value(0) != SWP_TB_SHADOW_MARK);
 
 	VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
-	VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_COUNT_MASK));
+	VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_FLAGS_MASK));
 
-	return (unsigned long)shadow | __count_to_swp_tb(count) | SWP_TB_SHADOW_MARK;
+	return (unsigned long)shadow | SWP_TB_SHADOW_MARK | __flags_to_swp_tb(flags);
 }
 
 /*
@@ -173,14 +178,14 @@ static inline bool swp_tb_is_countable(unsigned long swp_tb)
 static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
 {
 	VM_WARN_ON(!swp_tb_is_folio(swp_tb));
-	return pfn_folio((swp_tb & ~SWP_TB_COUNT_MASK) >> SWP_TB_PFN_MARK_BITS);
+	return pfn_folio((swp_tb & ~SWP_TB_FLAGS_MASK) >> SWAP_CACHE_PFN_MARK_BITS);
 }
 
 static inline void *swp_tb_to_shadow(unsigned long swp_tb)
 {
 	VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
 	/* No shift needed, xa_value is stored as it is in the lower bits. */
-	return (void *)(swp_tb & ~SWP_TB_COUNT_MASK);
+	return (void *)(swp_tb & ~SWP_TB_FLAGS_MASK);
 }
 
 static inline unsigned char __swp_tb_get_count(unsigned long swp_tb)
@@ -189,6 +194,12 @@ static inline unsigned char __swp_tb_get_count(unsigned long swp_tb)
 	return ((swp_tb & SWP_TB_COUNT_MASK) >> SWP_TB_COUNT_SHIFT);
 }
 
+static inline unsigned char __swp_tb_get_flags(unsigned long swp_tb)
+{
+	VM_WARN_ON(!swp_tb_is_countable(swp_tb));
+	return ((swp_tb & SWP_TB_FLAGS_MASK) >> SWP_TB_FLAGS_SHIFT);
+}
+
 static inline int swp_tb_get_count(unsigned long swp_tb)
 {
 	if (swp_tb_is_countable(swp_tb))
@@ -253,6 +264,50 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
 	return swp_tb;
 }
 
+static inline void __swap_table_set_zero(struct swap_cluster_info *ci,
+					 unsigned int ci_off)
+{
+#if SWAP_TABLE_HAS_ZEROFLAG
+	unsigned long swp_tb = __swap_table_get(ci, ci_off);
+
+	BUILD_BUG_ON(SWP_TB_ZERO_FLAG & ~SWP_TB_FLAGS_MASK);
+	VM_WARN_ON(!swp_tb_is_countable(swp_tb));
+	swp_tb |= SWP_TB_ZERO_FLAG;
+	__swap_table_set(ci, ci_off, swp_tb);
+#else
+	lockdep_assert_held(&ci->lock);
+	__set_bit(ci_off, ci->zero_bitmap);
+#endif
+}
+
+static inline bool __swap_table_test_zero(struct swap_cluster_info *ci,
+					  unsigned int ci_off)
+{
+#if SWAP_TABLE_HAS_ZEROFLAG
+	unsigned long swp_tb = __swap_table_get(ci, ci_off);
+
+	VM_WARN_ON(!swp_tb_is_countable(swp_tb));
+	return !!(swp_tb & SWP_TB_ZERO_FLAG);
+#else
+	return test_bit(ci_off, ci->zero_bitmap);
+#endif
+}
+
+static inline void __swap_table_clear_zero(struct swap_cluster_info *ci,
+					   unsigned int ci_off)
+{
+#if SWAP_TABLE_HAS_ZEROFLAG
+	unsigned long swp_tb = __swap_table_get(ci, ci_off);
+
+	VM_WARN_ON(!swp_tb_is_countable(swp_tb));
+	swp_tb &= ~SWP_TB_ZERO_FLAG;
+	__swap_table_set(ci, ci_off, swp_tb);
+#else
+	lockdep_assert_held(&ci->lock);
+	__clear_bit(ci_off, ci->zero_bitmap);
+#endif
+}
+
 #ifdef CONFIG_MEMCG
 static inline void __swap_cgroup_set(struct swap_cluster_info *ci,
 				     unsigned int ci_off, unsigned long nr, unsigned short id)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 095e9c953e49..a9a1e477fec9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -427,6 +427,11 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
 	ci->memcg_table = NULL;
 #endif
 
+#if !SWAP_TABLE_HAS_ZEROFLAG
+	kfree(ci->zero_bitmap);
+	ci->zero_bitmap = NULL;
+#endif
+
 	table = (struct swap_table *)rcu_access_pointer(ci->table);
 	if (!table)
 		return;
@@ -469,13 +474,21 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
 		VM_WARN_ON_ONCE(ci->memcg_table);
 		ci->memcg_table = kzalloc_obj(*ci->memcg_table, gfp);
 		if (!ci->memcg_table)
-			ret = -ENOMEM;
+			goto err_free;
 	}
 #endif
-	if (ret)
-		swap_cluster_free_table(ci);
 
-	return ret;
+#if !SWAP_TABLE_HAS_ZEROFLAG
+	VM_WARN_ON_ONCE(ci->zero_bitmap);
+	ci->zero_bitmap = bitmap_zalloc(SWAPFILE_CLUSTER, gfp);
+	if (!ci->zero_bitmap)
+		goto err_free;
+#endif
+	return 0;
+
+err_free:
+	swap_cluster_free_table(ci);
+	return -ENOMEM;
 }
 
 /*
@@ -928,8 +941,8 @@ static bool __swap_cluster_alloc_entries(struct swap_info_struct *si,
 		order = 0;
 		nr_pages = 1;
 		swap_cluster_assert_empty(ci, ci_off, 1, false);
-		/* Sets a fake shadow as placeholder */
-		__swap_table_set(ci, ci_off, shadow_to_swp_tb(NULL, 1));
+		/* Fake shadow placeholder with no flag, hibernation does not use the zeromap */
+		__swap_table_set(ci, ci_off, __swp_tb_mk_count(shadow_to_swp_tb(NULL, 0), 1));
 	} else {
 		/* Allocation without folio is only possible with hibernation */
 		WARN_ON_ONCE(1);
@@ -1302,14 +1315,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	void (*swap_slot_free_notify)(struct block_device *, unsigned long);
 	unsigned int i;
 
-	/*
-	 * Use atomic clear_bit operations only on zeromap instead of non-atomic
-	 * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes.
-	 */
-	for (i = 0; i < nr_entries; i++) {
-		clear_bit(offset + i, si->zeromap);
+	for (i = 0; i < nr_entries; i++)
 		zswap_invalidate(swp_entry(si->type, offset + i));
-	}
 
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
@@ -1894,7 +1901,11 @@ void __swap_cluster_free_entries(struct swap_info_struct *si,
 		 * ref, or after swap cache is dropped
 		 */
 		VM_WARN_ON(!swp_tb_is_shadow(old_tb) || __swp_tb_get_count(old_tb) > 1);
+
+		/* Resetting the slot to NULL also clears the inline flags. */
 		__swap_table_set(ci, ci_off, null_to_swp_tb());
+		if (!SWAP_TABLE_HAS_ZEROFLAG)
+			__swap_table_clear_zero(ci, ci_off);
 
 		/*
 		 * Uncharge swap slots by memcg in batches. Consecutive
@@ -3088,7 +3099,6 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
-	unsigned long *zeromap;
 	struct swap_cluster_info *cluster_info;
 	struct file *swap_file, *victim;
 	struct address_space *mapping;
@@ -3184,8 +3194,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
-	zeromap = p->zeromap;
-	p->zeromap = NULL;
 	maxpages = p->max;
 	cluster_info = p->cluster_info;
 	p->max = 0;
@@ -3197,7 +3205,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
-	kvfree(zeromap);
 	free_swap_cluster_info(cluster_info, maxpages);
 
 	inode = mapping->host;
@@ -3729,17 +3736,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (error)
 		goto bad_swap_unlock_inode;
 
-	/*
-	 * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
-	 * be above MAX_PAGE_ORDER incase of a large swap file.
-	 */
-	si->zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(long),
-				     GFP_KERNEL | __GFP_ZERO);
-	if (!si->zeromap) {
-		error = -ENOMEM;
-		goto bad_swap_unlock_inode;
-	}
-
 	if (si->bdev && bdev_stable_writes(si->bdev))
 		si->flags |= SWP_STABLE_WRITES;
 
@@ -3841,8 +3837,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	destroy_swap_extents(si, swap_file);
 	free_swap_cluster_info(si->cluster_info, si->max);
 	si->cluster_info = NULL;
-	kvfree(si->zeromap);
-	si->zeromap = NULL;
 	/*
 	 * Clear the SWP_USED flag after all resources are freed so
 	 * alloc_swap_info can reuse this si safely.
-- 
2.54.0