From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:30 +0800
Subject: [PATCH v5 01/19] mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio
Message-Id: <20251220-swap-table-p2-v5-1-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
    Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
    Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes,
    "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

__read_swap_cache_async() is widely used to allocate a folio and ensure
it is in the swap cache, or to return the folio if one is already there.
It is not async, and it does not do any read. Rename it to better
reflect its usage, and prepare for it to be reworked as part of the new
swap cache APIs. Also, add some comments for the function.

Worth noting that the skip_if_exists argument is a long-standing
workaround that will be dropped soon.
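For orientation, the call-site shape after the rename, condensed from
the read_swap_cache_async() hunk below (error handling trimmed; this is
a sketch of the existing flow, not a new API):

	mpol = get_vma_policy(vma, addr, 0, &ilx);
	/* Returns the new or already-cached folio, or NULL on failure. */
	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
				       &page_allocated, false);
	mpol_cond_put(mpol);
	if (folio && page_allocated)
		swap_read_folio(folio, plug);	/* caller initiates the IO */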
Reviewed-by: Yosry Ahmed
Acked-by: Chris Li
Reviewed-by: Barry Song
Reviewed-by: Nhat Pham
Reviewed-by: Baoquan He
Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/swap.h       |  6 +++---
 mm/swap_state.c | 46 +++++++++++++++++++++++++++++++++-------------
 mm/swapfile.c   |  2 +-
 mm/zswap.c      |  4 ++--
 4 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index d034c13d8dd2..0fff92e42cfe 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -249,6 +249,9 @@
 struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
 void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
 void swap_cache_del_folio(struct folio *folio);
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
+				     struct mempolicy *mpol, pgoff_t ilx,
+				     bool *alloced, bool skip_if_exists);
 /* Below helpers require the caller to lock and pass in the swap cluster. */
 void __swap_cache_del_folio(struct swap_cluster_info *ci,
 		struct folio *folio, swp_entry_t entry, void *shadow);
@@ -261,9 +264,6 @@
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug);
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
-		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
-		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5f97c6ae70a2..08252eaef32f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,9 +402,29 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 	}
 }
 
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
-		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
-		bool skip_if_exists)
+/**
+ * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
+ * @entry: the swapped out swap entry to be bound to the folio.
+ * @gfp_mask: memory allocation flags
+ * @mpol: NUMA memory allocation policy to be applied
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @new_page_allocated: set to true if allocation happened, false otherwise
+ * @skip_if_exists: if the slot is in a partially cached state, return NULL.
+ *                  This is a workaround that will be removed shortly.
+ *
+ * Allocate a folio in the swap cache for one swap slot, typically before
+ * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
+ * @entry must have a non-zero swap count (swapped out).
+ * Currently only supports order 0.
+ *
+ * Context: Caller must protect the swap device with a reference count or locks.
+ * Return: Returns the existing folio if @entry is cached already. Returns
+ * NULL on failure due to -ENOMEM, or if @entry has a swap count < 1.
+ */
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
+				     struct mempolicy *mpol, pgoff_t ilx,
+				     bool *new_page_allocated,
+				     bool skip_if_exists)
 {
 	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct folio *folio;
@@ -452,12 +472,12 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			goto put_and_return;
 
 		/*
-		 * Protect against a recursive call to __read_swap_cache_async()
+		 * Protect against a recursive call to swap_cache_alloc_folio()
 		 * on the same entry waiting forever here because SWAP_HAS_CACHE
 		 * is set but the folio is not the swap cache yet. This can
 		 * happen today if mem_cgroup_swapin_charge_folio() below
 		 * triggers reclaim through zswap, which may call
-		 * __read_swap_cache_async() in the writeback path.
+		 * swap_cache_alloc_folio() in the writeback path.
 		 */
 		if (skip_if_exists)
 			goto put_and_return;
@@ -466,7 +486,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * We might race against __swap_cache_del_folio(), and
 		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
 		 * has not yet been cleared. Or race against another
-		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
+		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
 		 * in swap_map, but not yet added its folio to swap cache.
 		 */
 		schedule_timeout_uninterruptible(1);
@@ -525,7 +545,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		return NULL;
 
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
-	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
 			&page_allocated, false);
 	mpol_cond_put(mpol);
 
@@ -643,9 +663,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
-		folio = __read_swap_cache_async(
-				swp_entry(swp_type(entry), offset),
-				gfp_mask, mpol, ilx, &page_allocated, false);
+		folio = swap_cache_alloc_folio(
+				swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
+				&page_allocated, false);
 		if (!folio)
 			continue;
 		if (page_allocated) {
@@ -662,7 +682,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
 	/* The page was likely read above, so no need for plugging here */
-	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
 			&page_allocated, false);
 	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
@@ -767,7 +787,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 			if (!si)
 				continue;
 		}
-		folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+		folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
 				&page_allocated, false);
 		if (si)
 			put_swap_device(si);
@@ -789,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	lru_add_drain();
 skip:
 	/* The folio was likely read above, so no need for plugging here */
-	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
+	folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
 			&page_allocated, false);
 	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46d2008e4b99..e5284067a442 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1574,7 +1574,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
  * CPU1				CPU2
  * do_swap_page()
  *   ...			swapoff+swapon
- *				__read_swap_cache_async()
+ *				swap_cache_alloc_folio()
  *				  swapcache_prepare()
  *				    __swap_duplicate()
  *				      // check swap_map

diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958..a7a2443912f4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1014,8 +1014,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 		return -EEXIST;
 
 	mpol = get_task_policy(current);
-	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
-			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+	folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
+			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
 	put_swap_device(si);
 	if (!folio)
 		return -ENOMEM;
-- 
2.52.0
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:31 +0800
Subject: [PATCH v5 02/19] mm, swap: split swap cache preparation loop into a standalone helper
Message-Id: <20251220-swap-table-p2-v5-2-8862a265a033@tencent.com>

To prepare for the removal of swap cache bypass during swapin, introduce
a new helper that accepts an allocated and charged fresh folio, prepares
the folio and the swap map, and then adds the folio to the swap cache.

This doesn't change how the swap cache works yet: we still depend on
SWAP_HAS_CACHE in the swap map for synchronization. But all the
synchronization hacks are now contained in this single helper.

No feature change.
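In short, the helper's contract, as used by the reworked
swap_cache_alloc_folio() below (a condensed sketch of this patch's own
code, not a separate API):

	/* Allocate and charge a fresh folio, then hand it to the helper. */
	folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
	if (!folio)
		return NULL;
	/* Pins the swap map with SWAP_HAS_CACHE and inserts the folio. */
	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
					      false, skip_if_exists);
	if (result == folio)		/* our folio won: a fresh cache entry */
		*new_page_allocated = true;
	else				/* existing folio, or NULL on a race */
		folio_put(folio);
	return result;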
Acked-by: Chris Li
Reviewed-by: Barry Song
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 mm/swap_state.c | 197 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 109 insertions(+), 88 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 08252eaef32f..a8511ce43242 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,6 +402,97 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 	}
 }
 
+/**
+ * __swap_cache_prepare_and_add - Prepare the folio and add it to swap cache.
+ * @entry: swap entry to be bound to the folio.
+ * @folio: folio to be added.
+ * @gfp: memory allocation flags for charge, can be 0 if @charged is true.
+ * @charged: if the folio is already charged.
+ * @skip_if_exists: if the slot is in a cached state, return NULL.
+ *                  This is an old workaround that will be removed shortly.
+ *
+ * Update the swap_map and add folio as swap cache, typically before swapin.
+ * All swap slots covered by the folio must have a non-zero swap count.
+ *
+ * Context: Caller must protect the swap device with a reference count or locks.
+ * Return: Returns the folio being added on success. Returns the existing folio
+ * if @entry is already cached. Returns NULL if raced with swapin or swapoff.
+ */
+static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
+						  struct folio *folio,
+						  gfp_t gfp, bool charged,
+						  bool skip_if_exists)
+{
+	struct folio *swapcache;
+	void *shadow;
+	int ret;
+
+	/*
+	 * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio
+	 * into the swap cache. Loop with a schedule delay if raced with
+	 * another process setting SWAP_HAS_CACHE. This hackish loop will
+	 * be fixed very soon.
+	 */
+	for (;;) {
+		ret = swapcache_prepare(entry, folio_nr_pages(folio));
+		if (!ret)
+			break;
+
+		/*
+		 * skip_if_exists protects against a recursive call to this
+		 * helper on the same entry waiting forever here because
+		 * SWAP_HAS_CACHE is set but the folio is not in the swap
+		 * cache yet. This can happen today if
+		 * mem_cgroup_swapin_charge_folio() below triggers reclaim
+		 * through zswap, which may call this helper again in the
+		 * writeback path.
+		 *
+		 * Large order allocation also needs special handling on
+		 * race: if a smaller folio exists in cache, swapin needs
+		 * to fall back to order 0, and doing a swap cache lookup
+		 * might return a folio that is irrelevant to the faulting
+		 * entry because @entry is aligned down. Just return NULL.
+		 */
+		if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
+			return NULL;
+
+		/*
+		 * Check the swap cache again; we can only arrive
+		 * here because swapcache_prepare returned -EEXIST.
+		 */
+		swapcache = swap_cache_get_folio(entry);
+		if (swapcache)
+			return swapcache;
+
+		/*
+		 * We might race against __swap_cache_del_folio(), and
+		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
+		 * has not yet been cleared. Or race against another
+		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
+		 * in swap_map, but not yet added its folio to swap cache.
+		 */
+		schedule_timeout_uninterruptible(1);
+	}
+
+	__folio_set_locked(folio);
+	__folio_set_swapbacked(folio);
+
+	if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
+		put_swap_folio(folio, entry);
+		folio_unlock(folio);
+		return NULL;
+	}
+
+	swap_cache_add_folio(folio, entry, &shadow);
+	memcg1_swapin(entry, folio_nr_pages(folio));
+	if (shadow)
+		workingset_refault(folio, shadow);
+
+	/* Caller will initiate read into locked folio */
+	folio_add_lru(folio);
+	return folio;
+}
+
 /**
  * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
  * @entry: the swapped out swap entry to be bound to the folio.
@@ -428,99 +519,29 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
 {
 	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct folio *folio;
-	struct folio *new_folio = NULL;
 	struct folio *result = NULL;
-	void *shadow = NULL;
 
 	*new_page_allocated = false;
-	for (;;) {
-		int err;
-
-		/*
-		 * Check the swap cache first, if a cached folio is found,
-		 * return it unlocked. The caller will lock and check it.
-		 */
-		folio = swap_cache_get_folio(entry);
-		if (folio)
-			goto got_folio;
-
-		/*
-		 * Just skip read ahead for unused swap slot.
-		 */
-		if (!swap_entry_swapped(si, entry))
-			goto put_and_return;
-
-		/*
-		 * Get a new folio to read into from swap. Allocate it now if
-		 * new_folio not exist, before marking swap_map SWAP_HAS_CACHE,
-		 * when -EEXIST will cause any racers to loop around until we
-		 * add it to cache.
-		 */
-		if (!new_folio) {
-			new_folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
-			if (!new_folio)
-				goto put_and_return;
-		}
-
-		/*
-		 * Swap entry may have been freed since our caller observed it.
-		 */
-		err = swapcache_prepare(entry, 1);
-		if (!err)
-			break;
-		else if (err != -EEXIST)
-			goto put_and_return;
-
-		/*
-		 * Protect against a recursive call to swap_cache_alloc_folio()
-		 * on the same entry waiting forever here because SWAP_HAS_CACHE
-		 * is set but the folio is not the swap cache yet. This can
-		 * happen today if mem_cgroup_swapin_charge_folio() below
-		 * triggers reclaim through zswap, which may call
-		 * swap_cache_alloc_folio() in the writeback path.
-		 */
-		if (skip_if_exists)
-			goto put_and_return;
+	/* Check the swap cache again for readahead path. */
+	folio = swap_cache_get_folio(entry);
+	if (folio)
+		return folio;
 
-		/*
-		 * We might race against __swap_cache_del_folio(), and
-		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
-		 * has not yet been cleared. Or race against another
-		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
-		 * in swap_map, but not yet added its folio to swap cache.
-		 */
-		schedule_timeout_uninterruptible(1);
-	}
-
-	/*
-	 * The swap entry is ours to swap in. Prepare the new folio.
-	 */
-	__folio_set_locked(new_folio);
-	__folio_set_swapbacked(new_folio);
-
-	if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
-		goto fail_unlock;
-
-	swap_cache_add_folio(new_folio, entry, &shadow);
-	memcg1_swapin(entry, 1);
+	/* Skip allocation for unused swap slot for readahead path.
 */
+	if (!swap_entry_swapped(si, entry))
+		return NULL;
 
-	if (shadow)
-		workingset_refault(new_folio, shadow);
-
-	/* Caller will initiate read into locked new_folio */
-	folio_add_lru(new_folio);
-	*new_page_allocated = true;
-	folio = new_folio;
-got_folio:
-	result = folio;
-	goto put_and_return;
-
-fail_unlock:
-	put_swap_folio(new_folio, entry);
-	folio_unlock(new_folio);
-put_and_return:
-	if (!(*new_page_allocated) && new_folio)
-		folio_put(new_folio);
+	/* Allocate a new folio to be added into the swap cache. */
+	folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+	if (!folio)
+		return NULL;
+	/* Try to add the new folio; returns existing folio or NULL on failure. */
+	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
					      false, skip_if_exists);
+	if (result == folio)
+		*new_page_allocated = true;
+	else
		folio_put(folio);
 	return result;
 }
 
-- 
2.52.0
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:32 +0800
Subject: [PATCH v5 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
Message-Id: <20251220-swap-table-p2-v5-3-8862a265a033@tencent.com>

Now the overhead of the swap cache is trivial, and bypassing the swap
cache is no longer a valid optimization, so unify the swapin path to
always use the swap cache. This changes the swapin behavior in two
observable ways.
First, readahead is now always disabled for SWP_SYNCHRONOUS_IO devices,
which is a huge win for some workloads. We used to rely on
`SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as the indicator to
bypass both the swap cache and readahead, but the swap count check made
bypassing ineffective in many cases, and it's not a good indicator
anyway. The limitation existed because the current swap design made it
hard to decouple readahead bypassing from swap cache bypassing: we do
want to always bypass readahead for SWP_SYNCHRONOUS_IO devices, but
bypassing the swap cache at the same time causes repeated IO and memory
overhead. Now that swap cache bypassing is gone, the swap count check
can be dropped.

Second, this enables large swapin for all swap entries on
SWP_SYNCHRONOUS_IO devices. Previously, large swapin was also coupled
with swap cache bypassing, so the swap count check made large swapin
less effective as well. Now large swapin is supported for all
SWP_SYNCHRONOUS_IO cases.

To catch potential issues with large swapin, especially around page
exclusiveness and the swap cache, more debug sanity checks and comments
are added. But overall, the code is simpler, and the new helpers and
routines will be used by other components in later commits too. It is
now also possible to rely on the swap cache layer for resolving
synchronization issues, which will be done in a later commit.

Worth mentioning that for a large folio workload, this may cause more
serious thrashing. This isn't a problem introduced by this commit, but
a generic large folio issue. For a 4K workload, this commit improves
performance.
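Schematically, the SWP_SYNCHRONOUS_IO swapin path in do_swap_page()
reduces to the following (condensed from the hunks below; a sketch of
this patch's flow, not additional code):

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
		/* No readahead: allocate a (possibly large) folio directly. */
		folio = alloc_swap_folio(vmf);
		if (folio) {
			/*
			 * swapin_folio() adds the folio to the swap cache and
			 * reads it; it may return an existing cache folio, or
			 * NULL when a large-folio swapin lost a race.
			 */
			swapcache = swapin_folio(entry, folio);
			if (swapcache != folio)
				folio_put(folio);
			folio = swapcache;
		}
	} else {
		folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
	}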
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 mm/memory.c     | 137 +++++++++++++++++++++-----------------------------
 mm/swap.h       |   6 +++
 mm/swap_state.c |  27 ++++++++++
 3 files changed, 85 insertions(+), 85 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ee15303c4041..3d6ab2689b5e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4608,7 +4608,16 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
+/* Sanity check that a folio is fully exclusive */
+static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
+				 unsigned int nr_pages)
+{
+	/* Called with the PT lock and folio lock held; the swap count is stable */
+	do {
+		VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) != 1, folio);
+		entry.val++;
+	} while (--nr_pages);
+}
 
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
@@ -4621,17 +4630,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct folio *swapcache, *folio = NULL;
-	DECLARE_WAITQUEUE(wait, current);
+	struct folio *swapcache = NULL, *folio;
 	struct page *page;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
-	bool need_clear_cache = false;
 	bool exclusive = false;
 	softleaf_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
-	void *shadow = NULL;
 	int nr_pages;
 	unsigned long page_idx;
 	unsigned long address;
@@ -4702,57 +4708,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	folio = swap_cache_get_folio(entry);
 	if (folio)
 		swap_update_readahead(folio, vma, vmf->address);
-	swapcache = folio;
 	if (!folio) {
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
-		    __swap_count(entry) == 1) {
-			/* skip swapcache */
+		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			folio = alloc_swap_folio(vmf);
 			if (folio) {
-				__folio_set_locked(folio);
-				__folio_set_swapbacked(folio);
-
-				nr_pages = folio_nr_pages(folio);
-				if (folio_test_large(folio))
-					entry.val = ALIGN_DOWN(entry.val, nr_pages);
 				/*
-				 * Prevent parallel swapin from proceeding with
-				 * the cache flag. Otherwise, another thread
-				 * may finish swapin first, free the entry, and
-				 * swapout reusing the same entry. It's
-				 * undetectable as pte_same() returns true due
-				 * to entry reuse.
+				 * folio is charged, so swapin can only fail due
+				 * to a raced swapin and return NULL.
 				 */
-				if (swapcache_prepare(entry, nr_pages)) {
-					/*
-					 * Relax a bit to prevent rapid
-					 * repeated page faults.
-					 */
-					add_wait_queue(&swapcache_wq, &wait);
-					schedule_timeout_uninterruptible(1);
-					remove_wait_queue(&swapcache_wq, &wait);
-					goto out_page;
-				}
-				need_clear_cache = true;
-
-				memcg1_swapin(entry, nr_pages);
-
-				shadow = swap_cache_get_shadow(entry);
-				if (shadow)
-					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
-
-				/* To provide entry to swap_read_folio() */
-				folio->swap = entry;
-				swap_read_folio(folio, NULL);
-				folio->private = NULL;
+				swapcache = swapin_folio(entry, folio);
+				if (swapcache != folio)
+					folio_put(folio);
+				folio = swapcache;
 			}
 		} else {
-			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						vmf);
-			swapcache = folio;
+			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
 		}
 
 		if (!folio) {
@@ -4774,6 +4744,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
 		}
 
+		swapcache = folio;
 		ret |= folio_lock_or_retry(folio, vmf);
 		if (ret & VM_FAULT_RETRY)
 			goto out_release;
@@ -4843,24 +4814,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_nomap;
 	}
 
-	/* allocated large folios for SWP_SYNCHRONOUS_IO */
-	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
-		unsigned long nr = folio_nr_pages(folio);
-		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
-		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
-		pte_t *folio_ptep = vmf->pte - idx;
-		pte_t folio_pte = ptep_get(folio_ptep);
-
-		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
-		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
-			goto out_nomap;
-
-		page_idx = idx;
-		address = folio_start;
-		ptep = folio_ptep;
-		goto check_folio;
-	}
-
 	nr_pages = 1;
 	page_idx = 0;
 	address = vmf->address;
@@ -4904,12 +4857,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
 	BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
 
+	/*
+	 * If a large folio already belongs to an anon mapping, we
+	 * can just go on and map it partially.
+	 * If not, and the large swapin check above failed, the page table
+	 * has changed, so sub pages might have been charged to the wrong
+	 * cgroup, or the folio might even belong to shmem. So we have to
+	 * free it and fall back. Nothing should have touched it; both the
+	 * anon and shmem paths check whether a large folio is fully
+	 * applicable before use.
+	 *
+	 * This will be removed once we unify folio allocation in the swap cache
+	 * layer, where allocation of a folio stabilizes the swap entries.
+	 */
+	if (!folio_test_anon(folio) && folio_test_large(folio) &&
+	    nr_pages != folio_nr_pages(folio)) {
+		if (!WARN_ON_ONCE(folio_test_dirty(folio)))
+			swap_cache_del_folio(folio);
+		goto out_nomap;
+	}
+
 	/*
 	 * Check under PT lock (to protect against concurrent fork() sharing
 	 * the swap entry concurrently) for certainly exclusive pages.
 	 */
 	if (!folio_test_ksm(folio)) {
+		/*
+		 * The can_swapin_thp check above ensures all PTEs have
+		 * the same exclusiveness, so checking just one PTE is fine.
+		 */
 		exclusive = pte_swp_exclusive(vmf->orig_pte);
+		if (exclusive)
+			check_swap_exclusive(folio, entry, nr_pages);
 		if (folio != swapcache) {
 			/*
 			 * We have a fresh page that is not exposed to the
@@ -4987,18 +4965,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	vmf->orig_pte = pte_advance_pfn(pte, page_idx);
 
 	/* ksm created a completely new copy */
-	if (unlikely(folio != swapcache && swapcache)) {
+	if (unlikely(folio != swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
 	} else if (!folio_test_anon(folio)) {
 		/*
-		 * We currently only expect small !anon folios which are either
-		 * fully exclusive or fully shared, or new allocated large
-		 * folios which are fully exclusive. If we ever get large
-		 * folios within swapcache here, we have to be careful.
+		 * We currently only expect !anon folios that are fully
+		 * mappable. See the comment after can_swapin_thp above.
 		 */
-		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
-		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
+		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
 	} else {
 		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
@@ -5038,12 +5014,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
-	/* Clear the swap cache pin for direct swapin after PTL unlock */
-	if (need_clear_cache) {
-		swapcache_clear(si, entry, nr_pages);
-		if (waitqueue_active(&swapcache_wq))
-			wake_up(&swapcache_wq);
-	}
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -5051,6 +5021,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out_page:
+	if (folio_test_swapcache(folio))
+		folio_free_swap(folio);
 	folio_unlock(folio);
 out_release:
 	folio_put(folio);
@@ -5058,11 +5030,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_unlock(swapcache);
 		folio_put(swapcache);
 	}
-	if (need_clear_cache) {
-		swapcache_clear(si, entry, nr_pages);
-		if (waitqueue_active(&swapcache_wq))
-			wake_up(&swapcache_wq);
-	}
 	if (si)
 		put_swap_device(si);
 	return ret;

diff --git a/mm/swap.h b/mm/swap.h
index 0fff92e42cfe..214e7d041030 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -268,6 +268,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
 		struct vm_fault *vmf);
+struct folio *swapin_folio(swp_entry_t entry, struct folio *folio);
 void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 		unsigned long addr);
 
@@ -386,6 +387,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 	return NULL;
 }
 
+static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
+{
+	return NULL;
+}
+
 static inline void swap_update_readahead(struct folio *folio,
 		struct vm_area_struct *vma, unsigned long addr)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index a8511ce43242..8c429dc33ca9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -545,6 +545,33 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
 	return result;
 }
 
+/**
+ * swapin_folio - swap in one or multiple entries, skipping readahead.
+ * @entry: starting swap entry to swap in
+ * @folio: a newly allocated and charged folio
+ *
+ * Reads @entry into @folio; @folio will be added to the swap cache.
+ * If @folio is a large folio, @entry will be rounded down to align
+ * with the folio size.
+ *
+ * Return: returns a pointer to @folio on success. If @folio is a large
+ * folio and this raced with another swapin, NULL is returned to allow
+ * fallback to order 0. Otherwise, if another folio was already added to
+ * the swap cache, that swap cache folio is returned instead.
+ */
+struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
+{
+	struct folio *swapcache;
+	pgoff_t offset = swp_offset(entry);
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
+	swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true, false);
+	if (swapcache == folio)
+		swap_read_folio(folio, NULL);
+	return swapcache;
+}
+
 /*
  * Locate a page of swap in physical memory, reserving swap cache space
  * and reading the disk if it is not already cached.
-- 
2.52.0
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:34 +0800
Subject: [PATCH v5 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
Message-Id: <20251220-swap-table-p2-v5-4-8862a265a033@tencent.com>
Now SWP_SYNCHRONOUS_IO devices also use the swap cache. One side effect
is that a folio may stay in the swap cache for a longer time due to lazy
freeing (vm_swap_full()). This can help save some CPU/IO if folios are
swapped out very frequently right after swapin, hence improving
performance. But the long pinning of swap slots also increases the
fragmentation rate of the swap device significantly, and currently all
in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also causes the
backing memory to be pinned, increasing memory pressure.

So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices after
swapin finishes. The swap cache has served its role as a synchronization
layer to prevent any parallel swapin from wasting CPU or memory
allocation, and the redundant IO is not a major concern for
SWP_SYNCHRONOUS_IO devices.

Worth noting, without this patch, this series so far can provide a ~30%
performance gain for certain workloads like MySQL or kernel compilation,
but causes significant regression or OOM under extreme global pressure.
With this patch, we still have a nice performance gain for most
workloads, without introducing any observable regressions. This is a
hint that further optimization can be done based on the new unified
swapin with swap cache, but for now, just keep the behaviour consistent
with before.
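Condensed, the new decision logic in should_try_to_free_swap() (a sketch
of the hunk below; the trailing heuristics are not shown in this hunk):

	if (!folio_test_swapcache(folio))
		return false;
	/* SWP_SYNCHRONOUS_IO: always drop the swap cache right after swapin. */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
		return true;
	/* Otherwise, keep the existing lazy-freeing heuristics. */
	if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
	    folio_test_mlocked(folio))
		return true;
	/* ...remaining checks unchanged (elided in the hunk below). */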
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 mm/memory.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3d6ab2689b5e..9e391a283946 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4354,12 +4354,26 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
-static inline bool should_try_to_free_swap(struct folio *folio,
+/*
+ * Check if we should call folio_free_swap to free the swap cache.
+ * folio_free_swap only frees the swap cache to release the slot if the
+ * swap count is zero, so we don't need to check the swap count here.
+ */
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+					   struct folio *folio,
 					   struct vm_area_struct *vma,
 					   unsigned int fault_flags)
 {
 	if (!folio_test_swapcache(folio))
 		return false;
+	/*
+	 * Always try to free the swap cache for SWP_SYNCHRONOUS_IO devices.
+	 * The swap cache can help save some IO or memory overhead, but these
+	 * devices are fast, and meanwhile, the swap cache pinning the slot,
+	 * deferring the release of metadata, and causing fragmentation are
+	 * more critical issues.
+	 */
+	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+		return true;
 	if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
 	    folio_test_mlocked(folio))
 		return true;
@@ -4931,7 +4945,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * yet.
 	 */
 	swap_free_nr(entry, nr_pages);
-	if (should_try_to_free_swap(folio, vma, vmf->flags))
+	if (should_try_to_free_swap(si, folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
 	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
-- 
2.52.0
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:34 +0800
Subject: [PATCH v5 05/19] mm, swap: simplify the code and reduce indention
Message-Id: <20251220-swap-table-p2-v5-5-8862a265a033@tencent.com>

Now that the swap cache is always used, multiple swap cache checks are
no longer useful; remove them and reduce the code indentation.

No behavior change.

Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 mm/memory.c | 89 +++++++++++++++++++++++++++++----------------------------
 1 file changed, 43 insertions(+), 46 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9e391a283946..ca54009cd586 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4764,55 +4764,52 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 
 	page = folio_file_page(folio, swp_offset(entry));
-	if (swapcache) {
-		/*
-		 * Make sure folio_free_swap() or swapoff did not release the
-		 * swapcache from under us. The page pin, and pte_same test
-		 * below, are not enough to exclude that. Even if it is still
-		 * swapcache, we need to check that the page's swap has not
-		 * changed.
-		 */
-		if (unlikely(!folio_matches_swap_entry(folio, entry)))
-			goto out_page;
-
-		if (unlikely(PageHWPoison(page))) {
-			/*
-			 * hwpoisoned dirty swapcache pages are kept for killing
-			 * owner processes (which may be unknown at hwpoison time)
-			 */
-			ret = VM_FAULT_HWPOISON;
-			goto out_page;
-		}
-
-		/*
-		 * KSM sometimes has to copy on read faults, for example, if
-		 * folio->index of non-ksm folios would be nonlinear inside the
-		 * anon VMA -- the ksm flag is lost on actual swapout.
-		 */
-		folio = ksm_might_need_to_copy(folio, vma, vmf->address);
-		if (unlikely(!folio)) {
-			ret = VM_FAULT_OOM;
-			folio = swapcache;
-			goto out_page;
-		} else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
-			ret = VM_FAULT_HWPOISON;
-			folio = swapcache;
-			goto out_page;
-		}
-		if (folio != swapcache)
-			page = folio_page(folio, 0);
+	/*
+	 * Make sure folio_free_swap() or swapoff did not release the
+	 * swapcache from under us. The page pin, and pte_same test
+	 * below, are not enough to exclude that. Even if it is still
+	 * swapcache, we need to check that the page's swap has not
+	 * changed.
+	 */
+	if (unlikely(!folio_matches_swap_entry(folio, entry)))
+		goto out_page;
 
+	if (unlikely(PageHWPoison(page))) {
 		/*
-		 * If we want to map a page that's in the swapcache writable, we
-		 * have to detect via the refcount if we're really the exclusive
-		 * owner. Try removing the extra reference from the local LRU
-		 * caches if required.
+		 * hwpoisoned dirty swapcache pages are kept for killing
+		 * owner processes (which may be unknown at hwpoison time)
 		 */
-		if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
-		    !folio_test_ksm(folio) && !folio_test_lru(folio))
-			lru_add_drain();
+		ret = VM_FAULT_HWPOISON;
+		goto out_page;
 	}
 
+	/*
+	 * KSM sometimes has to copy on read faults, for example, if
+	 * folio->index of non-ksm folios would be nonlinear inside the
+	 * anon VMA -- the ksm flag is lost on actual swapout.
+	 */
+	folio = ksm_might_need_to_copy(folio, vma, vmf->address);
+	if (unlikely(!folio)) {
+		ret = VM_FAULT_OOM;
+		folio = swapcache;
+		goto out_page;
+	} else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
+		ret = VM_FAULT_HWPOISON;
+		folio = swapcache;
+		goto out_page;
+	} else if (folio != swapcache)
+		page = folio_page(folio, 0);
+
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * owner. Try removing the extra reference from the local LRU
+	 * caches if required.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+	    !folio_test_ksm(folio) && !folio_test_lru(folio))
+		lru_add_drain();
+
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
@@ -5002,7 +4999,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				pte, pte, nr_pages);
 
 	folio_unlock(folio);
-	if (folio != swapcache && swapcache) {
+	if (unlikely(folio != swapcache)) {
 		/*
 		 * Hold the lock to avoid the swap entry to be reused
 		 * until we take the PT lock for the pte_same() check
@@ -5040,7 +5037,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	folio_unlock(folio);
 out_release:
 	folio_put(folio);
-	if (folio != swapcache && swapcache) {
+	if (folio != swapcache) {
 		folio_unlock(swapcache);
 		folio_put(swapcache);
 	}
-- 
2.52.0

From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:35 +0800
Subject: [PATCH v5 06/19] mm, swap: free the swap cache after folio is mapped
Message-Id: <20251220-swap-table-p2-v5-6-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

Currently, we remove the folio from the swap cache and free the swap
cache before mapping the PTE. To reduce repeated faults due to parallel
swapins of the same PTE, change it to remove the folio from the swap
cache after it is mapped, so new faults on the same swap PTE are much
more likely to see the folio in the swap cache and wait on it.

This does not eliminate all swapin races: an ongoing swapin fault may
still see an empty swap cache. That's harmless, because the PTE is
changed before the swap cache is cleared, so such a fault will just
return without triggering repeated faults; this change only reduces
the chance of hitting that window.
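The ordering difference can be sketched as a userspace toy model (not
kernel code; the struct and function names below are made up for
illustration, and only the relative order of the two stores matters):

/* Toy model of the two orderings; not kernel code. */
#include <stdbool.h>

struct toy_pte_state {
	bool pte_present;	/* stands in for the swap PTE being mapped */
	bool in_swap_cache;	/* stands in for the folio in swap cache */
};

/* Old order: a parallel fault running between the two stores sees
 * neither a mapped PTE nor a swap cache folio, so it performs a full
 * swapin only to find the PTE changed under the PT lock. */
static void finish_swapin_old(struct toy_pte_state *s)
{
	s->in_swap_cache = false;	/* 1: swap cache freed first */
	s->pte_present = true;		/* 2: PTE mapped second */
}

/* New order: in the same window a parallel fault still finds the folio
 * in the swap cache and waits on the folio lock instead. */
static void finish_swapin_new(struct toy_pte_state *s)
{
	s->pte_present = true;		/* 1: PTE mapped first */
	s->in_swap_cache = false;	/* 2: swap cache freed second */
}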
Reviewed-by: Baoquan He
Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/memory.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ca54009cd586..a4c58341c44a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 static inline bool should_try_to_free_swap(struct swap_info_struct *si,
 					   struct folio *folio,
 					   struct vm_area_struct *vma,
+					   unsigned int extra_refs,
 					   unsigned int fault_flags)
 {
 	if (!folio_test_swapcache(folio))
@@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
 	 * reference only in case it's likely that we'll be the exclusive user.
 	 */
 	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
-		folio_ref_count(folio) == (1 + folio_nr_pages(folio));
+		folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
 }
 
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
@@ -4936,15 +4937,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
-	/*
-	 * Remove the swap entry and conditionally try to free up the swapcache.
-	 * We're already holding a reference on the page but haven't mapped it
-	 * yet.
-	 */
-	swap_free_nr(entry, nr_pages);
-	if (should_try_to_free_swap(si, folio, vma, vmf->flags))
-		folio_free_swap(folio);
-
 	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
 	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
 	pte = mk_pte(page, vma->vm_page_prot);
@@ -4998,6 +4990,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	arch_do_swap_page_nr(vma->vm_mm, vma, address,
 			pte, pte, nr_pages);
 
+	/*
+	 * Remove the swap entry and conditionally try to free up the swapcache.
+	 * Do it after mapping, so raced page faults will likely see the folio
+	 * in swap cache and wait on the folio lock.
+	 */
+	swap_free_nr(entry, nr_pages);
+	if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
+		folio_free_swap(folio);
+
 	folio_unlock(folio);
 	if (unlikely(folio != swapcache)) {
 		/*
-- 
2.52.0

From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
To: linux-mm@kvack.org
Cc: Kairui Song, Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v5 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
Date: Sat, 20 Dec 2025 03:57:51 +0800
Message-ID: <20251219195751.61328-1-ryncsn@gmail.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>

From: Kairui Song

Now that the overhead of the swap cache is trivial to none, bypassing
the swap cache is no longer a good optimization. We have already
removed the cache-bypass swapin for anon memory; now do the same for
shmem. Many helpers and functions can be dropped as a result.

Performance may drop slightly because of the co-existence and double
update of swap_map and the swap table; this will be improved very soon
by later commits that partially drop the swap_map update. Swapin of a
24 GB file on tmpfs with transparent_hugepage_tmpfs=within_size and
ZRAM, 3 test runs on my machine:

Before:    After this commit:    After this series:
5.99s      6.29s                 6.08s

Later swap table phases will drop the swap_map completely to avoid the
overhead and reduce memory usage.
Reviewed-by: Baolin Wang
Tested-by: Baolin Wang
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 mm/shmem.c    | 65 +++++++++++++++++-----------------------------------
 mm/swap.h     |  4 ----
 mm/swapfile.c | 35 +++++++++-------------------
 3 files changed, 27 insertions(+), 77 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index dd136d40631c..d7eeeaa9580d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2014,10 +2014,9 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 		swp_entry_t entry, int order, gfp_t gfp)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct folio *new, *swapcache;
 	int nr_pages = 1 << order;
-	struct folio *new;
 	gfp_t alloc_gfp;
-	void *shadow;
 
 	/*
 	 * We have arrived here because our zones are constrained, so don't
@@ -2057,34 +2056,19 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 		goto fallback;
 	}
 
-	/*
-	 * Prevent parallel swapin from proceeding with the swap cache flag.
-	 *
-	 * Of course there is another possible concurrent scenario as well,
-	 * that is to say, the swap cache flag of a large folio has already
-	 * been set by swapcache_prepare(), while another thread may have
-	 * already split the large swap entry stored in the shmem mapping.
-	 * In this case, shmem_add_to_page_cache() will help identify the
-	 * concurrent swapin and return -EEXIST.
-	 */
-	if (swapcache_prepare(entry, nr_pages)) {
+	swapcache = swapin_folio(entry, new);
+	if (swapcache != new) {
 		folio_put(new);
-		new = ERR_PTR(-EEXIST);
-		/* Try smaller folio to avoid cache conflict */
-		goto fallback;
+		if (!swapcache) {
+			/*
+			 * The new folio is charged already, swapin can
+			 * only fail due to another raced swapin.
+			 */
+			new = ERR_PTR(-EEXIST);
+			goto fallback;
+		}
 	}
-
-	__folio_set_locked(new);
-	__folio_set_swapbacked(new);
-	new->swap = entry;
-
-	memcg1_swapin(entry, nr_pages);
-	shadow = swap_cache_get_shadow(entry);
-	if (shadow)
-		workingset_refault(new, shadow);
-	folio_add_lru(new);
-	swap_read_folio(new, NULL);
-	return new;
+	return swapcache;
 fallback:
 	/* Order 0 swapin failed, nothing to fallback to, abort */
 	if (!order)
@@ -2174,8 +2158,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 }
 
 static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
-					 struct folio *folio, swp_entry_t swap,
-					 bool skip_swapcache)
+					 struct folio *folio, swp_entry_t swap)
 {
 	struct address_space *mapping = inode->i_mapping;
 	swp_entry_t swapin_error;
@@ -2191,8 +2174,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 
 	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
-	if (!skip_swapcache)
-		swap_cache_del_folio(folio);
+	swap_cache_del_folio(folio);
 	/*
 	 * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
 	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
@@ -2292,7 +2274,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	softleaf_t index_entry;
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
-	bool skip_swapcache = false;
 	int error, nr_pages, order;
 	pgoff_t offset;
 
@@ -2335,7 +2316,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			folio = NULL;
 			goto failed;
 		}
-		skip_swapcache = true;
 	} else {
 		/* Cached swapin only supports order 0 folio */
 		folio = shmem_swapin_cluster(swap, gfp, info, index);
@@ -2391,9 +2371,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	 * and swap cache folios are never partially freed.
 	 */
 	folio_lock(folio);
-	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    shmem_confirm_swap(mapping, index, swap) < 0 ||
-	    folio->swap.val != swap.val) {
+	if (!folio_matches_swap_entry(folio, swap) ||
+	    shmem_confirm_swap(mapping, index, swap) < 0) {
 		error = -EEXIST;
 		goto unlock;
 	}
@@ -2425,12 +2404,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
-	if (skip_swapcache) {
-		folio->swap.val = 0;
-		swapcache_clear(si, swap, nr_pages);
-	} else {
-		swap_cache_del_folio(folio);
-	}
+	swap_cache_del_folio(folio);
 	folio_mark_dirty(folio);
 	swap_free_nr(swap, nr_pages);
 	put_swap_device(si);
@@ -2441,14 +2415,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (shmem_confirm_swap(mapping, index, swap) < 0)
 		error = -EEXIST;
 	if (error == -EIO)
-		shmem_set_folio_swapin_error(inode, index, folio, swap,
-					     skip_swapcache);
+		shmem_set_folio_swapin_error(inode, index, folio, swap);
 unlock:
 	if (folio)
 		folio_unlock(folio);
failed_nolock:
-	if (skip_swapcache)
-		swapcache_clear(si, folio->swap, folio_nr_pages(folio));
 	if (folio)
 		folio_put(folio);
 	put_swap_device(si);
diff --git a/mm/swap.h b/mm/swap.h
index 214e7d041030..e0f05babe13a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -403,10 +403,6 @@ static inline int swap_writeout(struct folio *folio,
 	return 0;
 }
 
-static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
-{
-}
-
 static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
 	return NULL;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e5284067a442..3762b8f3f9e9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1614,22 +1614,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	return NULL;
 }
 
-static void swap_entries_put_cache(struct swap_info_struct *si,
-				   swp_entry_t entry, int nr)
-{
-	unsigned long offset = swp_offset(entry);
-	struct swap_cluster_info *ci;
-
-	ci = swap_cluster_lock(si, offset);
-	if (swap_only_has_cache(si, offset, nr)) {
-		swap_entries_free(si, ci, entry, nr);
-	} else {
-		for (int i = 0; i < nr; i++, entry.val++)
-			swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
-	}
-	swap_cluster_unlock(ci);
-}
-
 static bool swap_entries_put_map(struct swap_info_struct *si,
 				 swp_entry_t entry, int nr)
 {
@@ -1765,13 +1749,21 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
 void put_swap_folio(struct folio *folio, swp_entry_t entry)
 {
 	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
 	int size = 1 << swap_entry_order(folio_order(folio));
 
 	si = _swap_info_get(entry);
 	if (!si)
 		return;
 
-	swap_entries_put_cache(si, entry, size);
+	ci = swap_cluster_lock(si, offset);
+	if (swap_only_has_cache(si, offset, size))
+		swap_entries_free(si, ci, entry, size);
+	else
+		for (int i = 0; i < size; i++, entry.val++)
+			swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+	swap_cluster_unlock(ci);
 }
 
 int __swap_count(swp_entry_t entry)
@@ -3784,15 +3776,6 @@ int swapcache_prepare(swp_entry_t entry, int nr)
 	return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
 }
 
-/*
- * Caller should ensure entries belong to the same folio so
- * the entries won't span cross cluster boundary.
- */
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
-{
-	swap_entries_put_cache(si, entry, nr);
-}
-
 /*
  * add_swap_count_continuation - called when a swap count is duplicated
  * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
-- 
2.52.0

From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:37 +0800
Subject: [PATCH v5 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM
Message-Id: <20251220-swap-table-p2-v5-8-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Nhat Pham

The SWAP_MAP_SHMEM state was introduced in commit aaa468653b4a
("swap_info: note SWAP_MAP_SHMEM") to quickly determine if a swap
entry belongs to shmem during swapoff. However, swapoff has since been
rewritten in commit b56a2d8af914 ("mm: rid swapoff of quadratic
complexity"). Now having a swap count == SWAP_MAP_SHMEM is basically
the same as having a swap count == 1, and swap_shmem_alloc() behaves
analogously to swap_duplicate().

The only difference of note is that swap_shmem_alloc() does not check
for -ENOMEM returned from __swap_duplicate(), but that is fine because
shmem never re-duplicates any swap entry it owns, so it will still be
safe if we use (batched) swap_duplicate() instead.

This commit adds swap_duplicate_nr(), the batched variant of
swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the
associated swap_shmem_alloc() helper to simplify the state machine
(both mentally and in terms of actual code). We also gain an extra
state/special value that can be repurposed (for swap entries that
never get re-duplicated).
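For reference, a standalone sketch of the swap_map byte layout this
state machine works on. Only SWAP_MAP_MAX and SWAP_MAP_BAD appear in
the hunk below; the values of SWAP_HAS_CACHE (0x40) and
COUNT_CONTINUED (0x80) are assumed from the upstream headers and may
differ by tree:

/* Standalone sketch; compiles with any C compiler. */
#include <stdio.h>

#define SWAP_HAS_CACHE	0x40	/* assumed: entry also held by swap cache */
#define COUNT_CONTINUED	0x80	/* assumed: count continues off-byte */
#define SWAP_MAP_MAX	0x3e	/* max in-place count */
#define SWAP_MAP_BAD	0x3f	/* bad slot */

static unsigned char swap_count(unsigned char ent)
{
	return ent & ~SWAP_HAS_CACHE;
}

int main(void)
{
	/* A shmem-owned, cached entry after this patch is a plain
	 * count of 1 plus the cache bit; no dedicated 0xbf state. */
	unsigned char ent = 1 | SWAP_HAS_CACHE;

	printf("count=%u cached=%d\n",
	       swap_count(ent), !!(ent & SWAP_HAS_CACHE));
	return 0;
}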
Signed-off-by: Nhat Pham
Reviewed-by: Baolin Wang
Tested-by: Baolin Wang
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 include/linux/swap.h | 15 +++++++--------
 mm/shmem.c           |  2 +-
 mm/swapfile.c        | 42 +++++++++++++++++-------------------------
 3 files changed, 25 insertions(+), 34 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38ca3df68716..bf72b548a96d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -230,7 +230,6 @@ enum {
 /* Special value in first swap_map */
 #define SWAP_MAP_MAX	0x3e	/* Max count */
 #define SWAP_MAP_BAD	0x3f	/* Note page is bad */
-#define SWAP_MAP_SHMEM	0xbf	/* Owned by shmem/tmpfs */
 
 /* Special value in each swap_map continuation */
 #define SWAP_CONT_MAX	0x7f	/* Max count */
@@ -458,8 +457,7 @@ bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t, int);
-extern int swap_duplicate(swp_entry_t);
+extern int swap_duplicate_nr(swp_entry_t entry, int nr);
 extern int swapcache_prepare(swp_entry_t entry, int nr);
 extern void swap_free_nr(swp_entry_t entry, int nr_pages);
 extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
@@ -514,11 +512,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
-{
-}
-
-static inline int swap_duplicate(swp_entry_t swp)
+static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
 {
 	return 0;
 }
@@ -569,6 +563,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
+static inline int swap_duplicate(swp_entry_t entry)
+{
+	return swap_duplicate_nr(entry, 1);
+}
+
 static inline void free_swap_and_cache(swp_entry_t entry)
 {
 	free_swap_and_cache_nr(entry, 1);
diff --git a/mm/shmem.c b/mm/shmem.c
index d7eeeaa9580d..e36330cdd066 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1667,7 +1667,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 		spin_unlock(&shmem_swaplist_lock);
 	}
 
-	swap_shmem_alloc(folio->swap, nr_pages);
+	swap_duplicate_nr(folio->swap, nr_pages);
 	shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
 
 	BUG_ON(folio_mapped(folio));
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3762b8f3f9e9..e23287c06f1c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -201,7 +201,7 @@ static bool swap_is_last_map(struct swap_info_struct *si,
 	unsigned char *map_end = map + nr_pages;
 	unsigned char count = *map;
 
-	if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM)
+	if (swap_count(count) != 1)
 		return false;
 
 	while (++map < map_end) {
@@ -1523,12 +1523,6 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
 	if (usage == SWAP_HAS_CACHE) {
 		VM_BUG_ON(!has_cache);
 		has_cache = 0;
-	} else if (count == SWAP_MAP_SHMEM) {
-		/*
-		 * Or we could insist on shmem.c using a special
-		 * swap_shmem_free() and free_shmem_swap_and_cache()...
-		 */
-		count = 0;
 	} else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
 		if (count == COUNT_CONTINUED) {
 			if (swap_count_continued(si, offset, count))
@@ -1626,7 +1620,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
 	if (nr <= 1)
 		goto fallback;
 	count = swap_count(data_race(si->swap_map[offset]));
-	if (count != 1 && count != SWAP_MAP_SHMEM)
+	if (count != 1)
 		goto fallback;
 
 	ci = swap_cluster_lock(si, offset);
@@ -1680,12 +1674,10 @@ static bool swap_entries_put_map_nr(struct swap_info_struct *si,
 
 /*
  * Check if it's the last ref of swap entry in the freeing path.
- * Qualified value includes 1, SWAP_HAS_CACHE or SWAP_MAP_SHMEM.
  */
 static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
 {
-	return (count == SWAP_HAS_CACHE) || (count == 1) ||
-	       (count == SWAP_MAP_SHMEM);
+	return (count == SWAP_HAS_CACHE) || (count == 1);
 }
 
 /*
@@ -3678,7 +3670,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 
 	offset = swp_offset(entry);
 	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-	VM_WARN_ON(usage == 1 && nr > 1);
 	ci = swap_cluster_lock(si, offset);
 
 	err = 0;
@@ -3738,27 +3729,28 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	return err;
 }
 
-/*
- * Help swapoff by noting that swap entry belongs to shmem/tmpfs
- * (in which case its reference count is never incremented).
- */
-void swap_shmem_alloc(swp_entry_t entry, int nr)
-{
-	__swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
-}
-
-/*
- * Increase reference count of swap entry by 1.
+/**
+ * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
+ * by 1.
+ *
+ * @entry: first swap entry from which we want to increase the refcount.
+ * @nr: Number of entries in range.
+ *
  * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
  * but could not be atomically allocated. Returns 0, just as if it succeeded,
  * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
  * might occur if a page table entry has got corrupted.
+ *
+ * Note that we are currently not handling the case where nr > 1 and we need to
+ * add swap count continuation. This is OK, because no such user exists - shmem
+ * is the only user that can pass nr > 1, and it never re-duplicates any swap
+ * entry it owns.
  */
-int swap_duplicate(swp_entry_t entry)
+int swap_duplicate_nr(swp_entry_t entry, int nr)
 {
 	int err = 0;
 
-	while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
+	while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
 		err = add_swap_count_continuation(entry, GFP_ATOMIC);
 	return err;
 }
-- 
2.52.0

From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:38 +0800
Subject: [PATCH v5 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out
Message-Id: <20251220-swap-table-p2-v5-9-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

When checking if a swap entry is swapped out, we simply check if the
bitwise result of the count value is larger than 0. But SWAP_MAP_BAD
will also be considered a swap count value larger than 0.

Treating SWAP_MAP_BAD as a count value larger than 0 is useful for the
swap allocator: such slots are seen as used, so the allocator skips
them. But for the swapped-out check, this isn't correct.

There is currently no observable issue. The swapped-out check is only
useful for readahead and for checking a folio's swapped-out status.
For readahead, the swap cache layer will abort upon checking and
updating the swap map. For the folio swapped-out status check, the
swap allocator never allocates an entry of a bad slot to a folio, so
that part is fine too. The worst that could happen now is redundant
allocation/freeing of folios and wasted CPU time.

This also makes it easier to get rid of the swap map checking and
update during folio insertion in the swap cache layer.
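The fixed predicate, as a minimal userspace sketch (the toy_ name and
'map' parameter stand in for the real helper and si->swap_map; the
constant values are assumed, as in the sketch for patch 08):

#include <stdbool.h>

#define SWAP_HAS_CACHE	0x40	/* assumed value */
#define SWAP_MAP_BAD	0x3f

/* Before this patch the predicate was effectively 'count != 0', which
 * wrongly classified bad slots (count == SWAP_MAP_BAD) as swapped out. */
static bool toy_swap_entry_swapped(const unsigned char *map,
				   unsigned long offset)
{
	unsigned char count = map[offset] & ~SWAP_HAS_CACHE; /* swap_count() */

	return count && count != SWAP_MAP_BAD;
}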
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 mm/swap_state.c |  2 +-
 mm/swapfile.c   | 17 +++++++++--------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8c429dc33ca9..b7a36c18082f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -527,7 +527,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
 	if (folio)
 		return folio;
 
-	/* Skip allocation for unused swap slot for readahead path. */
+	/* Skip allocation for unused and bad swap slot for readahead. */
 	if (!swap_entry_swapped(si, entry))
 		return NULL;
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e23287c06f1c..6d2ee1af0477 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1766,10 +1766,10 @@ int __swap_count(swp_entry_t entry)
 	return swap_count(si->swap_map[offset]);
 }
 
-/*
- * How many references to @entry are currently swapped out?
- * This does not give an exact answer when swap count is continued,
- * but does include the high COUNT_CONTINUED flag to allow for that.
+/**
+ * swap_entry_swapped - Check if the swap entry is swapped.
+ * @si: the swap device.
+ * @entry: the swap entry.
  */
 bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
 {
@@ -1780,7 +1780,8 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
 	ci = swap_cluster_lock(si, offset);
 	count = swap_count(si->swap_map[offset]);
 	swap_cluster_unlock(ci);
-	return !!count;
+
+	return count && count != SWAP_MAP_BAD;
 }
 
 /*
@@ -3677,10 +3678,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 		count = si->swap_map[offset + i];
 
 		/*
-		 * swapin_readahead() doesn't check if a swap entry is valid, so the
-		 * swap entry could be SWAP_MAP_BAD. Check here with lock held.
+		 * For swap out, the allocator never allocates bad slots. For
+		 * swapin, readahead is guarded by swap_entry_swapped().
 		 */
-		if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
+		if (WARN_ON(swap_count(count) == SWAP_MAP_BAD)) {
 			err = -ENOENT;
 			goto unlock_out;
 		}
-- 
2.52.0

From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:39 +0800
Subject: [PATCH v5 10/19] mm, swap: consolidate cluster reclaim and usability check
Message-Id: <20251220-swap-table-p2-v5-10-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

Swap cluster cache reclaim requires releasing the lock, so the cluster
may become unusable after the reclaim. To prepare for checking the
swap cache using the swap table directly, consolidate the swap cluster
reclaim and check logic.

With the swap table we will want to avoid touching the cluster's data
completely, to avoid RCU overhead here. Moving the cluster usable
check into the reclaim helper also avoids a redundant scan of the
slots if the cluster is no longer usable.

Also, adjust the helper very slightly while at it: always scan the
whole region during reclaim, and don't skip slots covered by a
reclaimed folio. Because the reclaim is lockless, it's possible that
new cache lands at any time; and for allocation, we want all caches
reclaimed to avoid fragmentation. Besides, if the scan offset is not
aligned with the size of the reclaimed folio, we might skip some
existing cache and fail the reclaim unexpectedly, as illustrated
below.

There should be no observable behavior change. It might slightly
improve the fragmentation issue or performance.
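The alignment pitfall can be seen with a small userspace toy (not
kernel code; slot states are simplified to 'C' for cached and 0 for
free): reclaiming one cached slot of a large folio frees the folio's
whole aligned range, so advancing the scan by the reclaimed size from
an unaligned offset skips slots that were never examined:

#include <stdio.h>

int main(void)
{
	/* 8 slots, two order-2 folios: slots 0-3 and 4-7 are cached */
	char map[8] = { 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C' };
	unsigned int offset = 2, nr_reclaimed = 4, i;

	/* a reclaim hit at offset 2 frees the covering folio, slots 0-3 */
	for (i = 0; i < 4; i++)
		map[i] = 0;

	/* old advance: offset += nr_reclaimed lands at slot 6, so the
	 * still-cached slots 4 and 5 are never visited in this pass;
	 * scanning slot by slot (the new behavior) does visit them. */
	printf("scan resumes at %u; slots 4,5: %c %c\n",
	       offset + nr_reclaimed, map[4], map[5]);
	return 0;
}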
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 mm/swapfile.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6d2ee1af0477..f3516e3c9e40 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -777,33 +777,51 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
 	return 0;
 }
 
+/*
+ * Reclaim drops the ci lock, so the cluster may become unusable (freed or
+ * stolen by a lower order). @usable will be set to false if that happens.
+ */
 static bool cluster_reclaim_range(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci,
-				  unsigned long start, unsigned long end)
+				  unsigned long start, unsigned int order,
+				  bool *usable)
 {
+	unsigned int nr_pages = 1 << order;
+	unsigned long offset = start, end = start + nr_pages;
 	unsigned char *map = si->swap_map;
-	unsigned long offset = start;
 	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
 	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
-			offset++;
 			break;
 		case SWAP_HAS_CACHE:
 			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
-			if (nr_reclaim > 0)
-				offset += nr_reclaim;
-			else
+			if (nr_reclaim < 0)
 				goto out;
 			break;
 		default:
 			goto out;
 		}
-	} while (offset < end);
+	} while (++offset < end);
 out:
 	spin_lock(&ci->lock);
+
+	/*
+	 * We just dropped ci->lock so cluster could be used by another
+	 * order or got freed, check if it's still usable or empty.
+	 */
+	if (!cluster_is_usable(ci, order)) {
+		*usable = false;
+		return false;
+	}
+	*usable = true;
+
+	/* Fast path, no need to scan if the whole cluster is empty */
+	if (cluster_is_empty(ci))
+		return true;
+
 	/*
 	 * Recheck the range no matter reclaim succeeded or not, the slot
 	 * could have been freed while we are not holding the lock.
@@ -900,9 +918,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
 	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
 	unsigned int nr_pages = 1 << order;
-	bool need_reclaim, ret;
+	bool need_reclaim, ret, usable;
 
 	lockdep_assert_held(&ci->lock);
+	VM_WARN_ON(!cluster_is_usable(ci, order));
 
 	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
 		goto out;
@@ -912,14 +931,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
 			continue;
 		if (need_reclaim) {
-			ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
-			/*
-			 * Reclaim drops ci->lock and cluster could be used
-			 * by another order. Not checking flag as off-list
-			 * cluster has no flag set, and change of list
-			 * won't cause fragmentation.
-			 */
-			if (!cluster_is_usable(ci, order))
+			ret = cluster_reclaim_range(si, ci, offset, order, &usable);
+			if (!usable)
 				goto out;
 			if (cluster_is_empty(ci))
 				offset = start;
-- 
2.52.0

From nobody Sun Feb 8 20:53:22 2026
From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:40 +0800
Subject: [PATCH v5 11/19] mm, swap: split locked entry duplicating into a standalone helper
Message-Id: <20251220-swap-table-p2-v5-11-8862a265a033@tencent.com>
To: linux-mm@kvack.org

No feature change; split the common logic into a standalone helper to be
reused later.

Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 mm/swapfile.c | 62 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f3516e3c9e40..c878c4115d00 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3667,26 +3667,14 @@ void si_swapinfo(struct sysinfo *val)
  * - swap-cache reference is requested but the entry is not used. -> ENOENT
  * - swap-mapped reference requested but needs continued swap count.
  *   -> ENOMEM
  */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+static int swap_dup_entries(struct swap_info_struct *si,
+			    struct swap_cluster_info *ci,
+			    unsigned long offset,
+			    unsigned char usage, int nr)
 {
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset;
-	unsigned char count;
-	unsigned char has_cache;
-	int err, i;
-
-	si = swap_entry_to_info(entry);
-	if (WARN_ON_ONCE(!si)) {
-		pr_err("%s%08lx\n", Bad_file, entry.val);
-		return -EINVAL;
-	}
-
-	offset = swp_offset(entry);
-	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-	ci = swap_cluster_lock(si, offset);
+	int i;
+	unsigned char count, has_cache;
 
-	err = 0;
 	for (i = 0; i < nr; i++) {
 		count = si->swap_map[offset + i];
 
@@ -3694,25 +3682,20 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 		 * For swapin out, allocator never allocates bad slots. for
 		 * swapin, readahead is guarded by swap_entry_swapped.
 		 */
-		if (WARN_ON(swap_count(count) == SWAP_MAP_BAD)) {
-			err = -ENOENT;
-			goto unlock_out;
-		}
+		if (WARN_ON(swap_count(count) == SWAP_MAP_BAD))
+			return -ENOENT;
 
 		has_cache = count & SWAP_HAS_CACHE;
 		count &= ~SWAP_HAS_CACHE;
 
 		if (!count && !has_cache) {
-			err = -ENOENT;
+			return -ENOENT;
 		} else if (usage == SWAP_HAS_CACHE) {
 			if (has_cache)
-				err = -EEXIST;
+				return -EEXIST;
 		} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
-			err = -EINVAL;
+			return -EINVAL;
 		}
-
-		if (err)
-			goto unlock_out;
 	}
 
 	for (i = 0; i < nr; i++) {
@@ -3731,14 +3714,31 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 			 * Don't need to rollback changes, because if
 			 * usage == 1, there must be nr == 1.
 			 */
-			err = -ENOMEM;
-			goto unlock_out;
+			return -ENOMEM;
 		}
 
 		WRITE_ONCE(si->swap_map[offset + i], count | has_cache);
 	}
 
-unlock_out:
+	return 0;
+}
+
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+{
+	int err;
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+
+	si = swap_entry_to_info(entry);
+	if (WARN_ON_ONCE(!si)) {
+		pr_err("%s%08lx\n", Bad_file, entry.val);
+		return -EINVAL;
+	}
+
+	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
+	ci = swap_cluster_lock(si, offset);
+	err = swap_dup_entries(si, ci, offset, usage, nr);
 	swap_cluster_unlock(ci);
 	return err;
 }
-- 
2.52.0
From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:41 +0800
Subject: [PATCH v5 12/19] mm, swap: use swap cache as the swap in synchronize layer
Message-Id: <20251220-swap-table-p2-v5-12-8862a265a033@tencent.com>
To: linux-mm@kvack.org

Current swap-in synchronization mostly uses the swap_map's SWAP_HAS_CACHE
bit: whoever sets the bit first does the actual work to swap in a folio.
This has been causing many issues, as it is just a poor implementation of
a bit lock. Raced users have no idea what is pinning a slot, so they have
to loop with a schedule_timeout_uninterruptible(1), which is ugly and
causes long-tail latency and other performance issues. Besides, the abuse
of SWAP_HAS_CACHE has been causing many other troubles for
synchronization and maintenance.

This is the first step to remove this bit completely. Now all swap-in
paths use the swap cache, and both the swap cache and the swap map are
protected by the cluster lock, so swap synchronization can be resolved
at the swap cache layer directly, using the cluster lock and the folio
lock: whoever inserts a folio into the swap cache first does the swap-in
work, and because folios are locked during swap operations, other raced
swap operations simply wait on the folio lock.

SWAP_HAS_CACHE will be removed in a later commit. For now, it is still
set for some remaining users, but the bit setting and the swap cache
folio insertion now happen in the same critical section, after the swap
cache is ready, so no one has to spin on the SWAP_HAS_CACHE bit anymore.

This both simplifies the logic and should improve performance,
eliminating issues like the one solved in commit 01626a1823024 ("mm:
avoid unconditional one-tick sleep when swapcache_prepare fails"), or
the "skip_if_exists" from commit a65b0e7607ccb ("zswap: make shrinking
memcg-aware"), which will be removed very soon.
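To make the new model concrete, the swap-in race resolves roughly as in the sketch below (illustrative only; the helper names match this patch, but the wrapper function is invented for the example, and locking details and error handling are trimmed):

	/*
	 * Whoever adds a folio to the swap cache first wins and performs
	 * the swap-in I/O; losers find the winner's folio and block on
	 * its folio lock instead of spinning on SWAP_HAS_CACHE.
	 */
	static struct folio *swapin_race_sketch(swp_entry_t entry, struct folio *new)
	{
		struct folio *folio;

		__folio_set_locked(new);
		__folio_set_swapbacked(new);
		if (!swap_cache_add_folio(new, entry, NULL, false))
			return new;		/* won: caller does the read */

		folio_unlock(new);		/* lost: another task beat us */
		folio = swap_cache_get_folio(entry);
		if (folio)
			folio_lock(folio);	/* sleeps until the winner unlocks */
		return folio;
	}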
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 include/linux/swap.h |   6 ---
 mm/swap.h            |  15 +++++++-
 mm/swap_state.c      | 105 ++++++++++++++++++++++++++++-------------------
 mm/swapfile.c        |  39 ++++++++++++-------
 mm/vmscan.c          |   1 -
 5 files changed, 96 insertions(+), 70 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index bf72b548a96d..74df3004c850 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -458,7 +458,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern int swapcache_prepare(swp_entry_t entry, int nr);
 extern void swap_free_nr(swp_entry_t entry, int nr_pages);
 extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
@@ -517,11 +516,6 @@ static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
 	return 0;
 }
 
-static inline int swapcache_prepare(swp_entry_t swp, int nr)
-{
-	return 0;
-}
-
 static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
 {
 }
diff --git a/mm/swap.h b/mm/swap.h
index e0f05babe13a..b5075a1aee04 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -234,6 +234,14 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
 	return folio_entry.val == round_down(entry.val, nr_pages);
 }
 
+/* Temporary internal helpers */
+void __swapcache_set_cached(struct swap_info_struct *si,
+			    struct swap_cluster_info *ci,
+			    swp_entry_t entry);
+void __swapcache_clear_cached(struct swap_info_struct *si,
+			      struct swap_cluster_info *ci,
+			      swp_entry_t entry, unsigned int nr);
+
 /*
  * All swap cache helpers below require the caller to ensure the swap entries
  * used are valid and stablize the device by any of the following ways:
@@ -247,7 +255,8 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
 */
 struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+			 void **shadow, bool alloc);
 void swap_cache_del_folio(struct folio *folio);
 struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
 				     struct mempolicy *mpol, pgoff_t ilx,
@@ -413,8 +422,10 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
 	return NULL;
 }
 
-static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
+static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+				       void **shadow, bool alloc)
 {
+	return -ENOENT;
 }
 
 static inline void swap_cache_del_folio(struct folio *folio)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b7a36c18082f..57311e63efa5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -128,34 +128,64 @@ void *swap_cache_get_shadow(swp_entry_t entry)
  * @entry: The swap entry corresponding to the folio.
  * @gfp: gfp_mask for XArray node allocation.
  * @shadowp: If a shadow is found, return the shadow.
+ * @alloc: If it's the allocator that is trying to insert a folio. Allocator
+ *         sets SWAP_HAS_CACHE to pin slots before insert so skip map update.
 *
 * Context: Caller must ensure @entry is valid and protect the swap device
 * with reference count or locks.
- * The caller also needs to update the corresponding swap_map slots with
- * SWAP_HAS_CACHE bit to avoid race or conflict.
 */
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+			 void **shadowp, bool alloc)
 {
+	int err;
 	void *shadow = NULL;
+	struct swap_info_struct *si;
 	unsigned long old_tb, new_tb;
 	struct swap_cluster_info *ci;
-	unsigned int ci_start, ci_off, ci_end;
+	unsigned int ci_start, ci_off, ci_end, offset;
 	unsigned long nr_pages = folio_nr_pages(folio);
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
 
+	si = __swap_entry_to_info(entry);
 	new_tb = folio_to_swp_tb(folio);
 	ci_start = swp_cluster_offset(entry);
 	ci_end = ci_start + nr_pages;
 	ci_off = ci_start;
-	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+	offset = swp_offset(entry);
+	ci = swap_cluster_lock(si, swp_offset(entry));
+	if (unlikely(!ci->table)) {
+		err = -ENOENT;
+		goto failed;
+	}
 	do {
-		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
-		WARN_ON_ONCE(swp_tb_is_folio(old_tb));
+		old_tb = __swap_table_get(ci, ci_off);
+		if (unlikely(swp_tb_is_folio(old_tb))) {
+			err = -EEXIST;
+			goto failed;
+		}
+		if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
+			err = -ENOENT;
+			goto failed;
+		}
 		if (swp_tb_is_shadow(old_tb))
 			shadow = swp_tb_to_shadow(old_tb);
+		offset++;
+	} while (++ci_off < ci_end);
+
+	ci_off = ci_start;
+	offset = swp_offset(entry);
+	do {
+		/*
+		 * Still need to pin the slots with SWAP_HAS_CACHE since
+		 * swap allocator depends on that.
+		 */
+		if (!alloc)
+			__swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
+		__swap_table_set(ci, ci_off, new_tb);
+		offset++;
 	} while (++ci_off < ci_end);
 
 	folio_ref_add(folio, nr_pages);
@@ -168,6 +198,11 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
 
 	if (shadowp)
 		*shadowp = shadow;
+	return 0;
+
+failed:
+	swap_cluster_unlock(ci);
+	return err;
 }
 
 /**
@@ -186,6 +221,7 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
 void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 			    swp_entry_t entry, void *shadow)
 {
+	struct swap_info_struct *si;
 	unsigned long old_tb, new_tb;
 	unsigned int ci_start, ci_off, ci_end;
 	unsigned long nr_pages = folio_nr_pages(folio);
@@ -195,6 +231,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
 
+	si = __swap_entry_to_info(entry);
 	new_tb = shadow_swp_to_tb(shadow);
 	ci_start = swp_cluster_offset(entry);
 	ci_end = ci_start + nr_pages;
@@ -210,6 +247,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 	folio_clear_swapcache(folio);
 	node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
 	lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
+	__swapcache_clear_cached(si, ci, entry, nr_pages);
 }
 
 /**
@@ -231,7 +269,6 @@ void swap_cache_del_folio(struct folio *folio)
 	__swap_cache_del_folio(ci, folio, entry, NULL);
 	swap_cluster_unlock(ci);
 
-	put_swap_folio(folio, entry);
 	folio_ref_sub(folio, folio_nr_pages(folio));
 }
 
@@ -423,67 +460,37 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
 						  gfp_t gfp, bool charged,
 						  bool skip_if_exists)
 {
-	struct folio *swapcache;
+	struct folio *swapcache = NULL;
 	void *shadow;
 	int ret;
 
-	/*
-	 * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio
-	 * into the swap cache. Loop with a schedule delay if raced with
-	 * another process setting SWAP_HAS_CACHE. This hackish loop will
-	 * be fixed very soon.
-	 */
+	__folio_set_locked(folio);
+	__folio_set_swapbacked(folio);
 	for (;;) {
-		ret = swapcache_prepare(entry, folio_nr_pages(folio));
+		ret = swap_cache_add_folio(folio, entry, &shadow, false);
 		if (!ret)
 			break;
 
 		/*
-		 * The skip_if_exists is for protecting against a recursive
-		 * call to this helper on the same entry waiting forever
-		 * here because SWAP_HAS_CACHE is set but the folio is not
-		 * in the swap cache yet. This can happen today if
-		 * mem_cgroup_swapin_charge_folio() below triggers reclaim
-		 * through zswap, which may call this helper again in the
-		 * writeback path.
-		 *
-		 * Large order allocation also needs special handling on
+		 * Large order allocation needs special handling on
		 * race: if a smaller folio exists in cache, swapin needs
		 * to fallback to order 0, and doing a swap cache lookup
		 * might return a folio that is irrelevant to the faulting
		 * entry because @entry is aligned down. Just return NULL.
		 */
 		if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
-			return NULL;
+			goto failed;
 
-		/*
-		 * Check the swap cache again, we can only arrive
-		 * here because swapcache_prepare returns -EEXIST.
-		 */
 		swapcache = swap_cache_get_folio(entry);
 		if (swapcache)
-			return swapcache;
-
-		/*
-		 * We might race against __swap_cache_del_folio(), and
-		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
-		 * has not yet been cleared. Or race against another
-		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
-		 * in swap_map, but not yet added its folio to swap cache.
-		 */
-		schedule_timeout_uninterruptible(1);
+			goto failed;
 	}
 
-	__folio_set_locked(folio);
-	__folio_set_swapbacked(folio);
-
 	if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
-		put_swap_folio(folio, entry);
-		folio_unlock(folio);
-		return NULL;
+		swap_cache_del_folio(folio);
+		goto failed;
 	}
 
-	swap_cache_add_folio(folio, entry, &shadow);
 	memcg1_swapin(entry, folio_nr_pages(folio));
 	if (shadow)
 		workingset_refault(folio, shadow);
@@ -491,6 +498,10 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
 	/* Caller will initiate read into locked folio */
 	folio_add_lru(folio);
 	return folio;
+
+failed:
+	folio_unlock(folio);
+	return swapcache;
 }
 
 /**
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c878c4115d00..38f3c369df72 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1476,7 +1476,11 @@ int folio_alloc_swap(struct folio *folio)
 	if (!entry.val)
 		return -ENOMEM;
 
-	swap_cache_add_folio(folio, entry, NULL);
+	/*
+	 * Allocator has pinned the slots with SWAP_HAS_CACHE
+	 * so it should never fail
+	 */
+	WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
 
 	return 0;
 
@@ -1582,9 +1586,8 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
 *   do_swap_page()
 *   ...				swapoff+swapon
 *   swap_cache_alloc_folio()
-*     swapcache_prepare()
-*       __swap_duplicate()
-*         // check swap_map
+*     swap_cache_add_folio()
+*       // check swap_map
 *   // verify PTE not changed
 *
 * In __swap_duplicate(), the swap_map need to be checked before
@@ -3769,17 +3772,25 @@ int swap_duplicate_nr(swp_entry_t entry, int nr)
 	return err;
 }
 
-/*
- * @entry: first swap entry from which we allocate nr swap cache.
- *
- * Called when allocating swap cache for existing swap entries,
- * This can return error codes. Returns 0 at success.
- * -EEXIST means there is a swap cache.
- * Note: return code is different from swap_duplicate().
- */
-int swapcache_prepare(swp_entry_t entry, int nr)
+/* Mark the swap map as HAS_CACHE, caller needs to hold the cluster lock */
+void __swapcache_set_cached(struct swap_info_struct *si,
+			    struct swap_cluster_info *ci,
+			    swp_entry_t entry)
+{
+	WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1));
+}
+
+/* Clear the swap map as !HAS_CACHE, caller needs to hold the cluster lock */
+void __swapcache_clear_cached(struct swap_info_struct *si,
+			      struct swap_cluster_info *ci,
+			      swp_entry_t entry, unsigned int nr)
 {
-	return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
+	if (swap_only_has_cache(si, swp_offset(entry), nr)) {
+		swap_entries_free(si, ci, entry, nr);
+	} else {
+		for (int i = 0; i < nr; i++, entry.val++)
+			swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+	}
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76e9864447cc..d4b08478d03d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -761,7 +761,6 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 		__swap_cache_del_folio(ci, folio, swap, shadow);
 		memcg1_swapout(folio, swap);
 		swap_cluster_unlock_irq(ci);
-		put_swap_folio(folio, swap);
 	} else {
 		void (*free_folio)(struct folio *);
 
-- 
2.52.0
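One practical effect of the hunks above, sketched for illustration (simplified, not verbatim kernel code): removing a folio from the swap cache now also drops the SWAP_HAS_CACHE pins, and frees slots whose count reached zero, all under the same cluster lock, which is why the separate put_swap_folio() calls disappear from callers such as __remove_mapping():

	ci = swap_cluster_lock(si, swp_offset(entry));
	__swap_cache_del_folio(ci, folio, entry, shadow);	/* also clears HAS_CACHE */
	swap_cluster_unlock(ci);
	/* previously required here: put_swap_folio(folio, entry) */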
header.b="ei3l1l8f" Received: by mail-pl1-f176.google.com with SMTP id d9443c01a7336-29f30233d8aso27415355ad.0 for ; Fri, 19 Dec 2025 11:45:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1766173519; x=1766778319; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=PF5XJt/nU8iAJ4BQE3QnHtVFZgGKv4P0kbw6cXKmw6U=; b=ei3l1l8fzVWiqPAkzQI5iVFsZiPjjH2ZNw0tke54uvfhiJtq87WjncDvWOI7faqZAy ZyL0wmP2OnK3kECAhIOAWkLJPRXElGC2/lVz1z0QGukbM/uiRzAgEXpsYZtgDJ568+MJ A6DpRDxLmYpRpPLrmZtmOmN+tOsRh86tosD6XmlVyDMtD1KZKrG9fsmeKLjQWmgoSxTE h+fZjSB0YBBW6/Sjdl1Afb4lrWN0AGtrSXO21AN4D78zbHp/4jM7W7X9grzUTHwlkXcz GCbTUZqV87z0QrL0SXSlcE6gs5qtb11OZQW1A5fN0Fg1rq3qQ4iGU6lx6L9yP5VNFwC7 5utg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1766173519; x=1766778319; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=PF5XJt/nU8iAJ4BQE3QnHtVFZgGKv4P0kbw6cXKmw6U=; b=qJz0h2DrqMGhz1zl/gRVapu2QvTTyAjFaZfNqbLrloiCsrwcns9N/pzwecTqo7KgMC 3n0gqrW69IuBKON3BJceld3p3FechQJR6OKX0qTK3dM+SaCUHhh/uuNEN6+P036bK4lz rmhm6mhj7Dz8L1QLfI7QDgSSRxXwikJ1Nrk8U53LQCJpXyX3E5fqlfK+UBa2y1uPZV4/ KHilVCpjxHVsbRtDlO3ooSjrnT50ozQ+ahP7EsFKV6fXmwFOLR5pMVUDl3BW5cIYLYhk nNHTBxxM4asqVntl1ikML1Mn+B4FLxZXbn9KlQSdJlhIJCar86YS348O+ePIFBGPTmQA 8Y4w== X-Forwarded-Encrypted: i=1; AJvYcCWqx9PceJt3X86bW6UzjE70MXQfrAY9FY86kWW5M5YDjl3JJhxjKUPS/T3C0TIZgOcXi91Hq+8K2lOuWKA=@vger.kernel.org X-Gm-Message-State: AOJu0YxMVJ/mJJsKtTDGeL2OahP6E3/RsKC8h2qDUKrCT4W4NqZE/2db BYgc9JhSwpi7eub3vnK9/ggoSmNSJOMtYaAhsAQSrKbrjL6XoJxFjQ2e X-Gm-Gg: AY/fxX7Hq/eztzlNxXgZJcJR/V25keocXNpf9k6gS8WN/22WepkS/LTq0JfWK+Q7z5o NmiNdVl6VY36nv+osl5ucEpvohyMeiBdlvyGmLHryZn5agm0HurULM8but4Gei0ykQi79SibJAZ XqEJbt2SmzvS43u113okPe7Co/2TG7xtpf+t8gUJTFIlLZ5FnrL8I0wc9aRZjMNhtPwjp+6D5S2 BKCICtwNdz99yb22Szp0jqK9ANBOWCfqbzq6tue8TB00kYIT9wPCLbj4/nQ3r+zTf0wu4Bd2cuK eQ9lHr1/FyVR4seVIl2uZvfkoFIA4R9WAU9ySwpkuD94Xl2exL3QggM3yuCmA8oWyFkkZmb7dLB GK6tpbeREHocURgKkAoHByQQc/qxd9qKawJtOQgxJIt35bhvx9SNoFvHEon7JOoPN75zn5s8+24 MNRqs1i7zGDQdO2eZaZmo1Rv/qWKTDWU9ojB/9Ewj2brnDQ7zMYYqSu5JqLhKhZKY= X-Google-Smtp-Source: AGHT+IHyFiCjGWjxbpnTxEAvuhX6HZnuwDWgftPyPKoGfJjyR1MJhk4bDdtEm2j5zF1rCoIDXvQ7mw== X-Received: by 2002:a17:902:d2c6:b0:2a0:7f60:9786 with SMTP id d9443c01a7336-2a2f2329b92mr40659905ad.26.1766173518719; Fri, 19 Dec 2025 11:45:18 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2a2f3d76ceesm30170985ad.91.2025.12.19.11.45.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 19 Dec 2025 11:45:18 -0800 (PST) From: Kairui Song Date: Sat, 20 Dec 2025 03:43:42 +0800 Subject: [PATCH v5 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251220-swap-table-p2-v5-13-8862a265a033@tencent.com> References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com> In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , 
Remove the "skip if exists" check from commit a65b0e7607ccb ("zswap: make
shrinking memcg-aware"). It was needed because there was a tiny time
window between setting the SWAP_HAS_CACHE bit and actually adding the
folio to the swap cache: if one user was trying to add a folio to the
swap cache while another user had been interrupted after setting
SWAP_HAS_CACHE but before adding its folio to the swap cache, it might
lead to a deadlock. We have since moved the bit setting into the same
critical section as adding the folio, so this workaround is no longer
needed. Remove it and clean it up.

Reviewed-by: Baoquan He
Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/swap.h       |  2 +-
 mm/swap_state.c | 27 ++++++++++-----------------
 mm/zswap.c      |  2 +-
 3 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index b5075a1aee04..6777b2ab9d92 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -260,7 +260,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
 void swap_cache_del_folio(struct folio *folio);
 struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
 				     struct mempolicy *mpol, pgoff_t ilx,
-				     bool *alloced, bool skip_if_exists);
+				     bool *alloced);
 /* Below helpers require the caller to lock and pass in the swap cluster. */
 void __swap_cache_del_folio(struct swap_cluster_info *ci,
 			    struct folio *folio, swp_entry_t entry, void *shadow);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 57311e63efa5..327c051d7cd0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -445,8 +445,6 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 * @folio: folio to be added.
 * @gfp: memory allocation flags for charge, can be 0 if @charged if true.
 * @charged: if the folio is already charged.
- * @skip_if_exists: if the slot is in a cached state, return NULL.
- *                  This is an old workaround that will be removed shortly.
 *
 * Update the swap_map and add folio as swap cache, typically before swapin.
 * All swap slots covered by the folio must have a non-zero swap count.
@@ -457,8 +455,7 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 */
 static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
 						  struct folio *folio,
-						  gfp_t gfp, bool charged,
-						  bool skip_if_exists)
+						  gfp_t gfp, bool charged)
 {
 	struct folio *swapcache = NULL;
 	void *shadow;
@@ -478,7 +475,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
		 * might return a folio that is irrelevant to the faulting
		 * entry because @entry is aligned down. Just return NULL.
		 */
-		if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
+		if (ret != -EEXIST || folio_test_large(folio))
 			goto failed;
 
 		swapcache = swap_cache_get_folio(entry);
@@ -511,8 +508,6 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
 * @mpol: NUMA memory allocation policy to be applied
 * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
 * @new_page_allocated: sets true if allocation happened, false otherwise
- * @skip_if_exists: if the slot is a partially cached state, return NULL.
- *                  This is a workaround that would be removed shortly.
 *
 * Allocate a folio in the swap cache for one swap slot, typically before
 * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
@@ -525,8 +520,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
 */
 struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
 				     struct mempolicy *mpol, pgoff_t ilx,
-				     bool *new_page_allocated,
-				     bool skip_if_exists)
+				     bool *new_page_allocated)
 {
 	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct folio *folio;
@@ -547,8 +541,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
 	if (!folio)
 		return NULL;
 	/* Try add the new folio, returns existing folio or NULL on failure. */
-	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
-					      false, skip_if_exists);
+	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false);
 	if (result == folio)
 		*new_page_allocated = true;
 	else
@@ -577,7 +570,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
 	unsigned long nr_pages = folio_nr_pages(folio);
 
 	entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
-	swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true, false);
+	swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true);
 	if (swapcache == folio)
 		swap_read_folio(folio, NULL);
 	return swapcache;
@@ -605,7 +598,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
 	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
-				       &page_allocated, false);
+				       &page_allocated);
 	mpol_cond_put(mpol);
 
 	if (page_allocated)
@@ -724,7 +717,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		/* Ok, do the async read-ahead now */
 		folio = swap_cache_alloc_folio(
 				swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
-				&page_allocated, false);
+				&page_allocated);
 		if (!folio)
 			continue;
 		if (page_allocated) {
@@ -742,7 +735,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 skip:
 	/* The page was likely read above, so no need for plugging here */
 	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
-				       &page_allocated, false);
+				       &page_allocated);
 	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
 	return folio;
@@ -847,7 +840,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 			continue;
 		}
 		folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
-					       &page_allocated, false);
+					       &page_allocated);
 		if (si)
 			put_swap_device(si);
 		if (!folio)
@@ -869,7 +862,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 skip:
 	/* The folio was likely read above, so no need for plugging here */
 	folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
-				       &page_allocated, false);
+				       &page_allocated);
 	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
 	return folio;
diff --git a/mm/zswap.c b/mm/zswap.c
index a7a2443912f4..d8a33db9d3cc 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1015,7 +1015,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 
 	mpol = get_task_policy(current);
 	folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
-			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+			NO_INTERLEAVE_INDEX, &folio_was_allocated);
 	put_swap_device(si);
 	if (!folio)
 		return -ENOMEM;
-- 
2.52.0
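To make the removed hazard concrete, the pattern deleted in the previous patch looked roughly like this (condensed sketch, not the verbatim code):

	for (;;) {
		ret = swapcache_prepare(entry, nr);	/* pins slot with SWAP_HAS_CACHE */
		if (!ret)
			break;
		folio = swap_cache_get_folio(entry);	/* may still be NULL here... */
		if (folio)
			return folio;
		schedule_timeout_uninterruptible(1);	/* ...nothing to wait on, so sleep */
	}

Since swap_cache_add_folio() now pins the slot and publishes the folio in one critical section, a failed add means a folio really is (or was) present, so there is always a folio lock to wait on and the recursion-protection flag can go.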
From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:43 +0800
Subject: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
Message-Id: <20251220-swap-table-p2-v5-14-8862a265a033@tencent.com>
To: linux-mm@kvack.org

The current swap entry allocation/freeing workflow has never had a clear
definition, which makes it hard to debug or to add new optimizations.
This commit introduces a proper definition of how swap entries are
allocated and freed.

Now, most operations are folio based, so they never exceed one swap
cluster, and we have a cleaner border between swap and the rest of mm,
making it much easier to follow and debug, especially with the newly
added sanity checks. It also makes more optimizations possible.

Swap entries will mostly be allocated and freed with a folio bound to
them. The folio lock is useful for resolving many swap-related races.
Now swap allocation (except hibernation) always starts with a folio in
the swap cache, and the entries get duped/freed protected by the folio
lock:

- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.

  This allocates one or a set of continuous swap slots for a folio and
  binds them to the folio by adding the folio to the swap cache. The
  swap slots' swap count starts with a zero value.

- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).

  This increases the ref count of swap entries allocated to a folio.
  Newly allocated swap slots' counts have to be increased by this helper
  as the folio gets unmapped (and swap entries get installed).

- folio_put_swap() - Decrease the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the swap entry owner (e.g., PTL).

  This decreases the ref count of swap entries allocated to a folio.
  Typically, swapin will decrease the swap count as the folio gets
  installed back and the swap entry gets uninstalled.

  This won't remove the folio from the swap cache and free the slot.
  Lazy freeing of swap cache is helpful for reducing IO. There is
  already a folio_free_swap() for immediate cache reclaim. This part
  could be further optimized later.

The above locking constraints could be further relaxed once the swap
table is fully implemented. Currently, dup still needs the caller to
lock the swap entry container (e.g., PTL), or a concurrent zap may
underflow the swap count.

Some swap users need to interact with the swap count without involving
a folio (e.g., forking/zapping the page table, or mapping truncate
without swapin). In such cases, the caller has to ensure there is no
race condition on whatever owns the swap count and use the helpers
below:

- swap_put_entries_direct() - Decrease the swap count directly.
  Context: The caller must lock whatever is referencing the slots to
  avoid a race.

  Typically, page table zapping or shmem mapping truncate will need to
  free swap slots directly. If a slot is cached (has a folio bound),
  this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly.
  Context: The caller must lock whatever is referencing the entries to
  avoid a race, and the entries must already have a swap count > 1.

  Typically, forking will need to copy the page table and hence needs to
  increase the swap count of the entries in the table. The page table is
  locked while referencing the swap entries, so the entries all have a
  swap count > 1 and can't be freed.

The hibernation subsystem is a bit different, so two special wrappers
are provided:

- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
  helper.

All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.

By separating the workflows, it will be possible to bind folios more
tightly with the swap cache and get rid of SWAP_HAS_CACHE as a temporary
pin. This commit should not introduce any behavior change.
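For illustration, the intended lifecycle of one anon folio reads roughly like the sketch below (simplified, not verbatim kernel code; error handling and the page-table side are omitted):

	folio_lock(folio);
	folio_alloc_swap(folio);	/* slots allocated, count == 0, folio in cache */
	folio_dup_swap(folio, NULL);	/* unmap: swap entries installed, count 0 -> 1 */
	folio_unlock(folio);
	/* ... writeback and reclaim; a later swapin locks the folio again ... */
	folio_lock(folio);
	folio_put_swap(folio, NULL);	/* folio mapped back, count 1 -> 0, cache kept */
	folio_free_swap(folio);		/* optional: drop the cache, free the slots */
	folio_unlock(folio);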
Cc: linux-pm@vger.kernel.org
Acked-by: Rafael J. Wysocki (Intel)
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 arch/s390/mm/gmap_helpers.c |   2 +-
 arch/s390/mm/pgtable.c      |   2 +-
 include/linux/swap.h        |  58 ++++++++---------
 kernel/power/swap.c         |  10 +--
 mm/madvise.c                |   2 +-
 mm/memory.c                 |  15 +++--
 mm/rmap.c                   |   7 +-
 mm/shmem.c                  |  10 +--
 mm/swap.h                   |  37 +++++++++++
 mm/swapfile.c               | 152 +++++++++++++++++++++++++++++++------------
 10 files changed, 197 insertions(+), 98 deletions(-)

diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c
index d41b19925a5a..dd89fce28531 100644
--- a/arch/s390/mm/gmap_helpers.c
+++ b/arch/s390/mm/gmap_helpers.c
@@ -32,7 +32,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)
 		dec_mm_counter(mm, MM_SWAPENTS);
 	else if (softleaf_is_migration(entry))
 		dec_mm_counter(mm, mm_counter(softleaf_to_folio(entry)));
-	free_swap_and_cache(entry);
+	swap_put_entries_direct(entry, 1);
 }
 
 /**
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 666adcd681ab..b22181e1079e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -682,7 +682,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)
 
 		dec_mm_counter(mm, mm_counter(folio));
 	}
-	free_swap_and_cache(entry);
+	swap_put_entries_direct(entry, 1);
 }
 
 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 74df3004c850..aaa868f60b9c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-int folio_alloc_swap(struct folio *folio);
-bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
-extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
@@ -471,6 +465,29 @@ struct backing_dev_info;
 extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
 sector_t swap_folio_sector(struct folio *folio);
 
+/*
+ * If there is an existing swap slot reference (swap entry) and the caller
+ * guarantees that there is no race modification of it (e.g., PTL
+ * protecting the swap entry in page table; shmem's cmpxchg protects the
+ * swap entry in shmem mapping), these two helpers below can be used
+ * to put/dup the entries directly.
+ *
+ * All entries must be allocated by folio_alloc_swap(). And they must have
+ * a swap count > 1. See comments of folio_*_swap helpers for more info.
+ */
+int swap_dup_entry_direct(swp_entry_t entry);
+void swap_put_entries_direct(swp_entry_t entry, int nr);
+
+/*
+ * folio_free_swap tries to free the swap entries pinned by a swap cache
+ * folio, it has to be here to be called by other components.
+ */
+bool folio_free_swap(struct folio *folio);
+
+/* Allocate / free (hibernation) exclusive entries */
+swp_entry_t swap_alloc_hibernation_slot(int type);
+void swap_free_hibernation_slot(swp_entry_t entry);
+
 static inline void put_swap_device(struct swap_info_struct *si)
 {
 	percpu_ref_put(&si->users);
@@ -498,10 +515,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr));
 
-static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
-{
-}
-
 static inline void free_swap_cache(struct folio *folio)
 {
 }
@@ -511,12 +524,12 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
+static inline int swap_dup_entry_direct(swp_entry_t ent)
 {
 	return 0;
 }
 
-static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
 {
 }
 
@@ -539,11 +552,6 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-static inline int folio_alloc_swap(struct folio *folio)
-{
-	return -EINVAL;
-}
-
 static inline bool folio_free_swap(struct folio *folio)
 {
 	return false;
@@ -556,22 +564,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 	return -EINVAL;
 }
 #endif /* CONFIG_SWAP */
-
-static inline int swap_duplicate(swp_entry_t entry)
-{
-	return swap_duplicate_nr(entry, 1);
-}
-
-static inline void free_swap_and_cache(swp_entry_t entry)
-{
-	free_swap_and_cache_nr(entry, 1);
-}
-
-static inline void swap_free(swp_entry_t entry)
-{
-	swap_free_nr(entry, 1);
-}
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 33a186373bef..859476a714ac 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -174,10 +174,10 @@ sector_t alloc_swapdev_block(int swap)
 	 * Allocate a swap page and register that it has been allocated, so that
 	 * it can be freed in case of an error.
 	 */
-	offset = swp_offset(get_swap_page_of_type(swap));
+	offset = swp_offset(swap_alloc_hibernation_slot(swap));
 	if (offset) {
 		if (swsusp_extents_insert(offset))
-			swap_free(swp_entry(swap, offset));
+			swap_free_hibernation_slot(swp_entry(swap, offset));
 		else
 			return swapdev_block(swap, offset);
 	}
@@ -186,6 +186,7 @@ sector_t alloc_swapdev_block(int swap)
 
 void free_all_swap_pages(int swap)
 {
+	unsigned long offset;
 	struct rb_node *node;
 
 	/*
@@ -197,8 +198,9 @@ void free_all_swap_pages(int swap)
 
 		ext = rb_entry(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
-		swap_free_nr(swp_entry(swap, ext->start),
-			     ext->end - ext->start + 1);
+
+		for (offset = ext->start; offset < ext->end; offset++)
+			swap_free_hibernation_slot(swp_entry(swap, offset));
 
 		kfree(ext);
 	}
diff --git a/mm/madvise.c b/mm/madvise.c
index 6bf7009fa5ce..5f79f6fabfc0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -694,7 +694,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			max_nr = (end - addr) / PAGE_SIZE;
 			nr = swap_pte_batch(pte, max_nr, ptent);
 			nr_swap -= nr;
-			free_swap_and_cache_nr(entry, nr);
+			swap_put_entries_direct(entry, nr);
 			clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
 		} else if (softleaf_is_hwpoison(entry) ||
 			   softleaf_is_poison_marker(entry)) {
diff --git a/mm/memory.c b/mm/memory.c
index a4c58341c44a..a61508107f6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -934,7 +934,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	struct page *page;
 
 	if (likely(softleaf_is_swap(entry))) {
-		if (swap_duplicate(entry) < 0)
+		if (swap_dup_entry_direct(entry) < 0)
 			return -EIO;
 
 		/* make sure dst_mm is on swapoff's mmlist. */
@@ -1744,7 +1744,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 
 		nr = swap_pte_batch(pte, max_nr, ptent);
 		rss[MM_SWAPENTS] -= nr;
-		free_swap_and_cache_nr(entry, nr);
+		swap_put_entries_direct(entry, nr);
 	} else if (softleaf_is_migration(entry)) {
 		struct folio *folio = softleaf_to_folio(entry);
 
@@ -4933,7 +4933,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	/*
 	 * Some architectures may have to restore extra metadata to the page
 	 * when reading from swap. This metadata may be indexed by swap entry
-	 * so this must be called before swap_free().
+	 * so this must be called before folio_put_swap().
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
@@ -4971,6 +4971,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(folio != swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
+		folio_put_swap(swapcache, NULL);
 	} else if (!folio_test_anon(folio)) {
 		/*
 		 * We currently only expect !anon folios that are fully
@@ -4979,9 +4980,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
 		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+		folio_put_swap(folio, NULL);
 	} else {
+		VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
 		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
-					rmap_flags);
+					 rmap_flags);
+		folio_put_swap(folio, nr_pages == 1 ? page : NULL);
 	}
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
@@ -4995,7 +4999,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * Do it after mapping, so raced page faults will likely see the folio
 	 * in swap cache and wait on the folio lock.
 	 */
-	swap_free_nr(entry, nr_pages);
 	if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
 		folio_free_swap(folio);
 
@@ -5005,7 +5008,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * Hold the lock to avoid the swap entry to be reused
 	 * until we take the PT lock for the pte_same() check
 	 * (to avoid false positives from pte_same). For
-	 * further safety release the lock after the swap_free
+	 * further safety release the lock after the folio_put_swap
 	 * so that the swap count won't change under a
 	 * parallel locked swapcache.
 	 */
diff --git a/mm/rmap.c b/mm/rmap.c
index d6799afe1114..e805ddc5a27b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -82,6 +82,7 @@
 #include
 
 #include "internal.h"
+#include "swap.h"
 
 static struct kmem_cache *anon_vma_cachep;
 static struct kmem_cache *anon_vma_chain_cachep;
@@ -2147,7 +2148,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				goto discard;
 			}
 
-			if (swap_duplicate(entry) < 0) {
+			if (folio_dup_swap(folio, subpage) < 0) {
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
@@ -2158,7 +2159,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * so we'll not check/care.
 			 */
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				swap_free(entry);
+				folio_put_swap(folio, subpage);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
@@ -2166,7 +2167,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			/* See folio_try_share_anon_rmap(): clear PTE first. */
 			if (anon_exclusive &&
 			    folio_try_share_anon_rmap_pte(folio, subpage)) {
-				swap_free(entry);
+				folio_put_swap(folio, subpage);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
diff --git a/mm/shmem.c b/mm/shmem.c
index e36330cdd066..df346f0c8ddc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -970,7 +970,7 @@ static long shmem_free_swap(struct address_space *mapping,
 	old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
 	if (old != radswap)
 		return 0;
-	free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+	swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
 
 	return 1 << order;
 }
@@ -1667,7 +1667,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 		spin_unlock(&shmem_swaplist_lock);
 	}
 
-	swap_duplicate_nr(folio->swap, nr_pages);
+	folio_dup_swap(folio, NULL);
 	shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
 
 	BUG_ON(folio_mapped(folio));
@@ -1688,7 +1688,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 	/* Swap entry might be erased by racing shmem_free_swap() */
 	if (!error) {
 		shmem_recalc_inode(inode, 0, -nr_pages);
-		swap_free_nr(folio->swap, nr_pages);
+		folio_put_swap(folio, NULL);
 	}
 
 	/*
@@ -2174,6 +2174,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 
 	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
+	folio_put_swap(folio, NULL);
 	swap_cache_del_folio(folio);
 	/*
 	 * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
@@ -2181,7 +2182,6 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	 * in shmem_evict_inode().
*/ shmem_recalc_inode(inode, -nr_pages, -nr_pages); - swap_free_nr(swap, nr_pages); } =20 static int shmem_split_large_entry(struct inode *inode, pgoff_t index, @@ -2404,9 +2404,9 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, if (sgp =3D=3D SGP_WRITE) folio_mark_accessed(folio); =20 + folio_put_swap(folio, NULL); swap_cache_del_folio(folio); folio_mark_dirty(folio); - swap_free_nr(swap, nr_pages); put_swap_device(si); =20 *foliop =3D folio; diff --git a/mm/swap.h b/mm/swap.h index 6777b2ab9d92..9ed12936b889 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -183,6 +183,28 @@ static inline void swap_cluster_unlock_irq(struct swap= _cluster_info *ci) spin_unlock_irq(&ci->lock); } =20 +/* + * Below are the core routines for doing swap for a folio. + * All helpers require the folio to be locked, and a locked folio + * in the swap cache pins the swap entries / slots allocated to the + * folio; swap relies heavily on the swap cache and folio lock for + * synchronization. + * + * folio_alloc_swap(): the entry point for a folio to be swapped + * out. It allocates swap slots and pins the slots with swap cache. + * The slots start with a swap count of zero. + * + * folio_dup_swap(): increases the swap count of a folio, usually + * when it gets unmapped and a swap entry is installed to replace + * it (e.g., a swap entry in a page table). A swap slot with swap + * count =3D=3D 0 should only be increased by this helper. + * + * folio_put_swap(): does the opposite of folio_dup_swap(). + */ +int folio_alloc_swap(struct folio *folio); +int folio_dup_swap(struct folio *folio, struct page *subpage); +void folio_put_swap(struct folio *folio, struct page *subpage); + /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; @@ -363,9 +385,24 @@ static inline struct swap_info_struct *__swap_entry_to= _info(swp_entry_t entry) return NULL; } =20 +static inline int folio_alloc_swap(struct folio *folio) +{ + return -EINVAL; +} + +static inline int folio_dup_swap(struct folio *folio, struct page *page) +{ + return -EINVAL; +} + +static inline void folio_put_swap(struct folio *folio, struct page *page) +{ +} + static inline void swap_read_folio(struct folio *folio, struct swap_iocb *= *plug) { } + static inline void swap_write_unplug(struct swap_iocb *sio) { } diff --git a/mm/swapfile.c b/mm/swapfile.c index 38f3c369df72..f812fdea68b3 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -58,6 +58,9 @@ static void swap_entries_free(struct swap_info_struct *si, swp_entry_t entry, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); +static bool swap_entries_put_map(struct swap_info_struct *si, + swp_entry_t entry, int nr); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -1482,6 +1485,12 @@ int folio_alloc_swap(struct folio *folio) */ WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); =20 + /* + * Allocator should always allocate aligned entries so folio based + * operations never cross more than one cluster. + */ + VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio); + return 0; =20 out_free: @@ -1489,6 +1498,66 @@ int folio_alloc_swap(struct folio *folio) return -ENOMEM; } =20 +/** + * folio_dup_swap() - Increase swap count of swap entries of a folio. + * @folio: folio with swap entries bound to it.
+ * @subpage: if not NULL, only increase the swap count of this subpage. + * + * Typically called when the folio is unmapped and has its swap entry + * take its place. + * + * Context: Caller must ensure the folio is locked and in the swap cache. + * NOTE: The caller also has to ensure there is no racing call to + * swap_put_entries_direct on its swap entry before this helper returns, or + * the swap map may underflow. Currently, we only accept @subpage =3D=3D N= ULL + * for shmem due to the limitation of swap continuation: shmem always + * duplicates the swap entry only once, so there is no such issue for it. + */ +int folio_dup_swap(struct folio *folio, struct page *subpage) +{ + int err =3D 0; + swp_entry_t entry =3D folio->swap; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); + + if (subpage) { + entry.val +=3D folio_page_idx(folio, subpage); + nr_pages =3D 1; + } + + while (!err && __swap_duplicate(entry, 1, nr_pages) =3D=3D -ENOMEM) + err =3D add_swap_count_continuation(entry, GFP_ATOMIC); + + return err; +} + +/** + * folio_put_swap() - Decrease swap count of swap entries of a folio. + * @folio: folio with swap entries bound to it, must be in swap cache and + * locked. + * @subpage: if not NULL, only decrease the swap count of this subpage. + * + * This won't free the swap slots even if their swap count drops to zero; + * they are still pinned by the swap cache. Users may call folio_free_swap + * to free them. + * Context: Caller must ensure the folio is locked and in the swap cache. + */ +void folio_put_swap(struct folio *folio, struct page *subpage) +{ + swp_entry_t entry =3D folio->swap; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); + + if (subpage) { + entry.val +=3D folio_page_idx(folio, subpage); + nr_pages =3D 1; + } + + swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages); +} + static struct swap_info_struct *_swap_info_get(swp_entry_t entry) { struct swap_info_struct *si; @@ -1729,28 +1798,6 @@ static void swap_entries_free(struct swap_info_struc= t *si, partial_free_cluster(si, ci); } =20 -/* - * Caller has made sure that the swap device corresponding to entry - * is still around or has not been recycled. - */ -void swap_free_nr(swp_entry_t entry, int nr_pages) -{ - int nr; - struct swap_info_struct *sis; - unsigned long offset =3D swp_offset(entry); - - sis =3D _swap_info_get(entry); - if (!sis) - return; - - while (nr_pages) { - nr =3D min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER= ); - swap_entries_put_map(sis, swp_entry(sis->type, offset), nr); - offset +=3D nr; - nr_pages -=3D nr; - } -} - /* * Called after dropping swapcache to decrease refcnt to swap entries. */ @@ -1940,16 +1987,19 @@ bool folio_free_swap(struct folio *folio) } =20 /** - * free_swap_and_cache_nr() - Release reference on range of swap entries a= nd - * reclaim their cache if no more references re= main. + * swap_put_entries_direct() - Release reference on range of swap entries = and + * reclaim their cache if no more references r= emain. * @entry: First entry of range. * @nr: Number of entries in range. * * For each swap entry in the contiguous range, release a reference. If an= y swap * entries become free, try to reclaim their underlying folios, if present= . The * offset range is defined by [entry.offset, entry.offset + nr).
+ * + * Context: Caller must ensure there is no race condition on the reference + * owner. e.g., locking the PTL of a PTE containing the entry being releas= ed. */ -void free_swap_and_cache_nr(swp_entry_t entry, int nr) +void swap_put_entries_direct(swp_entry_t entry, int nr) { const unsigned long start_offset =3D swp_offset(entry); const unsigned long end_offset =3D start_offset + nr; @@ -1958,10 +2008,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int n= r) unsigned long offset; =20 si =3D get_swap_device(entry); - if (!si) + if (WARN_ON_ONCE(!si)) return; - - if (WARN_ON(end_offset > si->max)) + if (WARN_ON_ONCE(end_offset > si->max)) goto out; =20 /* @@ -2005,8 +2054,8 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) } =20 #ifdef CONFIG_HIBERNATION - -swp_entry_t get_swap_page_of_type(int type) +/* Allocate a slot for hibernation */ +swp_entry_t swap_alloc_hibernation_slot(int type) { struct swap_info_struct *si =3D swap_type_to_info(type); unsigned long offset; @@ -2034,6 +2083,27 @@ swp_entry_t get_swap_page_of_type(int type) return entry; } =20 +/* Free a slot allocated by swap_alloc_hibernation_slot */ +void swap_free_hibernation_slot(swp_entry_t entry) +{ + struct swap_info_struct *si; + struct swap_cluster_info *ci; + pgoff_t offset =3D swp_offset(entry); + + si =3D get_swap_device(entry); + if (WARN_ON(!si)) + return; + + ci =3D swap_cluster_lock(si, offset); + swap_entry_put_locked(si, ci, entry, 1); + WARN_ON(swap_entry_swapped(si, entry)); + swap_cluster_unlock(ci); + + /* In theory readahead might add it to the swap cache by accident */ + __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); + put_swap_device(si); +} + /* * Find the swap type that corresponds to given device (if any). * @@ -2195,7 +2265,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_= t *pmd, /* * Some architectures may have to restore extra metadata to the page * when reading from swap. This metadata may be indexed by swap entry - * so this must be called before swap_free(). + * so this must be called before folio_put_swap(). */ arch_swap_restore(folio_swap(entry, folio), folio); =20 @@ -2236,7 +2306,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_= t *pmd, new_pte =3D pte_mkuffd_wp(new_pte); setpte: set_pte_at(vma->vm_mm, addr, pte, new_pte); - swap_free(entry); + folio_put_swap(folio, page); out: if (pte) pte_unmap_unlock(pte, ptl); @@ -3746,28 +3816,22 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) return err; } =20 -/** - * swap_duplicate_nr() - Increase reference count of nr contiguous swap en= tries - * by 1. - * +/* + * swap_dup_entry_direct() - Increase reference count of a swap entry by o= ne. * @entry: first swap entry from which we want to increase the refcount. - * @nr: Number of entries in range. * * Returns 0 for success, or -ENOMEM if a swap_count_continuation is requi= red * but could not be atomically allocated. Returns 0, just as if it succee= ded, * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), wh= ich * might occur if a page table entry has got corrupted. * - * Note that we are currently not handling the case where nr > 1 and we ne= ed to - * add swap count continuation. This is OK, because no such user exists - = shmem - * is the only user that can pass nr > 1, and it never re-duplicates any s= wap - * entry it owns. + * Context: Caller must ensure there is no race condition on the reference + * owner. e.g., locking the PTL of a PTE containing the entry being increa= sed. 
*/ -int swap_duplicate_nr(swp_entry_t entry, int nr) +int swap_dup_entry_direct(swp_entry_t entry) { int err =3D 0; - - while (!err && __swap_duplicate(entry, 1, nr) =3D=3D -ENOMEM) + while (!err && __swap_duplicate(entry, 1, 1) =3D=3D -ENOMEM) err =3D add_swap_count_continuation(entry, GFP_ATOMIC); return err; } --=20 2.52.0
From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:44 +0800
Subject: [PATCH v5 15/19] mm, swap: add folio to swap cache directly on allocation
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20251220-swap-table-p2-v5-15-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org

From: Kairui Song

The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. SWAP_HAS_CACHE is being deprecated as it has caused a lot of confusion. This pinning usage can be dropped by adding the folio to the swap cache directly on allocation. All swap allocations are folio-based now (except for hibernation), so the swap allocator can always take the folio as a parameter. And since both the swap cache (swap table) and the swap map are now protected by the cluster lock, scanning the map and inserting the folio can be done in the same critical section. This eliminates the time window in which a slot is pinned by SWAP_HAS_CACHE but has no cache yet, and avoids taking the lock multiple times. This is both a cleanup and an optimization.
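[Editor's illustration, not part of the patch: a minimal sketch of the new single-critical-section flow. demo_alloc_and_add() is a made-up name; the helpers and the cluster lock are the ones used by this series, with cluster selection and error handling omitted.]

/*
 * Sketch only: claim the slots and publish the folio in the swap
 * table within one cluster-locked critical section.
 */
static bool demo_alloc_and_add(struct swap_info_struct *si,
			       struct swap_cluster_info *ci,
			       struct folio *folio, unsigned long offset)
{
	unsigned long nr_pages =3D folio_nr_pages(folio);

	spin_lock(&ci->lock);
	/* Claim the slots; their swap count stays zero until unmap. */
	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
	/* Insert the folio right away: no pinned-but-uncached window. */
	__swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
	spin_unlock(&ci->lock);
	return true;
}

[Previously, the SWAP_HAS_CACHE pinning and the swap cache insertion took the cluster lock twice, with a window in between where the slot was pinned but had no folio attached.]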
Signed-off-by: Kairui Song Reviewed-by: Baoquan He Suggested-by: Chris Li --- include/linux/swap.h | 5 -- mm/swap.h | 10 +--- mm/swap_state.c | 58 +++++++++++-------- mm/swapfile.c | 161 ++++++++++++++++++++++-------------------------= ---- 4 files changed, 105 insertions(+), 129 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index aaa868f60b9c..517d24e96d8c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void) } =20 extern void si_swapinfo(struct sysinfo *); -void put_swap_folio(struct folio *folio, swp_entry_t entry); extern int add_swap_count_continuation(swp_entry_t, gfp_t); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); @@ -533,10 +532,6 @@ static inline void swap_put_entries_direct(swp_entry_t= ent, int nr) { } =20 -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp) -{ -} - static inline int __swap_count(swp_entry_t entry) { return 0; diff --git a/mm/swap.h b/mm/swap.h index 9ed12936b889..ec1ef7d0c35b 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct= *si, */ struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, - void **shadow, bool alloc); void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, bool *alloced); /* Below helpers require the caller to lock and pass in the swap cluster. = */ +void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry); void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow); void __swap_cache_replace_folio(struct swap_cluster_info *ci, @@ -459,12 +459,6 @@ static inline void *swap_cache_get_shadow(swp_entry_t = entry) return NULL; } =20 -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t en= try, - void **shadow, bool alloc) -{ - return -ENOENT; -} - static inline void swap_cache_del_folio(struct folio *folio) { } diff --git a/mm/swap_state.c b/mm/swap_state.c index 327c051d7cd0..29fa8d313a79 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -122,35 +122,56 @@ void *swap_cache_get_shadow(swp_entry_t entry) return NULL; } =20 +void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry) +{ + unsigned long new_tb; + unsigned int ci_start, ci_off, ci_end; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); + + new_tb =3D folio_to_swp_tb(folio); + ci_start =3D swp_cluster_offset(entry); + ci_off =3D ci_start; + ci_end =3D ci_start + nr_pages; + do { + VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off))); + __swap_table_set(ci, ci_off, new_tb); + } while (++ci_off < ci_end); + + folio_ref_add(folio, nr_pages); + folio_set_swapcache(folio); + folio->swap =3D entry; + + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); +} + /** * swap_cache_add_folio - Add a folio into the swap cache. * @folio: The folio to be added. * @entry: The swap entry corresponding to the folio. * @gfp: gfp_mask for XArray node allocation. 
* @shadowp: If a shadow is found, return the shadow. - * @alloc: If it's the allocator that is trying to insert a folio. Allocat=or - * sets SWAP_HAS_CACHE to pin slots before insert so skip map upda=te. * * Context: Caller must ensure @entry is valid and protect the swap device * with reference count or locks. */ -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, - void **shadowp, bool alloc) +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadowp) { int err; void *shadow =3D NULL; + unsigned long old_tb; struct swap_info_struct *si; - unsigned long old_tb, new_tb; struct swap_cluster_info *ci; unsigned int ci_start, ci_off, ci_end, offset; unsigned long nr_pages =3D folio_nr_pages(folio); =20 - VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); - VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); - VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); - si =3D __swap_entry_to_info(entry); - new_tb =3D folio_to_swp_tb(folio); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; ci_off =3D ci_start; @@ -166,7 +187,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry= _t entry, err =3D -EEXIST; goto failed; } - if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset)= ))) { + if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) { err =3D -ENOENT; goto failed; } @@ -182,20 +203,11 @@ int swap_cache_add_folio(struct folio *folio, swp_ent= ry_t entry, * Still need to pin the slots with SWAP_HAS_CACHE since * swap allocator depends on that. */ - if (!alloc) - __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); - __swap_table_set(ci, ci_off, new_tb); + __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); offset++; } while (++ci_off < ci_end); - - folio_ref_add(folio, nr_pages); - folio_set_swapcache(folio); - folio->swap =3D entry; + __swap_cache_add_folio(ci, folio, entry); swap_cluster_unlock(ci); - - node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); - lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); - if (shadowp) *shadowp =3D shadow; return 0; @@ -464,7 +476,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, __folio_set_locked(folio); __folio_set_swapbacked(folio); for (;;) { - ret =3D swap_cache_add_folio(folio, entry, &shadow, false); + ret =3D swap_cache_add_folio(folio, entry, &shadow); if (!ret) break; =20 diff --git a/mm/swapfile.c b/mm/swapfile.c index f812fdea68b3..de0f4c1352cb 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -884,28 +884,57 @@ static void swap_cluster_assert_table_empty(struct sw= ap_cluster_info *ci, } } =20 -static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_c= luster_info *ci, - unsigned int start, unsigned char usage, - unsigned int order) +static bool cluster_alloc_range(struct swap_info_struct *si, + struct swap_cluster_info *ci, + struct folio *folio, + unsigned int offset) { - unsigned int nr_pages =3D 1 << order; + unsigned long nr_pages; + unsigned int order; =20 lockdep_assert_held(&ci->lock); =20 if (!(si->flags & SWP_WRITEOK)) return false; =20 + /* + * All mm swap allocations start with a folio (folio_alloc_swap), + * which is also the only allocation path for large orders. + * Such swap slots start with count =3D=3D 0 and will be increased + * upon folio unmap. + * + * Else, it's an exclusive order 0 allocation for hibernation. + * The slot starts with count =3D=3D 1 and never increases.
+ */ + if (likely(folio)) { + order =3D folio_order(folio); + nr_pages =3D 1 << order; + /* + * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries. + * This is the legacy allocation behavior, will drop it very soon. + */ + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); + __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset)); + } else if (IS_ENABLED(CONFIG_HIBERNATION)) { + order =3D 0; + nr_pages =3D 1; + WARN_ON_ONCE(si->swap_map[offset]); + si->swap_map[offset] =3D 1; + swap_cluster_assert_table_empty(ci, offset, 1); + } else { + /* Allocation without folio is only possible with hibernation */ + WARN_ON_ONCE(1); + return false; + } + /* * The first allocation in a cluster makes the * cluster exclusive to this order */ if (cluster_is_empty(ci)) ci->order =3D order; - - memset(si->swap_map + start, usage, nr_pages); - swap_cluster_assert_table_empty(ci, start, nr_pages); - swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; + swap_range_alloc(si, nr_pages); =20 return true; } @@ -913,13 +942,12 @@ static bool cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster /* Try use a new cluster for current CPU and allocate from it. */ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned int order, - unsigned char usage) + struct folio *folio, unsigned long offset) { unsigned int next =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; unsigned long start =3D ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); + unsigned int order =3D likely(folio) ? folio_order(folio) : 0; unsigned int nr_pages =3D 1 << order; bool need_reclaim, ret, usable; =20 @@ -943,7 +971,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, if (!ret) continue; } - if (!cluster_alloc_range(si, ci, offset, usage, order)) + if (!cluster_alloc_range(si, ci, folio, offset)) break; found =3D offset; offset +=3D nr_pages; @@ -965,8 +993,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, =20 static unsigned int alloc_swap_scan_list(struct swap_info_struct *si, struct list_head *list, - unsigned int order, - unsigned char usage, + struct folio *folio, bool scan_all) { unsigned int found =3D SWAP_ENTRY_INVALID; @@ -978,7 +1005,7 @@ static unsigned int alloc_swap_scan_list(struct swap_i= nfo_struct *si, if (!ci) break; offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, order, usage); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset); if (found) break; } while (scan_all); @@ -1039,10 +1066,11 @@ static void swap_reclaim_work(struct work_struct *w= ork) * Try to allocate swap entries with specified order and try set a new * cluster for current CPU too. */ -static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,= int order, - unsigned char usage) +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, + struct folio *folio) { struct swap_cluster_info *ci; + unsigned int order =3D likely(folio) ? 
folio_order(folio) : 0; unsigned int offset =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; =20 /* @@ -1064,8 +1092,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, - order, usage); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset); } else { swap_cluster_unlock(ci); } @@ -1079,22 +1106,19 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, int o * to spread out the writes. */ if (si->flags & SWP_PAGE_DISCARD) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, order, usage, - false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); if (found) goto done; } =20 if (order < PMD_ORDER) { - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], - order, usage, true); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, = true); if (found) goto done; } =20 if (!(si->flags & SWP_PAGE_DISCARD)) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, order, usage, - false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); if (found) goto done; } @@ -1110,8 +1134,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o * failure is not critical. Scanning one cluster still * keeps the list rotated and reclaimed (for HAS_CACHE). */ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], order, - usage, false); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], folio, fal= se); if (found) goto done; } @@ -1125,13 +1148,11 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, int o * Clusters here have at least one usable slots and can't fail order 0 * allocation, but reclaim may drop si->lock and race with another user. */ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], - 0, usage, true); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true); if (found) goto done; =20 - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], - 0, usage, true); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true= ); if (found) goto done; } @@ -1322,12 +1343,12 @@ static bool get_swap_device_info(struct swap_info_s= truct *si) * Fast path try to get swap entries with specified order from current * CPU's swap entry pool (a cluster). 
*/ -static bool swap_alloc_fast(swp_entry_t *entry, - int order) +static bool swap_alloc_fast(struct folio *folio) { + unsigned int order =3D folio_order(folio); struct swap_cluster_info *ci; struct swap_info_struct *si; - unsigned int offset, found =3D SWAP_ENTRY_INVALID; + unsigned int offset; =20 /* * Once allocated, swap_info_struct will never be completely freed, @@ -1342,22 +1363,18 @@ static bool swap_alloc_fast(swp_entry_t *entry, if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE); - if (found) - *entry =3D swp_entry(si->type, found); + alloc_swap_scan_cluster(si, ci, folio, offset); } else { swap_cluster_unlock(ci); } =20 put_swap_device(si); - return !!found; + return folio_test_swapcache(folio); } =20 /* Rotate the device and switch to a new cluster */ -static void swap_alloc_slow(swp_entry_t *entry, - int order) +static void swap_alloc_slow(struct folio *folio) { - unsigned long offset; struct swap_info_struct *si, *next; =20 spin_lock(&swap_avail_lock); @@ -1367,13 +1384,11 @@ static void swap_alloc_slow(swp_entry_t *entry, plist_requeue(&si->avail_list, &swap_avail_head); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - offset =3D cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE); + cluster_alloc_swap_entry(si, folio); put_swap_device(si); - if (offset) { - *entry =3D swp_entry(si->type, offset); + if (folio_test_swapcache(folio)) return; - } - if (order) + if (folio_test_large(folio)) return; } =20 @@ -1438,7 +1453,6 @@ int folio_alloc_swap(struct folio *folio) { unsigned int order =3D folio_order(folio); unsigned int size =3D 1 << order; - swp_entry_t entry =3D {}; =20 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio); @@ -1463,39 +1477,23 @@ int folio_alloc_swap(struct folio *folio) =20 again: local_lock(&percpu_swap_cluster.lock); - if (!swap_alloc_fast(&entry, order)) - swap_alloc_slow(&entry, order); + if (!swap_alloc_fast(folio)) + swap_alloc_slow(folio); local_unlock(&percpu_swap_cluster.lock); =20 - if (unlikely(!order && !entry.val)) { + if (!order && unlikely(!folio_test_swapcache(folio))) { if (swap_sync_discard()) goto again; } =20 /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */ - if (mem_cgroup_try_charge_swap(folio, entry)) - goto out_free; + if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap))) + swap_cache_del_folio(folio); =20 - if (!entry.val) + if (unlikely(!folio_test_swapcache(folio))) return -ENOMEM; =20 - /* - * Allocator has pinned the slots with SWAP_HAS_CACHE - * so it should never fail - */ - WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); - - /* - * Allocator should always allocate aligned entries so folio based - * operations never crossed more than one cluster. - */ - VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio); - return 0; - -out_free: - put_swap_folio(folio, entry); - return -ENOMEM; } =20 /** @@ -1798,29 +1796,6 @@ static void swap_entries_free(struct swap_info_struc= t *si, partial_free_cluster(si, ci); } =20 -/* - * Called after dropping swapcache to decrease refcnt to swap entries. 
- */ -void put_swap_folio(struct folio *folio, swp_entry_t entry) -{ - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long offset =3D swp_offset(entry); - int size =3D 1 << swap_entry_order(folio_order(folio)); - - si =3D _swap_info_get(entry); - if (!si) - return; - - ci =3D swap_cluster_lock(si, offset); - if (swap_only_has_cache(si, offset, size)) - swap_entries_free(si, ci, entry, size); - else - for (int i =3D 0; i < size; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); - swap_cluster_unlock(ci); -} - int __swap_count(swp_entry_t entry) { struct swap_info_struct *si =3D __swap_entry_to_info(entry); @@ -2072,7 +2047,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type) * with swap table allocation. */ local_lock(&percpu_swap_cluster.lock); - offset =3D cluster_alloc_swap_entry(si, 0, 1); + offset =3D cluster_alloc_swap_entry(si, NULL); local_unlock(&percpu_swap_cluster.lock); if (offset) entry =3D swp_entry(si->type, offset); --=20 2.52.0
From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:45 +0800
Subject: [PATCH v5 16/19] mm, swap: check swap table directly for checking cache
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20251220-swap-table-p2-v5-16-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org

From: Kairui Song

Instead of looking at the swap map, check swap table directly to tell if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE.
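[Editor's illustration, not part of the patch: a usage sketch of the table-based check. demo_entry_is_cached() is invented for this example; the device pinning follows the rule stated in swap_cache_has_folio()'s kerneldoc below.]

/*
 * Sketch only: tell whether a swap slot still has a folio in the
 * swap cache without consulting SWAP_HAS_CACHE in the swap map.
 */
static bool demo_entry_is_cached(swp_entry_t entry)
{
	struct swap_info_struct *si;
	bool cached =3D false;

	si =3D get_swap_device(entry);	/* keep the device valid */
	if (si) {
		cached =3D swap_cache_has_folio(entry);
		put_swap_device(si);
	}
	return cached;
}

[Reading the swap table gives the authoritative answer, which is what the rest of the series relies on once the SWAP_HAS_CACHE flag is gone.]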
Signed-off-by: Kairui Song Reviewed-by: Baoquan He Suggested-by: Chris Li --- mm/swap.h | 11 ++++++++--- mm/swap_state.c | 16 ++++++++++++++++ mm/swapfile.c | 55 +++++++++++++++++++++++++++++-----------------------= --- mm/userfaultfd.c | 10 +++------- 4 files changed, 56 insertions(+), 36 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index ec1ef7d0c35b..3692e143eeba 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *= si, * swap entries in the page table, similar to locking swap cache folio. * - See the comment of get_swap_device() for more complex usage. */ +bool swap_cache_has_folio(swp_entry_t entry); struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); void swap_cache_del_folio(struct folio *folio); @@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry,= int max_nr, =20 static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) { - struct swap_info_struct *si =3D __swap_entry_to_info(entry); - pgoff_t offset =3D swp_offset(entry); int i; =20 /* @@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry= , int max_nr) * be in conflict with the folio in swap cache. */ for (i =3D 0; i < max_nr; i++) { - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE)) + if (swap_cache_has_folio(entry)) return i; + entry.val++; } =20 return i; @@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio, return 0; } =20 +static inline bool swap_cache_has_folio(swp_entry_t entry) +{ + return false; +} + static inline struct folio *swap_cache_get_folio(swp_entry_t entry) { return NULL; diff --git a/mm/swap_state.c b/mm/swap_state.c index 29fa8d313a79..0ff6c09ee702 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry) return NULL; } =20 +/** + * swap_cache_has_folio - Check if a swap slot has cache. + * @entry: swap entry indicating the slot. + * + * Context: Caller must ensure @entry is valid and protect the swap + * device with reference count or locks. + */ +bool swap_cache_has_folio(swp_entry_t entry) +{ + unsigned long swp_tb; + + swp_tb =3D swap_table_get(__swap_entry_to_cluster(entry), + swp_cluster_offset(entry)); + return swp_tb_is_folio(swp_tb); +} + /** * swap_cache_get_shadow - Looks up a shadow in the swap cache. * @entry: swap entry used for the lookup. diff --git a/mm/swapfile.c b/mm/swapfile.c index de0f4c1352cb..91436ee41446 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -792,23 +792,18 @@ static bool cluster_reclaim_range(struct swap_info_st= ruct *si, unsigned int nr_pages =3D 1 << order; unsigned long offset =3D start, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - int nr_reclaim; + unsigned long swp_tb; =20 spin_unlock(&ci->lock); do { - switch (READ_ONCE(map[offset])) { - case 0: + if (swap_count(READ_ONCE(map[offset]))) break; - case SWAP_HAS_CACHE: - nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); - if (nr_reclaim < 0) - goto out; - break; - default: - goto out; + swp_tb =3D swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swp_tb_is_folio(swp_tb)) { + if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0) + break; } } while (++offset < end); -out: spin_lock(&ci->lock); =20 /* @@ -829,37 +824,41 @@ static bool cluster_reclaim_range(struct swap_info_st= ruct *si, * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. 
*/ - for (offset =3D start; offset < end; offset++) - if (READ_ONCE(map[offset])) + for (offset =3D start; offset < end; offset++) { + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb)) return false; + } =20 return true; } =20 static bool cluster_scan_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages, + unsigned long offset, unsigned int nr_pages, bool *need_reclaim) { - unsigned long offset, end =3D start + nr_pages; + unsigned long end =3D offset + nr_pages; unsigned char *map =3D si->swap_map; + unsigned long swp_tb; =20 if (cluster_is_empty(ci)) return true; =20 - for (offset =3D start; offset < end; offset++) { - switch (READ_ONCE(map[offset])) { - case 0: - continue; - case SWAP_HAS_CACHE: + do { + if (swap_count(map[offset])) + return false; + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swp_tb_is_folio(swp_tb)) { + WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE)); if (!vm_swap_full()) return false; *need_reclaim =3D true; - continue; - default: - return false; + } else { + /* An entry with no count and no cache must be null */ + VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); } - } + } while (++offset < end); =20 return true; } @@ -1030,7 +1029,8 @@ static void swap_reclaim_full_clusters(struct swap_in= fo_struct *si, bool force) to_scan--; =20 while (offset < end) { - if (READ_ONCE(map[offset]) =3D=3D SWAP_HAS_CACHE) { + if (!swap_count(READ_ONCE(map[offset])) && + swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); @@ -1981,6 +1981,7 @@ void swap_put_entries_direct(swp_entry_t entry, int n= r) struct swap_info_struct *si; bool any_only_cache =3D false; unsigned long offset; + unsigned long swp_tb; =20 si =3D get_swap_device(entry); if (WARN_ON_ONCE(!si)) @@ -2005,7 +2006,9 @@ void swap_put_entries_direct(swp_entry_t entry, int n= r) */ for (offset =3D start_offset; offset < end_offset; offset +=3D nr) { nr =3D 1; - if (READ_ONCE(si->swap_map[offset]) =3D=3D SWAP_HAS_CACHE) { + swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, offset), + offset % SWAPFILE_CLUSTER); + if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_= tb)) { /* * Folios are always naturally aligned in swap so * advance forward to the next boundary. Zero means no diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index b11f81095fa5..46256e7a3d51 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1190,17 +1190,13 @@ static int move_swap_pte(struct mm_struct *mm, stru= ct vm_area_struct *dst_vma, * Check if the swap entry is cached after acquiring the src_pte * lock. Otherwise, we might miss a newly loaded swap cache folio. * - * Check swap_map directly to minimize overhead, READ_ONCE is sufficient. * We are trying to catch newly added swap cache, the only possible case= is * when a folio is swapped in and out again staying in swap cache, using= the * same entry before the PTE check above. The PTL is acquired and releas= ed - * twice, each time after updating the swap_map's flag. So holding - * the PTL here ensures we see the updated value. False positive is poss= ible, - * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the - * cache, or during the tiny synchronization window between swap cache a= nd - * swap_map, but it will be gone very quickly, worst result is retry jit= ters. + * twice, each time after updating the swap table.
So holding + * the PTL here ensures we see the updated value. */ - if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) { + if (swap_cache_has_folio(entry)) { double_pt_unlock(dst_ptl, src_ptl); return -EAGAIN; } --=20 2.52.0
From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:46 +0800
Subject: [PATCH v5 17/19] mm, swap: clean up and improve swap entries freeing
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20251220-swap-table-p2-v5-17-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org

From: Kairui Song

There are a few problems with the current freeing of swap entries. When freeing a set of swap entries directly (swap_put_entries_direct, typically from zapping the page table), it scans the whole swap region multiple times. First, it scans the whole region to check whether it can be batch freed and whether there is any cached folio. Then it does a batch free only if every entry in the region has a swap count of exactly one. If any entry is cached, even if only one, it has to walk the whole region again to clean up the cache. And if any entry is not in a consistent state with the other entries, it falls back to order 0 freeing; for example, if only one of them is cached, the batch free falls back. Besides, the current batch freeing workflow relies on the swap map's SWAP_HAS_CACHE bit for both the consistency check and the batch freeing, which isn't compatible with the swap table design. Tidy this up and introduce a new cluster-scoped helper for all swap entry freeing work.
It batch frees all contiguous entries and simply starts a new batch when an inconsistent entry is found. This may improve the batch size when the clusters are fragmented. It should also be more robust with more sanity checks, and it makes clear that a slot pinned by swap cache will be cleared upon cache reclaim. The cache reclaim scan is also now limited to each cluster. If a cluster has any clean swap cache left after putting the swap count, only that cluster is reclaimed instead of the whole region. And since a folio's entries are always in the same cluster, putting swap entries from a folio can also use the new helper directly. This should be both an optimization and a cleanup, and the new helper is adapted to the swap table. Signed-off-by: Kairui Song Reviewed-by: Baoquan He Suggested-by: Chris Li --- mm/swapfile.c | 238 +++++++++++++++++++++++-------------------------= ---- 1 file changed, 96 insertions(+), 142 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 91436ee41446..9fbb2f98219e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -55,12 +55,14 @@ static bool swap_count_continued(struct swap_info_struc= t *, pgoff_t, static void free_swap_count_continuations(struct swap_info_struct *); static void swap_entries_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages); + unsigned long start, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); -static bool swap_entries_put_map(struct swap_info_struct *si, - swp_entry_t entry, int nr); +static void swap_put_entry_locked(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -197,25 +199,6 @@ static bool swap_only_has_cache(struct swap_info_struc= t *si, return true; } =20 -static bool swap_is_last_map(struct swap_info_struct *si, - unsigned long offset, int nr_pages, bool *has_cache) -{ - unsigned char *map =3D si->swap_map + offset; - unsigned char *map_end =3D map + nr_pages; - unsigned char count =3D *map; - - if (swap_count(count) !=3D 1) - return false; - - while (++map < map_end) { - if (*map !=3D count) - return false; - } - - *has_cache =3D !!(count & SWAP_HAS_CACHE); - return true; -} - /* * returns number of pages in the folio that backs the swap entry. If posi= tive, * the folio was reclaimed. If negative, the folio was not reclaimed. If 0= , no @@ -1439,6 +1422,76 @@ static bool swap_sync_discard(void) return false; } =20 +/** + * swap_put_entries_cluster - Decrease the swap count of a set of slots. + * @si: The swap device. + * @start: start offset of slots. + * @nr: number of slots. + * @reclaim_cache: if true, also reclaim the swap cache. + * + * This helper decreases the swap count of a set of slots and tries to + * batch free them. Also reclaims the swap cache if @reclaim_cache is true. + * Context: The caller must ensure that all slots belong to the same + * cluster and that their swap count won't underflow.
+ */ +static void swap_put_entries_cluster(struct swap_info_struct *si, + unsigned long start, int nr, + bool reclaim_cache) +{ + unsigned long offset =3D start, end =3D start + nr; + unsigned long batch_start =3D SWAP_ENTRY_INVALID; + struct swap_cluster_info *ci; + bool need_reclaim =3D false; + unsigned int nr_reclaimed; + unsigned long swp_tb; + unsigned int count; + + ci =3D swap_cluster_lock(si, offset); + do { + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + count =3D si->swap_map[offset]; + VM_WARN_ON(swap_count(count) < 1 || count =3D=3D SWAP_MAP_BAD); + if (swap_count(count) =3D=3D 1) { + /* count =3D=3D 1 and non-cached slots will be batch freed. */ + if (!swp_tb_is_folio(swp_tb)) { + if (!batch_start) + batch_start =3D offset; + continue; + } + /* count will be 0 after put, slot can be reclaimed */ + VM_WARN_ON(!(count & SWAP_HAS_CACHE)); + need_reclaim =3D true; + } + /* + * A count !=3D 1 or cached slot can't be freed. Put its swap + * count and then free the interrupted pending batch. Cached + * slots will be freed when folio is removed from swap cache + * (__swap_cache_del_folio). + */ + swap_put_entry_locked(si, ci, offset, 1); + if (batch_start) { + swap_entries_free(si, ci, batch_start, offset - batch_start); + batch_start =3D SWAP_ENTRY_INVALID; + } + } while (++offset < end); + + if (batch_start) + swap_entries_free(si, ci, batch_start, offset - batch_start); + swap_cluster_unlock(ci); + + if (!need_reclaim || !reclaim_cache) + return; + + offset =3D start; + do { + nr_reclaimed =3D __try_to_reclaim_swap(si, offset, + TTRS_UNMAPPED | TTRS_FULL); + offset++; + if (nr_reclaimed) + offset =3D round_up(offset, abs(nr_reclaimed)); + } while (offset < end); +} + /** * folio_alloc_swap - allocate swap space for a folio * @folio: folio we want to move to swap @@ -1544,6 +1597,7 @@ void folio_put_swap(struct folio *folio, struct page = *subpage) { swp_entry_t entry =3D folio->swap; unsigned long nr_pages =3D folio_nr_pages(folio); + struct swap_info_struct *si =3D __swap_entry_to_info(entry); =20 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); @@ -1553,7 +1607,7 @@ void folio_put_swap(struct folio *folio, struct page = *subpage) nr_pages =3D 1; } =20 - swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages); + swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false); } =20 static struct swap_info_struct *_swap_info_get(swp_entry_t entry) @@ -1590,12 +1644,11 @@ static struct swap_info_struct *_swap_info_get(swp_= entry_t entry) return NULL; } =20 -static unsigned char swap_entry_put_locked(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, - unsigned char usage) +static void swap_put_entry_locked(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage) { - unsigned long offset =3D swp_offset(entry); unsigned char count; unsigned char has_cache; =20 @@ -1621,9 +1674,7 @@ static unsigned char swap_entry_put_locked(struct swa= p_info_struct *si, if (usage) WRITE_ONCE(si->swap_map[offset], usage); else - swap_entries_free(si, ci, entry, 1); - - return usage; + swap_entries_free(si, ci, offset, 1); } =20 /* @@ -1691,70 +1742,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t= entry) return NULL; } =20 -static bool swap_entries_put_map(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - unsigned long offset =3D swp_offset(entry); - struct swap_cluster_info *ci; - bool has_cache =3D false; - 
unsigned char count; - int i; - - if (nr <=3D 1) - goto fallback; - count =3D swap_count(data_race(si->swap_map[offset])); - if (count !=3D 1) - goto fallback; - - ci =3D swap_cluster_lock(si, offset); - if (!swap_is_last_map(si, offset, nr, &has_cache)) { - goto locked_fallback; - } - if (!has_cache) - swap_entries_free(si, ci, entry, nr); - else - for (i =3D 0; i < nr; i++) - WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); - swap_cluster_unlock(ci); - - return has_cache; - -fallback: - ci =3D swap_cluster_lock(si, offset); -locked_fallback: - for (i =3D 0; i < nr; i++, entry.val++) { - count =3D swap_entry_put_locked(si, ci, entry, 1); - if (count =3D=3D SWAP_HAS_CACHE) - has_cache =3D true; - } - swap_cluster_unlock(ci); - return has_cache; -} - -/* - * Only functions with "_nr" suffix are able to free entries spanning - * cross multi clusters, so ensure the range is within a single cluster - * when freeing entries with functions without "_nr" suffix. - */ -static bool swap_entries_put_map_nr(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - int cluster_nr, cluster_rest; - unsigned long offset =3D swp_offset(entry); - bool has_cache =3D false; - - cluster_rest =3D SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER; - while (nr) { - cluster_nr =3D min(nr, cluster_rest); - has_cache |=3D swap_entries_put_map(si, entry, cluster_nr); - cluster_rest =3D SWAPFILE_CLUSTER; - nr -=3D cluster_nr; - entry.val +=3D cluster_nr; - } - - return has_cache; -} - /* * Check if it's the last ref of swap entry in the freeing path. */ @@ -1769,9 +1756,9 @@ static inline bool __maybe_unused swap_is_last_ref(un= signed char count) */ static void swap_entries_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages) + unsigned long offset, unsigned int nr_pages) { - unsigned long offset =3D swp_offset(entry); + swp_entry_t entry =3D swp_entry(si->type, offset); unsigned char *map =3D si->swap_map + offset; unsigned char *map_end =3D map + nr_pages; =20 @@ -1978,10 +1965,8 @@ void swap_put_entries_direct(swp_entry_t entry, int = nr) { const unsigned long start_offset =3D swp_offset(entry); const unsigned long end_offset =3D start_offset + nr; + unsigned long offset, cluster_end; struct swap_info_struct *si; - bool any_only_cache =3D false; - unsigned long offset; - unsigned long swp_tb; =20 si =3D get_swap_device(entry); if (WARN_ON_ONCE(!si)) @@ -1989,44 +1974,13 @@ void swap_put_entries_direct(swp_entry_t entry, int= nr) if (WARN_ON_ONCE(end_offset > si->max)) goto out; =20 - /* - * First free all entries in the range. - */ - any_only_cache =3D swap_entries_put_map_nr(si, entry, nr); - - /* - * Short-circuit the below loop if none of the entries had their - * reference drop to zero. - */ - if (!any_only_cache) - goto out; - - /* - * Now go back over the range trying to reclaim the swap cache. - */ - for (offset =3D start_offset; offset < end_offset; offset +=3D nr) { - nr =3D 1; - swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, offset), - offset % SWAPFILE_CLUSTER); - if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_= tb)) { - /* - * Folios are always naturally aligned in swap so - * advance forward to the next boundary. Zero means no - * folio was found for the swap entry, so advance by 1 - * in this case. Negative value means folio was found - * but could not be reclaimed. Here we can still advance - * to the next boundary. 
- */ - nr =3D __try_to_reclaim_swap(si, offset, - TTRS_UNMAPPED | TTRS_FULL); - if (nr =3D=3D 0) - nr =3D 1; - else if (nr < 0) - nr =3D -nr; - nr =3D ALIGN(offset + 1, nr) - offset; - } - } - + /* Put entries and reclaim cache in each cluster */ + offset =3D start_offset; + do { + cluster_end =3D min(round_up(offset + 1, SWAPFILE_CLUSTER), end_offset); + swap_put_entries_cluster(si, offset, cluster_end - offset, true); + offset =3D cluster_end; + } while (offset < end_offset); out: put_swap_device(si); } @@ -2073,7 +2027,7 @@ void swap_free_hibernation_slot(swp_entry_t entry) return; =20 ci =3D swap_cluster_lock(si, offset); - swap_entry_put_locked(si, ci, entry, 1); + swap_put_entry_locked(si, ci, offset, 1); WARN_ON(swap_entry_swapped(si, entry)); swap_cluster_unlock(ci); =20 @@ -3828,10 +3782,10 @@ void __swapcache_clear_cached(struct swap_info_stru= ct *si, swp_entry_t entry, unsigned int nr) { if (swap_only_has_cache(si, swp_offset(entry), nr)) { - swap_entries_free(si, ci, entry, nr); + swap_entries_free(si, ci, swp_offset(entry), nr); } else { for (int i =3D 0; i < nr; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE); } } =20 @@ -3952,7 +3906,7 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) * into, carry if so, or else fail until a new continuation page is alloca= ted; * when the original swap_map count is decremented from 0 with continuatio= n, * borrow from the continuation and report whether it still holds more. - * Called while __swap_duplicate() or caller of swap_entry_put_locked() + * Called while __swap_duplicate() or caller of swap_put_entry_locked() * holds cluster lock. */ static bool swap_count_continued(struct swap_info_struct *si, --=20 2.52.0 From nobody Sun Feb 8 20:53:22 2026 Received: from mail-pl1-f174.google.com (mail-pl1-f174.google.com [209.85.214.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 57598347BA8 for ; Fri, 19 Dec 2025 19:45:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.174 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766173546; cv=none; b=JQcLgQR33K7PHt3+q56WVd+PnLZ46CrGNYDz43BQ4Da9y5fv56aVCF90VpKF+/oIoqDAUfFJeJh+a7cLZ7l9vkqFTvfkDG5RUYBRqFDNOqtE0p90w3j1hy3iw4owYm1+yv/E4nBCiJDVGwU0UZg8L+nA6z19m/LUYGeA2YJfO8o= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766173546; c=relaxed/simple; bh=/0Bn/3fGA4HduKisuBkgXAXhXXho2iYOP8RHdzU3XvE=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=Gj5NL4mr7Dc1W0fwV7QNOdvHWlmjpxmrcdtZJywTxTVVoyQIaVQqE496IPWa+YWpcDBhxiQdhGrPfTaZkrv0BTlcaRZ4DsThzACFMwlC5iJBER3ZPH6hSDTQ3UaLFcSIX/2T3jTLA5pO6UObjcja5clUat+/ebVlRqdpBri4a/k= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=esR4EHwT; arc=none smtp.client-ip=209.85.214.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="esR4EHwT" Received: by mail-pl1-f174.google.com with 
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:47 +0800
Subject: [PATCH v5 18/19] mm, swap: drop the SWAP_HAS_CACHE flag
Message-Id: <20251220-swap-table-p2-v5-18-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
    Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
    Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes,
    "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

Now the swap cache is managed by the swap table, and all swap cache
users check the swap table directly for the swap cache state.
SWAP_HAS_CACHE is only left as a temporary pin in two places: before the
first 0 -> 1 increase of a slot's swap count (swap_dup_entries) after
swap allocation (folio_alloc_swap), and before the final free of slots
still pinned by a folio in the swap cache (put_swap_folio). Drop both
usages.

For the first dup, the SWAP_HAS_CACHE pin was hard to kill because the
flag used to carry multiple meanings beyond just "this slot is cached".
That has now been simplified, and the first dup is defined to always
happen with the folio locked in the swap cache (folio_dup_swap). So stop
checking the SWAP_HAS_CACHE bit, check the swap cache (swap table)
directly instead, and add a WARN if a swap entry's count is increased
for the first time while its folio is not in the swap cache.

As for freeing, just let the swap cache free all swap entries of a folio
whose swap count is zero directly upon folio removal. Batch freeing has
also just been cleaned up to check swap cache usage through the swap
table: a slot that still has swap cache in the swap table will not be
freed until its cache is gone, and no SWAP_HAS_CACHE bit is involved
anymore. Besides, the removal of a folio and the freeing of its slots
now happen in the same critical section, which should improve
performance.

After these two changes, SWAP_HAS_CACHE has no users left. Swap cache
synchronization is also done by the swap table directly, so using
SWAP_HAS_CACHE to pin a slot before adding the cache is no longer needed
either. Remove all related logic and helpers. swap_map is now only used
for tracking the count, so all swap_map users can read it directly and
skip the swap_count helper, which was previously needed to filter out
the SWAP_HAS_CACHE bit.

The idea of dropping SWAP_HAS_CACHE and using the swap table directly
came from Chris's idea of merging all swap metadata usage into one
place.
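The two invariants the patch relies on can be modeled in isolation like
this (an illustrative sketch, not kernel code; slot_freeable, dup_entry
and table_has_folio are made-up names, with table_has_folio standing in
for swp_tb_is_folio() on the slot's swap table entry):

	#include <assert.h>
	#include <stdbool.h>

	/* A slot stays allocated while either kind of reference exists. */
	static bool slot_freeable(unsigned char count, bool table_has_folio)
	{
		return count == 0 && !table_has_folio;
	}

	/*
	 * The first 0 -> 1 dup must happen with the folio in the swap
	 * cache; this mirrors the WARN added to swap_dup_entries().
	 */
	static unsigned char dup_entry(unsigned char count, bool table_has_folio)
	{
		assert(count > 0 || table_has_folio);
		return count + 1;
	}

	int main(void)
	{
		unsigned char c = dup_entry(0, true);	/* first dup: folio must be cached */
		return slot_freeable(c - 1, false) ? 0 : 1;
	}

With these two rules, the swap map byte carries nothing but the count,
and the swap table alone answers whether a slot is cached.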
Suggested-by: Chris Li
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 include/linux/swap.h |   1 -
 mm/swap.h            |  13 ++--
 mm/swap_state.c      |  28 +++++----
 mm/swapfile.c        | 168 +++++++++++++++++-------------------------
 4 files changed, 78 insertions(+), 132 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 517d24e96d8c..62fc7499b408 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,7 +224,6 @@ enum {
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
 /* Bit flag in swap_map */
-#define SWAP_HAS_CACHE	0x40 /* Flag page is cached, in first swap_map */
 #define COUNT_CONTINUED	0x80 /* Flag swap_map continuation for full count */
 
 /* Special value in first swap_map */
diff --git a/mm/swap.h b/mm/swap.h
index 3692e143eeba..b2d83e661132 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -205,6 +205,11 @@ int folio_alloc_swap(struct folio *folio);
 int folio_dup_swap(struct folio *folio, struct page *subpage);
 void folio_put_swap(struct folio *folio, struct page *subpage);
 
+/* For internal use */
+extern void swap_entries_free(struct swap_info_struct *si,
+			      struct swap_cluster_info *ci,
+			      unsigned long offset, unsigned int nr_pages);
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
@@ -256,14 +261,6 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
 	return folio_entry.val == round_down(entry.val, nr_pages);
 }
 
-/* Temporary internal helpers */
-void __swapcache_set_cached(struct swap_info_struct *si,
-			    struct swap_cluster_info *ci,
-			    swp_entry_t entry);
-void __swapcache_clear_cached(struct swap_info_struct *si,
-			      struct swap_cluster_info *ci,
-			      swp_entry_t entry, unsigned int nr);
-
 /*
 * All swap cache helpers below require the caller to ensure the swap entries
 * used are valid and stablize the device by any of the following ways:
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0ff6c09ee702..73e6166a5013 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -211,17 +211,6 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
 		shadow = swp_tb_to_shadow(old_tb);
 		offset++;
 	} while (++ci_off < ci_end);
-
-	ci_off = ci_start;
-	offset = swp_offset(entry);
-	do {
-		/*
-		 * Still need to pin the slots with SWAP_HAS_CACHE since
-		 * swap allocator depends on that.
-		 */
-		__swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
-		offset++;
-	} while (++ci_off < ci_end);
 	__swap_cache_add_folio(ci, folio, entry);
 	swap_cluster_unlock(ci);
 	if (shadowp)
@@ -252,6 +241,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 	struct swap_info_struct *si;
 	unsigned long old_tb, new_tb;
 	unsigned int ci_start, ci_off, ci_end;
+	bool folio_swapped = false, need_free = false;
 	unsigned long nr_pages = folio_nr_pages(folio);
 
 	VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) != ci);
@@ -269,13 +259,27 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
 		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
 			     swp_tb_to_folio(old_tb) != folio);
+		if (__swap_count(swp_entry(si->type,
+					   swp_offset(entry) + ci_off - ci_start)))
+			folio_swapped = true;
+		else
+			need_free = true;
 	} while (++ci_off < ci_end);
 
 	folio->swap.val = 0;
 	folio_clear_swapcache(folio);
 	node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
 	lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
-	__swapcache_clear_cached(si, ci, entry, nr_pages);
+
+	if (!folio_swapped) {
+		swap_entries_free(si, ci, swp_offset(entry), nr_pages);
+	} else if (need_free) {
+		do {
+			if (!__swap_count(entry))
+				swap_entries_free(si, ci, swp_offset(entry), 1);
+			entry.val++;
+		} while (--nr_pages);
+	}
 }
 
 /**
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9fbb2f98219e..886f9d6d1a2c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -48,21 +48,18 @@
 #include
 #include "swap_table.h"
 #include "internal.h"
+#include "swap_table.h"
 #include "swap.h"
 
 static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 				 unsigned char);
 static void free_swap_count_continuations(struct swap_info_struct *);
-static void swap_entries_free(struct swap_info_struct *si,
-			      struct swap_cluster_info *ci,
-			      unsigned long start, unsigned int nr_pages);
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
 static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
 static void swap_put_entry_locked(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci,
-				  unsigned long offset,
-				  unsigned char usage);
+				  unsigned long offset);
 static bool folio_swapcache_freeable(struct folio *folio);
 static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
@@ -149,11 +146,6 @@ static struct swap_info_struct *swap_entry_to_info(swp_entry_t entry)
 	return swap_type_to_info(swp_type(entry));
 }
 
-static inline unsigned char swap_count(unsigned char ent)
-{
-	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
-}
-
 /*
 * Use the second highest bit of inuse_pages counter as the indicator
 * if one swap device is on the available plist, so the atomic can
@@ -185,15 +177,20 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
 #define TTRS_FULL	0x4
 
 static bool swap_only_has_cache(struct swap_info_struct *si,
-				unsigned long offset, int nr_pages)
+				struct swap_cluster_info *ci,
+				unsigned long offset, int nr_pages)
 {
+	unsigned int ci_off = offset % SWAPFILE_CLUSTER;
 	unsigned char *map = si->swap_map + offset;
 	unsigned char *map_end = map + nr_pages;
+	unsigned long swp_tb;
 
 	do {
-		VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
-		if (*map != SWAP_HAS_CACHE)
+		swp_tb = __swap_table_get(ci, ci_off);
+		VM_WARN_ON_ONCE(!swp_tb_is_folio(swp_tb));
+		if (*map)
 			return false;
+		++ci_off;
 	} while (++map < map_end);
 
 	return true;
@@ -248,12 +245,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 		goto out_unlock;
 
 	/*
-	 * It's safe to delete the folio from swap cache only if the folio's
-	 * swap_map is HAS_CACHE only, which means the slots have no page table
+	 * It's safe to delete the folio from swap cache only if the folio
+	 * is in swap cache with swap count == 0. The slots have no page table
	 * reference or pending writeback, and can't be allocated to others.
	 */
 	ci = swap_cluster_lock(si, offset);
-	need_reclaim = swap_only_has_cache(si, offset, nr_pages);
+	need_reclaim = swap_only_has_cache(si, ci, offset, nr_pages);
 	swap_cluster_unlock(ci);
 	if (!need_reclaim)
 		goto out_unlock;
@@ -779,7 +776,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 
 	spin_unlock(&ci->lock);
 	do {
-		if (swap_count(READ_ONCE(map[offset])))
+		if (READ_ONCE(map[offset]))
 			break;
 		swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
 		if (swp_tb_is_folio(swp_tb)) {
@@ -809,7 +806,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
	 */
 	for (offset = start; offset < end; offset++) {
 		swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
-		if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb))
+		if (map[offset] || !swp_tb_is_null(swp_tb))
 			return false;
 	}
 
@@ -829,11 +826,10 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 		return true;
 
 	do {
-		if (swap_count(map[offset]))
+		if (map[offset])
 			return false;
 		swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
 		if (swp_tb_is_folio(swp_tb)) {
-			WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE));
 			if (!vm_swap_full())
 				return false;
 			*need_reclaim = true;
@@ -891,11 +887,6 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
 	if (likely(folio)) {
 		order = folio_order(folio);
 		nr_pages = 1 << order;
-		/*
-		 * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries.
-		 * This is the legacy allocation behavior, will drop it very soon.
-		 */
-		memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
 		__swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
 	} else if (IS_ENABLED(CONFIG_HIBERNATION)) {
 		order = 0;
@@ -1012,8 +1003,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 		to_scan--;
 
 		while (offset < end) {
-			if (!swap_count(READ_ONCE(map[offset])) &&
-			    swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) {
+			if (!READ_ONCE(map[offset]) &&
+			    swp_tb_is_folio(swap_table_get(ci, offset % SWAPFILE_CLUSTER))) {
 				spin_unlock(&ci->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
 								   TTRS_ANYWAY);
@@ -1115,7 +1106,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
		 * Scan only one fragment cluster is good enough. Order 0
		 * allocation will surely success, and large allocation
		 * failure is not critical. Scanning one cluster still
-		 * keeps the list rotated and reclaimed (for HAS_CACHE).
+		 * keeps the list rotated and reclaimed (for clean swap cache).
		 */
 		found = alloc_swap_scan_list(si, &si->frag_clusters[order], folio, false);
 		if (found)
@@ -1450,8 +1441,8 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
 	do {
 		swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
 		count = si->swap_map[offset];
-		VM_WARN_ON(swap_count(count) < 1 || count == SWAP_MAP_BAD);
-		if (swap_count(count) == 1) {
+		VM_WARN_ON(count < 1 || count == SWAP_MAP_BAD);
+		if (count == 1) {
 			/* count == 1 and non-cached slots will be batch freed. */
 			if (!swp_tb_is_folio(swp_tb)) {
 				if (!batch_start)
@@ -1459,7 +1450,6 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
 				continue;
 			}
 			/* count will be 0 after put, slot can be reclaimed */
-			VM_WARN_ON(!(count & SWAP_HAS_CACHE));
 			need_reclaim = true;
 		}
 		/*
@@ -1468,7 +1458,7 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
		 * slots will be freed when folio is removed from swap cache
		 * (__swap_cache_del_folio).
		 */
-		swap_put_entry_locked(si, ci, offset, 1);
+		swap_put_entry_locked(si, ci, offset);
 		if (batch_start) {
 			swap_entries_free(si, ci, batch_start, offset - batch_start);
 			batch_start = SWAP_ENTRY_INVALID;
@@ -1625,7 +1615,8 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	offset = swp_offset(entry);
 	if (offset >= si->max)
 		goto bad_offset;
-	if (data_race(!si->swap_map[swp_offset(entry)]))
+	if (data_race(!si->swap_map[swp_offset(entry)]) &&
+	    !swap_cache_has_folio(entry))
 		goto bad_free;
 	return si;
 
@@ -1646,21 +1637,12 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 
 static void swap_put_entry_locked(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci,
-				  unsigned long offset,
-				  unsigned char usage)
+				  unsigned long offset)
 {
 	unsigned char count;
-	unsigned char has_cache;
 
 	count = si->swap_map[offset];
-
-	has_cache = count & SWAP_HAS_CACHE;
-	count &= ~SWAP_HAS_CACHE;
-
-	if (usage == SWAP_HAS_CACHE) {
-		VM_BUG_ON(!has_cache);
-		has_cache = 0;
-	} else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
+	if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
 		if (count == COUNT_CONTINUED) {
 			if (swap_count_continued(si, offset, count))
 				count = SWAP_MAP_MAX | COUNT_CONTINUED;
@@ -1670,10 +1652,8 @@ static void swap_put_entry_locked(struct swap_info_struct *si,
 			count--;
 	}
 
-	usage = count | has_cache;
-	if (usage)
-		WRITE_ONCE(si->swap_map[offset], usage);
-	else
+	WRITE_ONCE(si->swap_map[offset], count);
+	if (!count && !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER)))
 		swap_entries_free(si, ci, offset, 1);
 }
 
@@ -1742,21 +1722,13 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	return NULL;
 }
 
-/*
- * Check if it's the last ref of swap entry in the freeing path.
- */
-static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
-{
-	return (count == SWAP_HAS_CACHE) || (count == 1);
-}
-
 /*
 * Drop the last ref of swap entries, caller have to ensure all entries
 * belong to the same cgroup and cluster.
 */
-static void swap_entries_free(struct swap_info_struct *si,
-			      struct swap_cluster_info *ci,
-			      unsigned long offset, unsigned int nr_pages)
+void swap_entries_free(struct swap_info_struct *si,
+		       struct swap_cluster_info *ci,
+		       unsigned long offset, unsigned int nr_pages)
 {
 	swp_entry_t entry = swp_entry(si->type, offset);
 	unsigned char *map = si->swap_map + offset;
@@ -1769,7 +1741,7 @@ void swap_entries_free(struct swap_info_struct *si,
 
 	ci->count -= nr_pages;
 	do {
-		VM_BUG_ON(!swap_is_last_ref(*map));
+		VM_WARN_ON(*map > 1);
 		*map = 0;
 	} while (++map < map_end);
 
@@ -1788,7 +1760,7 @@ int __swap_count(swp_entry_t entry)
 	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	pgoff_t offset = swp_offset(entry);
 
-	return swap_count(si->swap_map[offset]);
+	return si->swap_map[offset];
 }
 
 /**
@@ -1803,7 +1775,7 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
 	int count;
 
 	ci = swap_cluster_lock(si, offset);
-	count = swap_count(si->swap_map[offset]);
+	count = si->swap_map[offset];
 	swap_cluster_unlock(ci);
 
 	return count && count != SWAP_MAP_BAD;
@@ -1830,7 +1802,7 @@ int swp_swapcount(swp_entry_t entry)
 
 	ci = swap_cluster_lock(si, offset);
 
-	count = swap_count(si->swap_map[offset]);
+	count = si->swap_map[offset];
 	if (!(count & COUNT_CONTINUED))
 		goto out;
 
@@ -1868,12 +1840,12 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 
 	ci = swap_cluster_lock(si, offset);
 	if (nr_pages == 1) {
-		if (swap_count(map[roffset]))
+		if (map[roffset])
 			ret = true;
 		goto unlock_out;
 	}
 	for (i = 0; i < nr_pages; i++) {
-		if (swap_count(map[offset + i])) {
+		if (map[offset + i]) {
 			ret = true;
 			break;
 		}
@@ -2027,7 +1999,7 @@ void swap_free_hibernation_slot(swp_entry_t entry)
 		return;
 
 	ci = swap_cluster_lock(si, offset);
-	swap_put_entry_locked(si, ci, offset, 1);
+	swap_put_entry_locked(si, ci, offset);
 	WARN_ON(swap_entry_swapped(si, entry));
 	swap_cluster_unlock(ci);
 
@@ -2433,6 +2405,7 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
 					  unsigned int prev)
 {
 	unsigned int i;
+	unsigned long swp_tb;
 	unsigned char count;
 
 	/*
@@ -2443,7 +2416,11 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
	 */
 	for (i = prev + 1; i < si->max; i++) {
 		count = READ_ONCE(si->swap_map[i]);
-		if (count && swap_count(count) != SWAP_MAP_BAD)
+		swp_tb = swap_table_get(__swap_offset_to_cluster(si, i),
+					i % SWAPFILE_CLUSTER);
+		if (count == SWAP_MAP_BAD)
+			continue;
+		if (count || swp_tb_is_folio(swp_tb))
 			break;
 		if ((i % LATENCY_LIMIT) == 0)
 			cond_resched();
@@ -3668,8 +3645,7 @@ void si_swapinfo(struct sysinfo *val)
 * Returns error code in following case.
 * - success -> 0
 * - swp_entry is invalid -> EINVAL
- * - swap-cache reference is requested but there is already one. -> EEXIST
- * - swap-cache reference is requested but the entry is not used. -> ENOENT
+ * - swap-mapped reference is requested but the entry is not used. -> ENOENT
 * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
 */
 static int swap_dup_entries(struct swap_info_struct *si,
@@ -3678,39 +3654,30 @@ static int swap_dup_entries(struct swap_info_struct *si,
 				unsigned char usage, int nr)
 {
 	int i;
-	unsigned char count, has_cache;
+	unsigned char count;
 
 	for (i = 0; i < nr; i++) {
 		count = si->swap_map[offset + i];
-
 		/*
		 * For swapin out, allocator never allocates bad slots. for
		 * swapin, readahead is guarded by swap_entry_swapped.
		 */
-		if (WARN_ON(swap_count(count) == SWAP_MAP_BAD))
+		if (WARN_ON(count == SWAP_MAP_BAD))
 			return -ENOENT;
-
-		has_cache = count & SWAP_HAS_CACHE;
-		count &= ~SWAP_HAS_CACHE;
-
-		if (!count && !has_cache) {
+		/*
+		 * Swap count duplication must be guarded by either swap cache folio (from
+		 * folio_dup_swap) or external lock of existing entry (from swap_dup_entry_direct).
+		 */
+		if (WARN_ON(!count &&
+			    !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))))
 			return -ENOENT;
-		} else if (usage == SWAP_HAS_CACHE) {
-			if (has_cache)
-				return -EEXIST;
-		} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
+		if (WARN_ON((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX))
 			return -EINVAL;
-		}
 	}
 
 	for (i = 0; i < nr; i++) {
 		count = si->swap_map[offset + i];
-		has_cache = count & SWAP_HAS_CACHE;
-		count &= ~SWAP_HAS_CACHE;
-
-		if (usage == SWAP_HAS_CACHE)
-			has_cache = SWAP_HAS_CACHE;
-		else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+		if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
 			count += usage;
 		else if (swap_count_continued(si, offset + i, count))
 			count = COUNT_CONTINUED;
@@ -3722,7 +3689,7 @@ static int swap_dup_entries(struct swap_info_struct *si,
 			return -ENOMEM;
 		}
 
-		WRITE_ONCE(si->swap_map[offset + i], count | has_cache);
+		WRITE_ONCE(si->swap_map[offset + i], count);
 	}
 
 	return 0;
@@ -3768,27 +3735,6 @@ int swap_dup_entry_direct(swp_entry_t entry)
 	return err;
 }
 
-/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */
-void __swapcache_set_cached(struct swap_info_struct *si,
-			    struct swap_cluster_info *ci,
-			    swp_entry_t entry)
-{
-	WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1));
-}
-
-/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock */
-void __swapcache_clear_cached(struct swap_info_struct *si,
-			      struct swap_cluster_info *ci,
-			      swp_entry_t entry, unsigned int nr)
-{
-	if (swap_only_has_cache(si, swp_offset(entry), nr)) {
-		swap_entries_free(si, ci, swp_offset(entry), nr);
-	} else {
-		for (int i = 0; i < nr; i++, entry.val++)
-			swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE);
-	}
-}
-
 /*
 * add_swap_count_continuation - called when a swap count is duplicated
 * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
@@ -3834,7 +3780,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 
 	ci = swap_cluster_lock(si, offset);
 
-	count = swap_count(si->swap_map[offset]);
+	count = si->swap_map[offset];
 
 	if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
 		/*
-- 
2.52.0

From nobody Sun Feb 8 20:53:22 2026
From: Kairui Song
Date: Sat, 20 Dec 2025 03:43:48 +0800
Subject: [PATCH v5 19/19] mm, swap: remove no longer needed _swap_info_get
Message-Id: <20251220-swap-table-p2-v5-19-8862a265a033@tencent.com>
References: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
In-Reply-To: <20251220-swap-table-p2-v5-0-8862a265a033@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
    Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
    Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes,
    "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

After consolidating its callers, only two users of _swap_info_get are
left: folio_free_swap and swp_swapcount.

folio_free_swap already holds the folio lock, and the folio must be in
the swap cache, so _swap_info_get is redundant there. swp_swapcount
should use get_swap_device instead: it increases the device's refcount,
which is actually a bit safer. Its only current caller is the smaps
walk, so the performance difference is tiny.

After these changes, _swap_info_get has no users left and can be removed
safely.
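The resulting calling convention for swp_swapcount() looks roughly like
this (an outline condensed from the hunks below, with the counting body
elided):

	struct swap_info_struct *si;

	si = get_swap_device(entry);	/* validates the entry, pins the device */
	if (!si)
		return 0;
	/* ... read the count (and any continuations) under the cluster lock ... */
	put_swap_device(si);		/* drop the device reference */
	return count;

Compared with _swap_info_get, the reference taken here keeps the device
from going away while the map is being read.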
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
Suggested-by: Chris Li
---
 mm/swapfile.c | 47 ++++++-----------------------------------------
 1 file changed, 6 insertions(+), 41 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 886f9d6d1a2c..1ee13390d910 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -83,9 +83,7 @@ bool swap_migration_ad_supported;
 #endif	/* CONFIG_MIGRATION */
 
 static const char Bad_file[] = "Bad swap file entry ";
-static const char Unused_file[] = "Unused swap file entry ";
 static const char Bad_offset[] = "Bad swap offset entry ";
-static const char Unused_offset[] = "Unused swap offset entry ";
 
 /*
 * all active swap_info_structs
@@ -1600,41 +1598,6 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
 	swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
 }
 
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	unsigned long offset;
-
-	if (!entry.val)
-		goto out;
-	si = swap_entry_to_info(entry);
-	if (!si)
-		goto bad_nofile;
-	if (data_race(!(si->flags & SWP_USED)))
-		goto bad_device;
-	offset = swp_offset(entry);
-	if (offset >= si->max)
-		goto bad_offset;
-	if (data_race(!si->swap_map[swp_offset(entry)]) &&
-	    !swap_cache_has_folio(entry))
-		goto bad_free;
-	return si;
-
-bad_free:
-	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
-	goto out;
-bad_offset:
-	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
-	goto out;
-bad_device:
-	pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val);
-	goto out;
-bad_nofile:
-	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
-out:
-	return NULL;
-}
-
 static void swap_put_entry_locked(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci,
 				  unsigned long offset)
@@ -1794,7 +1757,7 @@ int swp_swapcount(swp_entry_t entry)
 	pgoff_t offset;
 	unsigned char *map;
 
-	si = _swap_info_get(entry);
+	si = get_swap_device(entry);
 	if (!si)
 		return 0;
 
@@ -1824,6 +1787,7 @@ int swp_swapcount(swp_entry_t entry)
 	} while (tmp_count & COUNT_CONTINUED);
 out:
 	swap_cluster_unlock(ci);
+	put_swap_device(si);
 	return count;
 }
 
@@ -1858,11 +1822,12 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 static bool folio_swapped(struct folio *folio)
 {
 	swp_entry_t entry = folio->swap;
-	struct swap_info_struct *si = _swap_info_get(entry);
+	struct swap_info_struct *si;
 
-	if (!si)
-		return false;
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 
+	si = __swap_entry_to_info(entry);
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_entry_swapped(si, entry);
 
-- 
2.52.0