From nobody Tue Dec 2 00:25:23 2025
From: Kairui Song
Date: Tue, 25 Nov 2025 03:13:44 +0800
Subject: [PATCH v3 01/19] mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio
Message-Id: <20251125-swap-table-p2-v3-1-33f54f707a5c@tencent.com>
References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com>
In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

__read_swap_cache_async is widely used to allocate and ensure a folio is in
swapcache, or get the folio if a folio
is already there. It's not async, and it's not doing any read. Rename it to
better reflect its usage, and prepare it to be reworked as part of the new
swap cache APIs.

Also, add some comments for the function. Worth noting that the
skip_if_exists argument is a long-standing workaround that will be dropped
soon.

Reviewed-by: Yosry Ahmed
Acked-by: Chris Li
Reviewed-by: Barry Song
Reviewed-by: Nhat Pham
Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/swap.h       |  6 +++---
 mm/swap_state.c | 46 +++++++++++++++++++++++++++++++++-------------
 mm/swapfile.c   |  2 +-
 mm/zswap.c      |  4 ++--
 4 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index d034c13d8dd2..0fff92e42cfe 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -249,6 +249,9 @@ struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
 void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
 void swap_cache_del_folio(struct folio *folio);
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
+				     struct mempolicy *mpol, pgoff_t ilx,
+				     bool *alloced, bool skip_if_exists);
 /* Below helpers require the caller to lock and pass in the swap cluster. */
 void __swap_cache_del_folio(struct swap_cluster_info *ci,
 			    struct folio *folio, swp_entry_t entry, void *shadow);
@@ -261,9 +264,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug);
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
-		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
-		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5f97c6ae70a2..08252eaef32f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,9 +402,29 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 	}
 }
 
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
-		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
-		bool skip_if_exists)
+/**
+ * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
+ * @entry: the swapped out swap entry to be binded to the folio.
+ * @gfp_mask: memory allocation flags
+ * @mpol: NUMA memory allocation policy to be applied
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @new_page_allocated: sets true if allocation happened, false otherwise
+ * @skip_if_exists: if the slot is a partially cached state, return NULL.
+ *                  This is a workaround that would be removed shortly.
+ *
+ * Allocate a folio in the swap cache for one swap slot, typically before
+ * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
+ * @entry must have a non-zero swap count (swapped out).
+ * Currently only supports order 0.
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the existing folio if @entry is cached already. Returns
+ * NULL if failed due to -ENOMEM or @entry have a swap count < 1.
+ */
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
+				     struct mempolicy *mpol, pgoff_t ilx,
+				     bool *new_page_allocated,
+				     bool skip_if_exists)
 {
 	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct folio *folio;
@@ -452,12 +472,12 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			goto put_and_return;
 
 		/*
-		 * Protect against a recursive call to __read_swap_cache_async()
+		 * Protect against a recursive call to swap_cache_alloc_folio()
 		 * on the same entry waiting forever here because SWAP_HAS_CACHE
 		 * is set but the folio is not the swap cache yet. This can
 		 * happen today if mem_cgroup_swapin_charge_folio() below
 		 * triggers reclaim through zswap, which may call
-		 * __read_swap_cache_async() in the writeback path.
+		 * swap_cache_alloc_folio() in the writeback path.
 		 */
 		if (skip_if_exists)
 			goto put_and_return;
@@ -466,7 +486,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * We might race against __swap_cache_del_folio(), and
 		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
 		 * has not yet been cleared. Or race against another
-		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
+		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
 		 * in swap_map, but not yet added its folio to swap cache.
 		 */
 		schedule_timeout_uninterruptible(1);
@@ -525,7 +545,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		return NULL;
 
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
-	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
 	mpol_cond_put(mpol);
 
@@ -643,9 +663,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
-		folio = __read_swap_cache_async(
-				swp_entry(swp_type(entry), offset),
-				gfp_mask, mpol, ilx, &page_allocated, false);
+		folio = swap_cache_alloc_folio(
+			swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
+			&page_allocated, false);
 		if (!folio)
 			continue;
 		if (page_allocated) {
@@ -662,7 +682,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
 	/* The page was likely read above, so no need for plugging here */
-	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
 	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
@@ -767,7 +787,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 			if (!si)
 				continue;
 		}
-		folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+		folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
 						&page_allocated, false);
 		if (si)
 			put_swap_device(si);
@@ -789,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	lru_add_drain();
 skip:
 	/* The folio was likely read above, so no need for plugging here */
-	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
+	folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
 					&page_allocated, false);
 	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d12332423a06..ee6bb37ab174 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1570,7 +1570,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
  * CPU1				CPU2
  * do_swap_page()
  *   ...			swapoff+swapon
- *   __read_swap_cache_async()
+ *   swap_cache_alloc_folio()
  *     swapcache_prepare()
  *       __swap_duplicate()
  *         // check swap_map
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958..a7a2443912f4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1014,8 +1014,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 		return -EEXIST;
 
 	mpol = get_task_policy(current);
-	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
-			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+	folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
+			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
 	put_swap_device(si);
 	if (!folio)
 		return -ENOMEM;
-- 
2.52.0

From nobody Tue Dec 2 00:25:23 2025
From: Kairui Song
Date: Tue, 25 Nov 2025 03:13:45 +0800
Subject: [PATCH v3 02/19] mm, swap: split swap cache preparation loop into a standalone helper
Message-Id: <20251125-swap-table-p2-v3-2-33f54f707a5c@tencent.com>
References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com>
In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

To prepare for the removal of swap cache bypass swapin, introduce a new
helper that accepts an allocated and charged fresh folio, prepares the
folio, the swap map, and then adds the folio to the swap cache.

This doesn't change how the swap cache works yet; we still depend on
SWAP_HAS_CACHE in the swap map for synchronization. But all the
synchronization hacks now live in this single helper.

No feature change.

Acked-by: Chris Li
Reviewed-by: Barry Song
Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/swap_state.c | 197 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 109 insertions(+), 88 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 08252eaef32f..a8511ce43242 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,6 +402,97 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 	}
 }
 
+/**
+ * __swap_cache_prepare_and_add - Prepare the folio and add it to swap cache.
+ * @entry: swap entry to be bound to the folio.
+ * @folio: folio to be added.
+ * @gfp: memory allocation flags for charge, can be 0 if @charged if true.
+ * @charged: if the folio is already charged.
+ * @skip_if_exists: if the slot is in a cached state, return NULL.
+ *                  This is an old workaround that will be removed shortly.
+ *
+ * Update the swap_map and add folio as swap cache, typically before swapin.
+ * All swap slots covered by the folio must have a non-zero swap count.
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the folio being added on success. Returns the existing folio
+ * if @entry is already cached. Returns NULL if raced with swapin or swapoff.
+ */
+static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
+						  struct folio *folio,
+						  gfp_t gfp, bool charged,
+						  bool skip_if_exists)
+{
+	struct folio *swapcache;
+	void *shadow;
+	int ret;
+
+	/*
+	 * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio
+	 * into the swap cache. Loop with a schedule delay if raced with
+	 * another process setting SWAP_HAS_CACHE. This hackish loop will
+	 * be fixed very soon.
+	 */
+	for (;;) {
+		ret = swapcache_prepare(entry, folio_nr_pages(folio));
+		if (!ret)
+			break;
+
+		/*
+		 * The skip_if_exists is for protecting against a recursive
+		 * call to this helper on the same entry waiting forever
+		 * here because SWAP_HAS_CACHE is set but the folio is not
+		 * in the swap cache yet. This can happen today if
+		 * mem_cgroup_swapin_charge_folio() below triggers reclaim
+		 * through zswap, which may call this helper again in the
+		 * writeback path.
+		 *
+		 * Large order allocation also needs special handling on
+		 * race: if a smaller folio exists in cache, swapin needs
+		 * to fallback to order 0, and doing a swap cache lookup
+		 * might return a folio that is irrelevant to the faulting
+		 * entry because @entry is aligned down. Just return NULL.
+		 */
+		if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
+			return NULL;
+
+		/*
+		 * Check the swap cache again, we can only arrive
+		 * here because swapcache_prepare returns -EEXIST.
+		 */
+		swapcache = swap_cache_get_folio(entry);
+		if (swapcache)
+			return swapcache;
+
+		/*
+		 * We might race against __swap_cache_del_folio(), and
+		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
+		 * has not yet been cleared. Or race against another
+		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
+		 * in swap_map, but not yet added its folio to swap cache.
+		 */
+		schedule_timeout_uninterruptible(1);
+	}
+
+	__folio_set_locked(folio);
+	__folio_set_swapbacked(folio);
+
+	if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
+		put_swap_folio(folio, entry);
+		folio_unlock(folio);
+		return NULL;
+	}
+
+	swap_cache_add_folio(folio, entry, &shadow);
+	memcg1_swapin(entry, folio_nr_pages(folio));
+	if (shadow)
+		workingset_refault(folio, shadow);
+
+	/* Caller will initiate read into locked folio */
+	folio_add_lru(folio);
+	return folio;
+}
+
 /**
  * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
  * @entry: the swapped out swap entry to be binded to the folio.
@@ -428,99 +519,29 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
 {
 	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct folio *folio;
-	struct folio *new_folio = NULL;
 	struct folio *result = NULL;
-	void *shadow = NULL;
 
 	*new_page_allocated = false;
-	for (;;) {
-		int err;
-
-		/*
-		 * Check the swap cache first, if a cached folio is found,
-		 * return it unlocked. The caller will lock and check it.
-		 */
-		folio = swap_cache_get_folio(entry);
-		if (folio)
-			goto got_folio;
-
-		/*
-		 * Just skip read ahead for unused swap slot.
-		 */
-		if (!swap_entry_swapped(si, entry))
-			goto put_and_return;
-
-		/*
-		 * Get a new folio to read into from swap. Allocate it now if
-		 * new_folio not exist, before marking swap_map SWAP_HAS_CACHE,
-		 * when -EEXIST will cause any racers to loop around until we
-		 * add it to cache.
-		 */
-		if (!new_folio) {
-			new_folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
-			if (!new_folio)
-				goto put_and_return;
-		}
-
-		/*
-		 * Swap entry may have been freed since our caller observed it.
-		 */
-		err = swapcache_prepare(entry, 1);
-		if (!err)
-			break;
-		else if (err != -EEXIST)
-			goto put_and_return;
-
-		/*
-		 * Protect against a recursive call to swap_cache_alloc_folio()
-		 * on the same entry waiting forever here because SWAP_HAS_CACHE
-		 * is set but the folio is not the swap cache yet. This can
-		 * happen today if mem_cgroup_swapin_charge_folio() below
-		 * triggers reclaim through zswap, which may call
-		 * swap_cache_alloc_folio() in the writeback path.
-		 */
-		if (skip_if_exists)
-			goto put_and_return;
+	/* Check the swap cache again for readahead path. */
+	folio = swap_cache_get_folio(entry);
+	if (folio)
+		return folio;
 
-		/*
-		 * We might race against __swap_cache_del_folio(), and
-		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
-		 * has not yet been cleared. Or race against another
-		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
-		 * in swap_map, but not yet added its folio to swap cache.
-		 */
-		schedule_timeout_uninterruptible(1);
-	}
-
-	/*
-	 * The swap entry is ours to swap in. Prepare the new folio.
-	 */
-	__folio_set_locked(new_folio);
-	__folio_set_swapbacked(new_folio);
-
-	if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
-		goto fail_unlock;
-
-	swap_cache_add_folio(new_folio, entry, &shadow);
-	memcg1_swapin(entry, 1);
+	/* Skip allocation for unused swap slot for readahead path. */
+	if (!swap_entry_swapped(si, entry))
+		return NULL;
 
-	if (shadow)
-		workingset_refault(new_folio, shadow);
-
-	/* Caller will initiate read into locked new_folio */
-	folio_add_lru(new_folio);
-	*new_page_allocated = true;
-	folio = new_folio;
-got_folio:
-	result = folio;
-	goto put_and_return;
-
-fail_unlock:
-	put_swap_folio(new_folio, entry);
-	folio_unlock(new_folio);
-put_and_return:
-	if (!(*new_page_allocated) && new_folio)
-		folio_put(new_folio);
+	/* Allocate a new folio to be added into the swap cache. */
+	folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+	if (!folio)
+		return NULL;
+	/* Try add the new folio, returns existing folio or NULL on failure. */
+	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
+					      false, skip_if_exists);
+	if (result == folio)
+		*new_page_allocated = true;
+	else
+		folio_put(folio);
 	return result;
 }
 
-- 
2.52.0

From nobody Tue Dec 2 00:25:23 2025
From: Kairui Song
Date: Tue, 25 Nov 2025 03:13:46 +0800
Subject: [PATCH v3 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
Message-Id: <20251125-swap-table-p2-v3-3-33f54f707a5c@tencent.com>
References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com>
In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

Now the overhead of the swap cache is trivial. Bypassing the swap cache is
no longer a valid optimization. So unify the swapin path using the swap
cache. This changes the swapin behavior in two observable ways.

Readahead is now always disabled for SWP_SYNCHRONOUS_IO devices, which is a
huge win for most workloads: We used to rely on
`SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as the indicator to bypass
both the swap cache and readahead. The swap count check made bypassing
ineffective in many cases, and it's not a good indicator. The limitation
existed because the current swap design made it hard to decouple readahead
bypassing and swap cache bypassing [1]. We do want to always bypass
readahead for SWP_SYNCHRONOUS_IO devices, but bypassing the swap cache will
cause redundant IO and memory overhead. Now that swap cache bypassing is
gone, this swap count check can be dropped, and the newly introduced swap
path now always skips readahead.

The second thing here is that this enables large swapin for all swap
entries on SWP_SYNCHRONOUS_IO devices.
Previously, large swapin was also coupled with swap cache bypassing, so the swap count check also made large swapin less effective. This is now improved as well: large swapin is always supported for all SWP_SYNCHRONOUS_IO cases. To catch potential issues with large swapin, especially with page exclusiveness and the swap cache, more debug sanity checks and comments are added. Overall, though, the code is simpler. The new helper and routines will also be used by other components in later commits, and it is now possible to rely on the swap cache layer for resolving synchronization issues, which will also be done by a later commit. It is worth mentioning that for a large folio workload, this may cause more serious thrashing. This isn't a problem with this commit, but a generic large folio issue. For a 4K workload, this commit improves performance. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 137 +++++++++++++++++++++-------------------------------= ---- mm/swap.h | 6 +++ mm/swap_state.c | 27 +++++++++++ 3 files changed, 85 insertions(+), 85 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 6675e87eb7dd..41b690eb8c00 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4608,7 +4608,16 @@ static struct folio *alloc_swap_folio(struct vm_faul= t *vmf) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ =20 -static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq); +/* Sanity check that a folio is fully exclusive */ +static void check_swap_exclusive(struct folio *folio, swp_entry_t entry, + unsigned int nr_pages) +{ + /* Called under PT locked and folio locked, the swap count is stable */ + do { + VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) !=3D 1, folio); + entry.val++; + } while (--nr_pages); +} =20 /* * We enter with non-exclusive mmap_lock (to exclude vma changes, @@ -4621,17 +4630,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq); vm_fault_t do_swap_page(struct vm_fault *vmf) { struct vm_area_struct *vma =3D vmf->vma; - struct folio *swapcache, *folio 
=3D NULL; - DECLARE_WAITQUEUE(wait, current); + struct folio *swapcache =3D NULL, *folio; struct page *page; struct swap_info_struct *si =3D NULL; rmap_t rmap_flags =3D RMAP_NONE; - bool need_clear_cache =3D false; bool exclusive =3D false; softleaf_t entry; pte_t pte; vm_fault_t ret =3D 0; - void *shadow =3D NULL; int nr_pages; unsigned long page_idx; unsigned long address; @@ -4702,57 +4708,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio =3D swap_cache_get_folio(entry); if (folio) swap_update_readahead(folio, vma, vmf->address); - swapcache =3D folio; - if (!folio) { - if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && - __swap_count(entry) =3D=3D 1) { - /* skip swapcache */ + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { folio =3D alloc_swap_folio(vmf); if (folio) { - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - - nr_pages =3D folio_nr_pages(folio); - if (folio_test_large(folio)) - entry.val =3D ALIGN_DOWN(entry.val, nr_pages); /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread - * may finish swapin first, free the entry, and - * swapout reusing the same entry. It's - * undetectable as pte_same() returns true due - * to entry reuse. + * folio is charged, so swapin can only fail due + * to raced swapin and return NULL. */ - if (swapcache_prepare(entry, nr_pages)) { - /* - * Relax a bit to prevent rapid - * repeated page faults. 
- */ - add_wait_queue(&swapcache_wq, &wait); - schedule_timeout_uninterruptible(1); - remove_wait_queue(&swapcache_wq, &wait); - goto out_page; - } - need_clear_cache =3D true; - - memcg1_swapin(entry, nr_pages); - - shadow =3D swap_cache_get_shadow(entry); - if (shadow) - workingset_refault(folio, shadow); - - folio_add_lru(folio); - - /* To provide entry to swap_read_folio() */ - folio->swap =3D entry; - swap_read_folio(folio, NULL); - folio->private =3D NULL; + swapcache =3D swapin_folio(entry, folio); + if (swapcache !=3D folio) + folio_put(folio); + folio =3D swapcache; } } else { - folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, - vmf); - swapcache =3D folio; + folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); } =20 if (!folio) { @@ -4774,6 +4744,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); } =20 + swapcache =3D folio; ret |=3D folio_lock_or_retry(folio, vmf); if (ret & VM_FAULT_RETRY) goto out_release; @@ -4843,24 +4814,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } =20 - /* allocated large folios for SWP_SYNCHRONOUS_IO */ - if (folio_test_large(folio) && !folio_test_swapcache(folio)) { - unsigned long nr =3D folio_nr_pages(folio); - unsigned long folio_start =3D ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); - unsigned long idx =3D (vmf->address - folio_start) / PAGE_SIZE; - pte_t *folio_ptep =3D vmf->pte - idx; - pte_t folio_pte =3D ptep_get(folio_ptep); - - if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || - swap_pte_batch(folio_ptep, nr, folio_pte) !=3D nr) - goto out_nomap; - - page_idx =3D idx; - address =3D folio_start; - ptep =3D folio_ptep; - goto check_folio; - } - nr_pages =3D 1; page_idx =3D 0; address =3D vmf->address; @@ -4904,12 +4857,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio)); BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page)); =20 + /* + * If a large 
folio already belongs to an anon mapping, then we + * can just go on and map it partially. + * If not, with the large swapin check above failing, the page table + * has changed, so sub pages might have been charged to the wrong + * cgroup, or might even belong to shmem. So we have to free it and + * fall back. Nothing should have touched it: both anon and shmem + * check whether a large folio is fully applicable before use. + * + * This will be removed once we unify folio allocation in the swap cache + * layer, where allocation of a folio stabilizes the swap entries. + */ + if (!folio_test_anon(folio) && folio_test_large(folio) && + nr_pages !=3D folio_nr_pages(folio)) { + if (!WARN_ON_ONCE(folio_test_dirty(folio))) + swap_cache_del_folio(folio); + goto out_nomap; + } + /* * Check under PT lock (to protect against concurrent fork() sharing * the swap entry concurrently) for certainly exclusive pages. */ if (!folio_test_ksm(folio)) { + /* + * The can_swapin_thp check above ensures all PTEs have + * the same exclusiveness. Checking just one PTE is fine. + */ exclusive =3D pte_swp_exclusive(vmf->orig_pte); + if (exclusive) + check_swap_exclusive(folio, entry, nr_pages); if (folio !=3D swapcache) { /* * We have a fresh page that is not exposed to the @@ -4987,18 +4965,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) vmf->orig_pte =3D pte_advance_pfn(pte, page_idx); =20 /* ksm created a completely new copy */ - if (unlikely(folio !=3D swapcache && swapcache)) { + if (unlikely(folio !=3D swapcache)) { folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios which are either - * fully exclusive or fully shared, or new allocated large - * folios which are fully exclusive. If we ever get large - * folios within swapcache here, we have to be careful. + * We currently only expect !anon folios that are fully + * mappable. See the comment after can_swapin_thp above. 
*/ - VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); - VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) !=3D nr_pages, folio); + VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, @@ -5038,12 +5014,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out: - /* Clear the swap cache pin for direct swapin after PTL unlock */ - if (need_clear_cache) { - swapcache_clear(si, entry, nr_pages); - if (waitqueue_active(&swapcache_wq)) - wake_up(&swapcache_wq); - } if (si) put_swap_device(si); return ret; @@ -5051,6 +5021,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out_page: + if (folio_test_swapcache(folio)) + folio_free_swap(folio); folio_unlock(folio); out_release: folio_put(folio); @@ -5058,11 +5030,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_unlock(swapcache); folio_put(swapcache); } - if (need_clear_cache) { - swapcache_clear(si, entry, nr_pages); - if (waitqueue_active(&swapcache_wq)) - wake_up(&swapcache_wq); - } if (si) put_swap_device(si); return ret; diff --git a/mm/swap.h b/mm/swap.h index 0fff92e42cfe..214e7d041030 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -268,6 +268,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t flag, struct mempolicy *mpol, pgoff_t ilx); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf); +struct folio *swapin_folio(swp_entry_t entry, struct folio *folio); void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr); =20 @@ -386,6 +387,11 @@ static inline struct folio *swapin_readahead(swp_entry= _t swp, gfp_t gfp_mask, return NULL; } =20 +static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *= folio) +{ + return 
NULL; +} + static inline void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr) { diff --git a/mm/swap_state.c b/mm/swap_state.c index a8511ce43242..e3c01e5bc978 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -545,6 +545,33 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry= , gfp_t gfp_mask, return result; } =20 +/** + * swapin_folio - swap-in one or multiple entries skipping readahead. + * @entry: starting swap entry to swap in + * @folio: a newly allocated and charged folio + * + * Reads @entry into @folio; @folio will be added to the swap cache. + * If @folio is a large folio, the @entry will be rounded down to align + * with the folio size. + * + * Return: pointer to @folio on success. If @folio is a large folio + * and this raced with another swapin, NULL is returned. Otherwise, if + * another folio was already added to the swap cache, that swap cache + * folio is returned instead. + */ +struct folio *swapin_folio(swp_entry_t entry, struct folio *folio) +{ + struct folio *swapcache; + pgoff_t offset =3D swp_offset(entry); + unsigned long nr_pages =3D folio_nr_pages(folio); + + entry =3D swp_entry(swp_type(entry), round_down(offset, nr_pages)); + swapcache =3D __swap_cache_prepare_and_add(entry, folio, 0, true, false); + if (swapcache =3D=3D folio) + swap_read_folio(folio, NULL); + return swapcache; +} + /* * Locate a page of swap in physical memory, reserving swap cache space * and reading the disk if it is not already cached. 
--=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f181.google.com (mail-pf1-f181.google.com [209.85.210.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D40B7220698 for ; Mon, 24 Nov 2025 19:15:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.181 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011759; cv=none; b=djE9BLyn6zmy1Vf1amIQBfSvKgsQlXl5g1FxoXxswIBeF8/Vk0M6UhIUVTYkI14s95VB6fS3ZVJL85GguSuqXAnNF0Ei9oI9wPX8Wo0Ec9STXOqp4F12dSnB5EK4hcCZlXDf+6lkV5pbS32XMs+z8tqiYRRLQ8c5dZ2fAkGwRh0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011759; c=relaxed/simple; bh=mEECQ/wo8pHlcOj88GTHTL5nEJiE8fkV3P8vt0L/P0Q=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=BZvoQi/+RDfepVO8DvC5sDBEZBduG94tF+TmyqhHcSWKel9S4cTB7Jy4NZfybzfCTHracyvma8mBoAkJ4qBTn/R1sZx/BzIkwDoMSzB1AcN4F0G4+4bDJN5VaFK9ik6RR5O4zl6AK00E1eOX3VfaahxPV+nnKKhONmTUPOKElKY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=RCjZcWNL; arc=none smtp.client-ip=209.85.210.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="RCjZcWNL" Received: by mail-pf1-f181.google.com with SMTP id d2e1a72fcca58-7ba49f92362so2731417b3a.1 for ; Mon, 24 Nov 2025 11:15:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011757; x=1764616557; darn=vger.kernel.org; 
h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=Z74RzNdfSl8VMAwJnh85EBGE2QwwEumphH4Ky//xWjU=; b=RCjZcWNL4XdD26smKffHFwsNyjeqs2W7WAnMNYp4ukNp/2zi40SCN6A8oAJdZI0dk+ V9yGX62w7+qvFtThympnWpXXcwxI5kQQZRr/mEcZZgoQLGQpCdSa9hLz7uiIeckkIkx0 HlAWIV1OQ8/Vlb+LLbOOVxFuzieEcyZszuFr1vFSg3C8AfguXB6HYalrLSNQYLrwFDjE 2OTFvIVpIuS4+FR3AoKk4wVJ66uDillGRVdA+bBqOdmvs8Flf0tFfqqMeHC60mypRTqq DFHb09IU8x9Iv5eeA/jxL6+tk1VolVJBt1/zlHId7Io4IcRa+zTMJTOFeNdGBNxnt9/N OBKg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011757; x=1764616557; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=Z74RzNdfSl8VMAwJnh85EBGE2QwwEumphH4Ky//xWjU=; b=dmMKdwsYUbg7U8E5A1ogZJAZZEtlLMK/kGZgnfRGMva0PKdulv+Ph1hd/Otibyi96J 85LLQwRhMgPI3nIuzIteH3VVgnAc3qgaLDeiMk2HUU9/1giqLqHmBrcshvwMqoJgPpWd qvezoalkRpBHI46BZ2PoqVHkfPRBCc38q+l+Xemq8xMPjfoDpdNqT5AIaabna9GyJgyU Kzlr0rDbwHCRys5seNb4gex7OGueN/Iee2pff88JPy6bOhxg8et0C+WfR254YNr28NCh rPQTbXbURXD86wukiyt0ztI82ENl1zxq2T6ft1tjfuUqxcdfbOIzKj+uMYszoMrAVQ4D X4HA== X-Forwarded-Encrypted: i=1; AJvYcCWkk5RAVlucxntt0ldVPskQFxb7iUEMdo1SXe9xY1XhtaEeRGCf52gZk7NCjhgMn2F15ZzdTZwPuKSSVWM=@vger.kernel.org X-Gm-Message-State: AOJu0Yx3SDcgF/pXE6opzhYN9EjoNv8toGnqcqY3DHquWZs4t91lTrE9 o2fo0btjo1EJCfTflu6GQ3es1mCpYlEPUQbfn42Wg7H4b1cZa1O8mrGu X-Gm-Gg: ASbGncu2wh3JDkDJD43BszT88Oqe/3KX1fMhR4Ry8xQvT60Wyb0XXMs6QMVFCIZv9CV yh7QAtWu8Y4Ajtmye8WMMUi/NbszKB+LTdbmkoKYumX0tEXn3Stc7DsUm5lcMuF0+eyuNJu0Qu8 ziaXFArVUhrXp8hU6ih36LHnakBzfkkqr/rXh0KZwUHtfARGXepSY/RQY/8eIbxzcV3rcAXNVpt bOXh6SZ+GM82jjRUqbahXwVExO0V2n6MKHgCQlmz4mZhE5ZRmta21NM0Dp0CSnpIhUzooAR9lyu I4bakUuWrtC7W5c/zvPuBfyQnpo4fADO37Yk07SrFjF+Ijxh5zqhrhEEQcTFDx7Aj7CfQrdifg/ OpP8YAVaBdpkRGFCHTzWWm6mX2O1fG/4MmZEPBXc6FJ08KD9m+wtlBe6r3b97NSzYlTipkpJsEm 
+ruBhk+yHNgaW0rotirnflXfsa+yCe7DZkhrmuYW258D+ZPsr2 X-Google-Smtp-Source: AGHT+IGePS0KpOixf04O0GPHrPDOTyvQ+JK5iIOfL4RCRH+wqcyx+HJrz5BODSKqd55md+TCGGnUFg== X-Received: by 2002:a05:6a21:339a:b0:35e:86c3:af0a with SMTP id adf61e73a8af0-3613e56c0femr16023692637.22.1764011756938; Mon, 24 Nov 2025 11:15:56 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.15.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:15:56 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:47 +0800 Subject: [PATCH v3 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-4-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=3125; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=CI2m/8pPC1xTUXTV6AisxBkS+TKo9XryUKDABctQipg=; b=1fBW1wrl1FdUcEqOwfoSs2tA5fBJlSZoQd/l5dSSLoz+ti6Fy2iBwItbC578HMHFO77guf/7B wYVXFaNjKC+Bd4iNACTNobyn7D0+ubSqDpVeQl9ZX01YuoM5frK6QXW X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Now SWP_SYNCHRONOUS_IO devices are also using swap cache. 
One side effect is that a folio may stay in the swap cache for a longer time due to lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios are being swapped out very frequently right after swapin, hence improving the performance. But the long pinning of swap slots also increases the fragmentation rate of the swap device significantly. Currently, all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also causes the backing memory to be pinned, increasing the memory pressure. So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices after swapin finishes. By then the swap cache has served its role as a synchronization layer to prevent any parallel swapin from wasting CPU or memory allocation, and the redundant IO is not a major concern for SWP_SYNCHRONOUS_IO devices. It is worth noting that without this patch, this series so far can provide a ~30% performance gain for certain workloads like MySQL or kernel compilation, but causes significant regression or OOM under extreme global pressure. With this patch, we still have a nice performance gain for most workloads without introducing any observable regressions. This is a hint that further optimization can be done based on the new unified swapin with swap cache, but for now, just keep the behaviour consistent with before. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 41b690eb8c00..9fb2032772f2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4354,12 +4354,26 @@ static vm_fault_t remove_device_exclusive_entry(str= uct vm_fault *vmf) return 0; } =20 -static inline bool should_try_to_free_swap(struct folio *folio, +/* + * Check if we should call folio_free_swap to free the swap cache. + * folio_free_swap only frees the swap cache to release the slot if swap + * count is zero, so we don't need to check the swap count here. 
+ */ +static inline bool should_try_to_free_swap(struct swap_info_struct *si, + struct folio *folio, struct vm_area_struct *vma, unsigned int fault_flags) { if (!folio_test_swapcache(folio)) return false; + /* + * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap + * cache can help save some IO or memory overhead, but these devices + * are fast, and meanwhile, swap cache pinning the slot deferring the + * release of metadata or fragmentation is a more critical issue. + */ + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) + return true; if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) || folio_test_mlocked(folio)) return true; @@ -4931,7 +4945,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * yet. */ swap_free_nr(entry, nr_pages); - if (should_try_to_free_swap(folio, vma, vmf->flags)) + if (should_try_to_free_swap(si, folio, vma, vmf->flags)) folio_free_swap(folio); =20 add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pl1-f179.google.com (mail-pl1-f179.google.com [209.85.214.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A96542DA749 for ; Mon, 24 Nov 2025 19:16:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011764; cv=none; b=bphNJ2IRsRAxWEiZKO53v7tF7kRebQMFk+KrkXtmxGwhGIQ04JvV6/2gMcLUFbGDMYDFE/T862Ma2LEMy9txgi8gRN2XXCOFSHMEEtIS1LxUAt0pCLC3qDnpYx+tkvShR+fkT0u+ldVmWZl2AwNHKebW6U/20FFhHRnHXMsoOOA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011764; c=relaxed/simple; bh=KMzyXdRK1sUzmmUco0/Q9JRq8HWOzI6kvSLBoif3w94=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; 
b=RUgSADtjB2BVqJH/SdZjIJn4Oy7xA++aOpWmziDQ8azX8rgFlpZWR/iG2A/leJOY13i6yW6jlzksuLEFNo1bI2UO8tJTTEwg0BoMrkNZCyIZ1EJqR0SSpGcLMBjc15rd+N6Y6fJrcSEsBabblKvWE3I+yPx6WsEp3aRuFJWUnSg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=ZCUCVDx0; arc=none smtp.client-ip=209.85.214.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ZCUCVDx0" Received: by mail-pl1-f179.google.com with SMTP id d9443c01a7336-29853ec5b8cso56914885ad.3 for ; Mon, 24 Nov 2025 11:16:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011762; x=1764616562; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=uX9WKXwPMUhDf2nsEW9hz+YtSdKaTQoicE44/0zqUlg=; b=ZCUCVDx0gIZVu6TcSiYJBufGDJsWfIzLT6ceM7Yb7JZf0htE1QEgPrFLld7NtqovlT fGAmlWIPIQLFaJrcL1VoCapXekrj8XraOdhzNxlvqJx3uE+0w0AIfZjys/ia+YRU5aI4 h/JtXmpTPPbFiEodgBjHk0vRytHpvdsT2rsP7AjFUIo/2DjS7VgQIwsXNRO3MRZu+Hh7 qTDRHGrKUQJcK4eatH4A3C3ZbDoUqL4BP8m3mOI3M2oD3GxRWjECAW/iNVnh76pg5h6V SHUIWFXPXq2hGUqm4vV0U2nq/vSazjMb4EVGQ+55lGTmy/FH4bUwnfTc1ikSGlX7ytwd tr9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011762; x=1764616562; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=uX9WKXwPMUhDf2nsEW9hz+YtSdKaTQoicE44/0zqUlg=; b=tbwNuKcYIibewDFZKKKB4I2Qwdpt2/+sxCodLCC6wiuAMIuTPbG5oVLM6S0uchlrIZ 
Qystr16/kCBpyE+GhxmlPFay/OwoLKAoWY1oEXRkGmRTDwCjYWWm29ikHx+4r11fx4fw O1zFCGQb0ynghrGsRYR/lZBKyZ+7v/PgsRqBatzrF1HEFuhlwjiUY7QbdzFvRvOWGfWS zDzBRSf2M/zCvGzsiBOtlsZn3VGq6bo0GTwmkzJu92tv8f9FQ05d3VIUT7RwCGXEBwP6 uWTdNtG5bML0h5NedHCCnVohHIpKxbWKM0Ul8swrZJG9BdWXHaEVYRPw/AfoZ/+UbYKh o/+Q== X-Forwarded-Encrypted: i=1; AJvYcCUaN6BfHUh/wcjIHcGyV84Y+g+342eZuofaxjP9VVFNBeTrfBsq/j/sL5cyTHif1kkKKJwDrwgJEVES9Is=@vger.kernel.org X-Gm-Message-State: AOJu0YzpIjsu+lwoL6e7BY/NT2rAUH4xG0inx60X9mj7g6aypZaX/MCr pRFNiP4/BrcYFxwFtIvWI3hl8u9MpJj9DWVh5SYBspJ0T1OZrFf42UI/ X-Gm-Gg: ASbGnctLoUHr2UpzOzJEIiK0Y1knkIv02iLHMYlwC0KJ71GQJZ7awUKGaGjfix4FUWX oWog/4uS0W2aAyAHmNxjixNY1T+52+MxbcNHCoSEFdpDzqZZTguKnUpmJDcKfSaPYODlxFjxUfm xQGAqFjxvDH4tAc34BFAi1C7XJQnWldzcvjpCDjpSARH8PcoH6tsTtDWik3zuCib8op0Bz+Ikkk zeXVUDJhwkvJsGai8wp/gxmOSKZ1E/+z0l3qr9ATW6xuXc/NNWGnDrziPuvR7zM+zvIZSqWSrxs B34bLxcUwOxCc2hF6X535Y6V5juRUgWJbX+9NePMc+hRNzhaVl7ng2jP1PQ2XduLZ/1gIOHjnEl 8Xrb4R1FLd2u5vxsg1fD6OE4jep+KIZKOm/uDcjUC0PPvlccPSo9tgp+5RbjzCZ+u+/ZnSh5NnD 3TayASNR1gATeZOcz5xxjWpf6yWEN3h8ZjyVLON/vln7IFGz7n X-Google-Smtp-Source: AGHT+IGXvjF9/EgUklxEzQMKZ0EM1T6fsSlpBa15IzV2WahRfwiKSOTO8fwKtjP/5+JY2RbSjWFmqQ== X-Received: by 2002:a17:902:ccc4:b0:295:54cd:d2dc with SMTP id d9443c01a7336-29b6c4047efmr127444105ad.16.1764011761818; Mon, 24 Nov 2025 11:16:01 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.15.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:01 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:48 +0800 Subject: [PATCH v3 05/19] mm, swap: simplify the code and reduce indention Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-5-33f54f707a5c@tencent.com> References: 
<20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=4358; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=7VD9T4L1mbbNr1WJIIQ0GVUukfOHNZ1e5OxrzJG+X+g=; b=3D4v+AHG9s/wpBznpCDKMKaj1xP0A5A9yd1yAPXPCdUuinASQIZYf+USPDHHk9oX1eM8vgTDp oKHMS9i28MkCiqWOdiw0b9Zlmvf5RPt/9oH8ixBpJyKiqLhNlQxOXEh X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Now that the swap cache is always used, the multiple swap cache checks are no longer useful. Remove them and reduce the code indentation. No behavior change. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 89 +++++++++++++++++++++++++++++----------------------------= ---- 1 file changed, 43 insertions(+), 46 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 9fb2032772f2..3f707275d540 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4764,55 +4764,52 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_release; =20 page =3D folio_file_page(folio, swp_offset(entry)); - if (swapcache) { - /* - * Make sure folio_free_swap() or swapoff did not release the - * swapcache from under us. The page pin, and pte_same test - * below, are not enough to exclude that. Even if it is still - * swapcache, we need to check that the page's swap has not - * changed. 
- */ - if (unlikely(!folio_matches_swap_entry(folio, entry))) - goto out_page; - - if (unlikely(PageHWPoison(page))) { - /* - * hwpoisoned dirty swapcache pages are kept for killing - * owner processes (which may be unknown at hwpoison time) - */ - ret =3D VM_FAULT_HWPOISON; - goto out_page; - } - - /* - * KSM sometimes has to copy on read faults, for example, if - * folio->index of non-ksm folios would be nonlinear inside the - * anon VMA -- the ksm flag is lost on actual swapout. - */ - folio =3D ksm_might_need_to_copy(folio, vma, vmf->address); - if (unlikely(!folio)) { - ret =3D VM_FAULT_OOM; - folio =3D swapcache; - goto out_page; - } else if (unlikely(folio =3D=3D ERR_PTR(-EHWPOISON))) { - ret =3D VM_FAULT_HWPOISON; - folio =3D swapcache; - goto out_page; - } - if (folio !=3D swapcache) - page =3D folio_page(folio, 0); + /* + * Make sure folio_free_swap() or swapoff did not release the + * swapcache from under us. The page pin, and pte_same test + * below, are not enough to exclude that. Even if it is still + * swapcache, we need to check that the page's swap has not + * changed. + */ + if (unlikely(!folio_matches_swap_entry(folio, entry))) + goto out_page; =20 + if (unlikely(PageHWPoison(page))) { /* - * If we want to map a page that's in the swapcache writable, we - * have to detect via the refcount if we're really the exclusive - * owner. Try removing the extra reference from the local LRU - * caches if required. + * hwpoisoned dirty swapcache pages are kept for killing + * owner processes (which may be unknown at hwpoison time) */ - if ((vmf->flags & FAULT_FLAG_WRITE) && folio =3D=3D swapcache && - !folio_test_ksm(folio) && !folio_test_lru(folio)) - lru_add_drain(); + ret =3D VM_FAULT_HWPOISON; + goto out_page; } =20 + /* + * KSM sometimes has to copy on read faults, for example, if + * folio->index of non-ksm folios would be nonlinear inside the + * anon VMA -- the ksm flag is lost on actual swapout. 
+ */ + folio =3D ksm_might_need_to_copy(folio, vma, vmf->address); + if (unlikely(!folio)) { + ret =3D VM_FAULT_OOM; + folio =3D swapcache; + goto out_page; + } else if (unlikely(folio =3D=3D ERR_PTR(-EHWPOISON))) { + ret =3D VM_FAULT_HWPOISON; + folio =3D swapcache; + goto out_page; + } else if (folio !=3D swapcache) + page =3D folio_page(folio, 0); + + /* + * If we want to map a page that's in the swapcache writable, we + * have to detect via the refcount if we're really the exclusive + * owner. Try removing the extra reference from the local LRU + * caches if required. + */ + if ((vmf->flags & FAULT_FLAG_WRITE) && + !folio_test_ksm(folio) && !folio_test_lru(folio)) + lru_add_drain(); + folio_throttle_swaprate(folio, GFP_KERNEL); =20 /* @@ -5002,7 +4999,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) pte, pte, nr_pages); =20 folio_unlock(folio); - if (folio !=3D swapcache && swapcache) { + if (unlikely(folio !=3D swapcache)) { /* * Hold the lock to avoid the swap entry to be reused * until we take the PT lock for the pte_same() check @@ -5040,7 +5037,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_unlock(folio); out_release: folio_put(folio); - if (folio !=3D swapcache && swapcache) { + if (folio !=3D swapcache) { folio_unlock(swapcache); folio_put(swapcache); } --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f175.google.com (mail-pf1-f175.google.com [209.85.210.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B206F21ADCB for ; Mon, 24 Nov 2025 19:16:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.175 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011769; cv=none; b=JtdVL3Vw+wPVW4DPq12/ok28LL4msoBUES4fRH3zLMO2xSInVLr2n+azv52aAVzqsD/7lW4AEcAq2I1ofmdJhBVV2VE5tZ7/+fzvzj/JiqbkvY0KeflMK83z1+cHW53Meokh3XJw9CQ+O9OxYif7bpnTE9Enoy+p6VOxG4fdla8= 
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011769; c=relaxed/simple; bh=oai4UBxh3GEyx6xhRb3I2QqT99Mr+f1uES5spQpBFNY=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=pA0vKkZ/BqXDhXyt0+5lgzHHzIoE2cyDB3+X839tn1gQ3+waE34tpS2dmk81uxL4o5/hwX1Ncxq89aR0JCuIHMUxOZIj8GKzcUudKKWOxLAoSivopVkWBmgByYtzkjNKjtZ0mLLU4VfVIZXapVZE0CCmKe31d+1Op+6ovf8wvAE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=dmofi+OB; arc=none smtp.client-ip=209.85.210.175 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="dmofi+OB" Received: by mail-pf1-f175.google.com with SMTP id d2e1a72fcca58-7b9a98b751eso3659186b3a.1 for ; Mon, 24 Nov 2025 11:16:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011767; x=1764616567; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=m3/znPoavjBgqqjpUZfnq7wwOL79gjtKqNHBOh+8lvo=; b=dmofi+OBJUhgieIqTog0A5At0Qc8ZjMV38oVNImpYH9KulRVgd324j+FjuIrz45PH8 OqxkZXGCfrO1BjS8UPm1eh+q3dZPPuUOm0ZqqcvkXOIjBI/jyKxljZw3M+NxdEN0lmU+ HWZdSPuxCdkRQH2J4CZuxUAY0jhHCJ7jWHwovmwQdm+A7OJ0wfwRdFpKEvZLkEJlpp7a EkBf5lLZBgdRfKdM5MLuQrcgpW5WD/XmgpUFrm1k1CAbBua9RJhRUNErbE7uc+lVnvHL 9ZNzp+fNwomojVrWNhYftYkvkvGA6fRctf8OONBMRZlrNg1SERQMUZDMWyNLjIXbHiUK cL3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011767; x=1764616567; 
h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=m3/znPoavjBgqqjpUZfnq7wwOL79gjtKqNHBOh+8lvo=; b=SVpHDKFekMZuNSZVKyE+gjsnOc79G3+rK6XMqbiivlsEuvi/1XWQ58vQeA7gajElRc hNTNERLUtdxdni7Fyiv/SD14mnf3Y0Iz1R1wDtv9NnplqANw0y1VB1S5kqOQh18qEpyH 5XoJuJeWu7FP0rP1sgc0PHAik4pNUr6yqd7JHB3DujJaK2t0W6/hn1diXmLwyQfNH2zN LMtoKoINLQiBa23GjL6sj0r5NzUAXipHoT6Lx1dbajD7Wsol5M95XaK3k4c6iy09BrEM kjfCRB6W01E3qBmNk2e/JEO7qr19SQZ+u8GXLXOdx+sOzd4DoVIeatb80FO1BjxSMiex ajuQ== X-Forwarded-Encrypted: i=1; AJvYcCXrcRIQpmfUk+BSojEhWsV2p5+lkLEhuMA5Wnel0YS8kLIt0n4CNwZmbrM2ulZkR6vgbtXKPinBfitTFsA=@vger.kernel.org X-Gm-Message-State: AOJu0YyHAt4+uu9cvoTe1nqf7/ntUusp5GzjJlULiYdyxZeg/+BjuX+1 1C4sXrCrr/zlTvPKgaidSHDtvyig0iAjzdtVe9LEEmh50qwjBZqbjhCh X-Gm-Gg: ASbGnctfp8WENat4TezJTD2pxYTNk/ND9CV54+WfWVKhOAvKM4AeV9m5xDclOGwbzwT DU9UPaX4trd10ZGdxKCsmQuTISexrhyjk6cwcK5Xne4WmTqvv0DCwRGOwSU+qejzjopawWJJG44 BR3N9UMUpVIyq3f1CVYPhpdd+kdrSpde9nGbeKuoH1s5ekUfV7cex8N7T80RBk862cJMjmt9ihI eVC34W7iaagWXi+sXuPefiSJaUsLSbpLvaFihyAPGuQ0w9L3HcY6S6bfbm0VJJJ1GFH1DWyINHB NN01eQF+KTRxctvWxR3gzBQnE1Q17080gW8KEC3I/be11Uf/k6XazgzG2C+BxfME1AR+MSTHZeW KctJphv4CBCd6XYHpr77E/fNC+8uqoyEOkPeyUB4uHtCDkKmvZMpFl7WtheFKSrlUbX+fSY8Zbk 8kRFsq0/0A2sZvlYG6jfsDAV76pVBRQG46JWAq35g4CIZFruLjFrmakC0dK40= X-Google-Smtp-Source: AGHT+IEdNvv63j9gRZ+PlWPsm6yqUdrO7ifaRQFcoPAtjFV84g74FKwIZFNPZKCySdCzjMYIl0hjdQ== X-Received: by 2002:a05:6a21:6d89:b0:352:3695:fa64 with SMTP id adf61e73a8af0-3614eddecf9mr14535414637.37.1764011766767; Mon, 24 Nov 2025 11:16:06 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:06 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:49 +0800 Subject: [PATCH v3 06/19] mm, swap: free the swap cache after folio is mapped 
Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-6-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=2699; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=7ANx1B1sM45AeElSDuuzzNPN1Tv7606509lOkJkHgRE=; b=1GahXSVrzSIb69EYYrLaQ5o8p7Eq2dNT0v+CUr1H6Miwk462u5EDamd0NV7lZ8C8w3T5qDMEE /6bdy6IuS0kAgr7/KGczPaDcuKfPjZUGRdmiZxAdD6BLnyVFLNeLqup X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song To reduce repeated faults due to parallel swapins of the same PTE, remove the folio from the swap cache after it is mapped. So new faults from the swap PTE will be much more likely to see the folio in the swap cache and wait on it. This does not eliminate all swapin races: an ongoing swapin fault may still see an empty swap cache. That's harmless, as the PTE is changed before the swap cache is cleared, so it will just return and not trigger any repeated faults. 
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 3f707275d540..ce9f56f77ae5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struc= t vm_fault *vmf) static inline bool should_try_to_free_swap(struct swap_info_struct *si, struct folio *folio, struct vm_area_struct *vma, + unsigned int extra_refs, unsigned int fault_flags) { if (!folio_test_swapcache(folio)) @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swa= p_info_struct *si, * reference only in case it's likely that we'll be the exclusive user. */ return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) && - folio_ref_count(folio) =3D=3D (1 + folio_nr_pages(folio)); + folio_ref_count(folio) =3D=3D (extra_refs + folio_nr_pages(folio)); } =20 static vm_fault_t pte_marker_clear(struct vm_fault *vmf) @@ -4936,15 +4937,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ arch_swap_restore(folio_swap(entry, folio), folio); =20 - /* - * Remove the swap entry and conditionally try to free up the swapcache. - * We're already holding a reference on the page but haven't mapped it - * yet. - */ - swap_free_nr(entry, nr_pages); - if (should_try_to_free_swap(si, folio, vma, vmf->flags)) - folio_free_swap(folio); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); pte =3D mk_pte(page, vma->vm_page_prot); @@ -4998,6 +4990,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) arch_do_swap_page_nr(vma->vm_mm, vma, address, pte, pte, nr_pages); =20 + /* + * Remove the swap entry and conditionally try to free up the swapcache. + * Do it after mapping, so raced page faults will likely see the folio + * in swap cache and wait on the folio lock. 
+ */ + swap_free_nr(entry, nr_pages); + if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags)) + folio_free_swap(folio); + folio_unlock(folio); if (unlikely(folio !=3D swapcache)) { /* --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pl1-f170.google.com (mail-pl1-f170.google.com [209.85.214.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id ADBC123EAAB for ; Mon, 24 Nov 2025 19:16:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011776; cv=none; b=Pks7BRqwB2/4q5wvIP71Cjz3ssI2wQEWzUeXMzjm3RAgY7zcIqVp75rn/jsNX9C9GcWApF1ERttlgPAcIjgWGAatCzlgDQLaM/ak80rPUP5/+5mNgVCkmo30uxoobNN1T4TK40L3aATxNIK1tof3u50NF27TbLsTjQBi5nI8sts= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011776; c=relaxed/simple; bh=BCyM/sRpXUizVKA+fshJhSXkN4lhT/BtafffoT97yuY=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=eOjySuOPKGDQ0vMWNkQz//E941ntZuE8VUNgBC8J5AqG1wPWQ+ihvcBhNo3CW3p/WVDvdKKHHOyUApTQ4gRUK90A1WcI64doBZPBpKyde03HCkzYZbTGgVxM2opIK5kGcy2tSsD4abVSnlwdrpbrTDupKHpfoQXbKIQ2Sj8D+Xo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=ZC+scpTZ; arc=none smtp.client-ip=209.85.214.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ZC+scpTZ" Received: by mail-pl1-f170.google.com with SMTP id d9443c01a7336-29845b06dd2so58186105ad.2 
for ; Mon, 24 Nov 2025 11:16:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011772; x=1764616572; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=obraih8eLf0OOjYF+WhdrywBlHz95tFwt0h0OBRMkZ8=; b=ZC+scpTZNlZHjqIisD4+mmjn9DU6zyMkjpUbaW24ynnbA0bPtNKNDcwjgZml6DE9eJ J74hUce4sqPQ/KrjJVqTDvhjTX/on9Y+q5Nmd3No6G+k+ItA2CRn8MheGBUBBjxFi+o7 vk3tpC6ztwSE3CB3zIUclROWZr9/JFjFebP9oxiwB7Mu3ZAjYcvLopqWkp5Cmtwe6dGO 6bRh4u8k/23Kp9pZnjZHeOdl0H9urThImTzJetCJAL3yajh1KzjoJPLARX384Ok7q2Fb Me/iQINal5GorRMgHGep4LvxYCFcNU2rVQwk28Ltn3OlgDOjBPtu+9nyNo4Wp84ZCYCX +moQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011772; x=1764616572; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=obraih8eLf0OOjYF+WhdrywBlHz95tFwt0h0OBRMkZ8=; b=O8qEsHvjdeKimm76pC8h508Iz7fneojwJztVysAiCqv+saICsYZysIKtueOI8JWB4u NeoMYGJVasoKmhK9DuUUScDSXGbQcobyuBy3QKTpfHUjBh/+EspIZ/7Q6AK17iwkfNG6 fVtGdQz5Dsdk6T0fOx7aPT3/B8pC2/Prm+HcSDQvXOrFJuGQzf+iU7nNG5zmHoJgvvMs XHkc16ppkm3Mxn5iz9iCSOc9UnvCTyy9cVWX8X/clyGcry9DTUZmbOUns43nRpXLjsEf mNE2/15yDwJJebh/LDLuFVgFHw+7r0dsuVaqzbK2fuaCAeO8G9AFtUbG1E62CDHXqQLG LLYg== X-Forwarded-Encrypted: i=1; AJvYcCVcTvozuw+PcWeAw+0MFBM+Ir3u6Wb8X9FUJZwJzwJp4POa8BgGoQa5OAS+F9ntq0qldd/LEld9hUk/88U=@vger.kernel.org X-Gm-Message-State: AOJu0YzUamAti6lI7kLZXXUTYvNir/dVNzwkYlbTeCqn1fekKjG5GL+a rsHgJhBtHND/M56SMZHZ6pUVUlLL/obeXmOQKCc7sN2WE6SCJvNCsFJS X-Gm-Gg: ASbGnctlzJ+BEffjwDc8ZHBxOLFeQdUGMRiEv2rfzn6kEmm6IuknBQMQgGk7A8LXvvm JINLyLYBXsREGDLQ8QYKN/rBymd8E5CuHLOmKKx4BLBwXS8YEzTyMAQHQMxFznnHvfghZ6xj5LC 840pGbsleXFj8gczB61e1Z8iZ/JzROPbHLA2PK2OKsSJIuEnrggonjCMnZG1GGEmWnDRWcxxzD3 EeB3YR9bRiCr/H2kiLikz0KWWt35BTYN1Xe7DsjUTCI+ex6apE6HAYTVzpnEg5+ZNVeVc5dhYMq 
QXRK5N7XwtLzG/KWhnjzhRBaZjPLkiRk90gVphcY9YK24StW4irTV4mz/iTPbP/5vYg64J3lavf QRFbz3Fjx+q7rksCIuV+BeQZYcz+mIDtm9hJujf9wxOJViwPiMYdpGoRxuF22MaWoZPOxPX6jo1 KDQ+QZ9oJVA8c3CJVb3w7GjejvHKr1ywQcYAMqfxeIBa1usKNm X-Google-Smtp-Source: AGHT+IGe78PocrlSrCBER8KtPpbn93UgwXznFo2T6cbCpPDjFbaqSNvsNNClyif7SLSaUJ9xOH6mGA== X-Received: by 2002:a17:903:2f8d:b0:298:52a9:31d4 with SMTP id d9443c01a7336-29b6c6b87c7mr163646585ad.54.1764011771769; Mon, 24 Nov 2025 11:16:11 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:11 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:50 +0800 Subject: [PATCH v3 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-7-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=7744; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=mFJMctr8NWLbKAoBn+MFMi16Ti5CS/gXXQ3s4bOVo4E=; b=N17pfrXCrtJLWjiS9vI4Ie7s945niM4NnEl9K/tyR45NMLgbtHmGvPybvJANfb0UFKeSi3Dw3 r2NC3i5ugLwCmwUc7hkTIJ9Nv8FQo2FjRze+wmW33V8R+Whs4ivMfnz X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= 
From: Kairui Song Now the overhead of the swap cache is trivial to none, bypassing the swap cache is no longer a valid optimization. We have removed the cache bypass swapin for anon memory, now do the same for shmem. Many helpers and functions can be dropped now. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/shmem.c | 65 +++++++++++++++++--------------------------------------= ---- mm/swap.h | 4 ---- mm/swapfile.c | 35 +++++++++----------------------- 3 files changed, 27 insertions(+), 77 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index ad18172ff831..d08248fd67ff 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2001,10 +2001,9 @@ static struct folio *shmem_swap_alloc_folio(struct i= node *inode, swp_entry_t entry, int order, gfp_t gfp) { struct shmem_inode_info *info =3D SHMEM_I(inode); + struct folio *new, *swapcache; int nr_pages =3D 1 << order; - struct folio *new; gfp_t alloc_gfp; - void *shadow; =20 /* * We have arrived here because our zones are constrained, so don't @@ -2044,34 +2043,19 @@ static struct folio *shmem_swap_alloc_folio(struct = inode *inode, goto fallback; } =20 - /* - * Prevent parallel swapin from proceeding with the swap cache flag. - * - * Of course there is another possible concurrent scenario as well, - * that is to say, the swap cache flag of a large folio has already - * been set by swapcache_prepare(), while another thread may have - * already split the large swap entry stored in the shmem mapping. - * In this case, shmem_add_to_page_cache() will help identify the - * concurrent swapin and return -EEXIST. - */ - if (swapcache_prepare(entry, nr_pages)) { + swapcache =3D swapin_folio(entry, new); + if (swapcache !=3D new) { folio_put(new); - new =3D ERR_PTR(-EEXIST); - /* Try smaller folio to avoid cache conflict */ - goto fallback; + if (!swapcache) { + /* + * The new folio is charged already, swapin can + * only fail due to another raced swapin. 
+ */ + new =3D ERR_PTR(-EEXIST); + goto fallback; + } } - - __folio_set_locked(new); - __folio_set_swapbacked(new); - new->swap =3D entry; - - memcg1_swapin(entry, nr_pages); - shadow =3D swap_cache_get_shadow(entry); - if (shadow) - workingset_refault(new, shadow); - folio_add_lru(new); - swap_read_folio(new, NULL); - return new; + return swapcache; fallback: /* Order 0 swapin failed, nothing to fallback to, abort */ if (!order) @@ -2161,8 +2145,7 @@ static int shmem_replace_folio(struct folio **foliop,= gfp_t gfp, } =20 static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t inde= x, - struct folio *folio, swp_entry_t swap, - bool skip_swapcache) + struct folio *folio, swp_entry_t swap) { struct address_space *mapping =3D inode->i_mapping; swp_entry_t swapin_error; @@ -2178,8 +2161,7 @@ static void shmem_set_folio_swapin_error(struct inode= *inode, pgoff_t index, =20 nr_pages =3D folio_nr_pages(folio); folio_wait_writeback(folio); - if (!skip_swapcache) - swap_cache_del_folio(folio); + swap_cache_del_folio(folio); /* * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks) @@ -2279,7 +2261,6 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, softleaf_t index_entry; struct swap_info_struct *si; struct folio *folio =3D NULL; - bool skip_swapcache =3D false; int error, nr_pages, order; pgoff_t offset; =20 @@ -2322,7 +2303,6 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, folio =3D NULL; goto failed; } - skip_swapcache =3D true; } else { /* Cached swapin only supports order 0 folio */ folio =3D shmem_swapin_cluster(swap, gfp, info, index); @@ -2378,9 +2358,8 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, * and swap cache folios are never partially freed. 
*/ folio_lock(folio); - if ((!skip_swapcache && !folio_test_swapcache(folio)) || - shmem_confirm_swap(mapping, index, swap) < 0 || - folio->swap.val !=3D swap.val) { + if (!folio_matches_swap_entry(folio, swap) || + shmem_confirm_swap(mapping, index, swap) < 0) { error =3D -EEXIST; goto unlock; } @@ -2412,12 +2391,7 @@ static int shmem_swapin_folio(struct inode *inode, p= goff_t index, if (sgp =3D=3D SGP_WRITE) folio_mark_accessed(folio); =20 - if (skip_swapcache) { - folio->swap.val =3D 0; - swapcache_clear(si, swap, nr_pages); - } else { - swap_cache_del_folio(folio); - } + swap_cache_del_folio(folio); folio_mark_dirty(folio); swap_free_nr(swap, nr_pages); put_swap_device(si); @@ -2428,14 +2402,11 @@ static int shmem_swapin_folio(struct inode *inode, = pgoff_t index, if (shmem_confirm_swap(mapping, index, swap) < 0) error =3D -EEXIST; if (error =3D=3D -EIO) - shmem_set_folio_swapin_error(inode, index, folio, swap, - skip_swapcache); + shmem_set_folio_swapin_error(inode, index, folio, swap); unlock: if (folio) folio_unlock(folio); failed_nolock: - if (skip_swapcache) - swapcache_clear(si, folio->swap, folio_nr_pages(folio)); if (folio) folio_put(folio); put_swap_device(si); diff --git a/mm/swap.h b/mm/swap.h index 214e7d041030..e0f05babe13a 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -403,10 +403,6 @@ static inline int swap_writeout(struct folio *folio, return 0; } =20 -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_= t entry, int nr) -{ -} - static inline struct folio *swap_cache_get_folio(swp_entry_t entry) { return NULL; diff --git a/mm/swapfile.c b/mm/swapfile.c index ee6bb37ab174..5853db044031 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1610,22 +1610,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t= entry) return NULL; } =20 -static void swap_entries_put_cache(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - unsigned long offset =3D swp_offset(entry); - struct swap_cluster_info *ci; - - ci =3D 
swap_cluster_lock(si, offset); - if (swap_only_has_cache(si, offset, nr)) { - swap_entries_free(si, ci, entry, nr); - } else { - for (int i =3D 0; i < nr; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); - } - swap_cluster_unlock(ci); -} - static bool swap_entries_put_map(struct swap_info_struct *si, swp_entry_t entry, int nr) { @@ -1761,13 +1745,21 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) void put_swap_folio(struct folio *folio, swp_entry_t entry) { struct swap_info_struct *si; + struct swap_cluster_info *ci; + unsigned long offset =3D swp_offset(entry); int size =3D 1 << swap_entry_order(folio_order(folio)); =20 si =3D _swap_info_get(entry); if (!si) return; =20 - swap_entries_put_cache(si, entry, size); + ci =3D swap_cluster_lock(si, offset); + if (swap_only_has_cache(si, offset, size)) + swap_entries_free(si, ci, entry, size); + else + for (int i =3D 0; i < size; i++, entry.val++) + swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + swap_cluster_unlock(ci); } =20 int __swap_count(swp_entry_t entry) @@ -3780,15 +3772,6 @@ int swapcache_prepare(swp_entry_t entry, int nr) return __swap_duplicate(entry, SWAP_HAS_CACHE, nr); } =20 -/* - * Caller should ensure entries belong to the same folio so - * the entries won't span cross cluster boundary. 
- */ -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int n= r) -{ - swap_entries_put_cache(si, entry, nr); -} - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entr= y's --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f179.google.com (mail-pf1-f179.google.com [209.85.210.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E607D2D7805 for ; Mon, 24 Nov 2025 19:16:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011781; cv=none; b=Sqi88qmQURcDBfRd6N2Tck/TFRH65bYF2TDU2epL09V1Ru2xqAmdbLgptCZNIMnVDF0hMS3q4iAcOwFlHs2Q1oFwSmbiAhE555IusaDRj9L+0jSS5B0XMv1ZPkvaQmXaJX2DuR1bOYvex85w/eBroKDEtZoKUUl3GO7802CqkrU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011781; c=relaxed/simple; bh=uZu/oK8HF6f9aElwNggI6FXrGUebt3fNoU+h3a7hMx4=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=FGLGLXbdij5C8SCtHlc0hgKbDliA++NtU3tNvHbmIcPIRECGHrH4F83nE+nEsU1lF/9YRRIKWo0sZmGXG6L9E2I/+AHQjDAQFyoA0jEJSJlXHwoMXMcZiMfFqaKPNSEja4/clI442UXDXZVBiu6hOj0lHcP83rZ2B1RlHL8lZI8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=LioHTHqy; arc=none smtp.client-ip=209.85.210.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="LioHTHqy" 
Received: by mail-pf1-f179.google.com with SMTP id d2e1a72fcca58-7bb3092e4d7so4988313b3a.0 for ; Mon, 24 Nov 2025 11:16:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011777; x=1764616577; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=GpNi8FjhSIttEAU4/SYztJU0s3eplrF9RCrsYzGaLWo=; b=LioHTHqyGoTGmRShqOtBD4sJkVz1pBLaOn+Z8haaNyg/e5siGwBZfNcwiewRrS/uqw t7LkOXjpip9boVVr6wgsnReQoU2UIzQPmAXaGbvbOE1TzDeL1xO6kZaH6VPn0oZ8nlJo OvIjD57hNqCjYnToTB+AI7lESVXObehG+B3Us3qiQUR+uW7BWegMh9IcncqA1Y7xjCro 7XgNY0hd+M8hJerJIPc7X7YF1s/M13XjppCpV4gsK90EEgB7TdcAx2oOqW5ujZuJp51L Xb4DSfo3mmKLag0hMkJmTanNnliaDp2hC0G70hECEc/MG8+Mrp7j4aTftTiD6nO+/OL6 Z2UQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011777; x=1764616577; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=GpNi8FjhSIttEAU4/SYztJU0s3eplrF9RCrsYzGaLWo=; b=BC9ls+jkv1uu4egqSMfOaR7VS+L76RfImHTk7Z092sEXDVv5uJqVfGknyRaYcRqUM+ KbFgDB68gGeUCXekCmFIlCVtQtGpWKXJEQ9dym4QeaSjQ1qpLV76WFqwngGbV2lS9VEM PZZeoZH0MZnAcqO/EqxfwshRCXzwdhERplMyigUTi9tK3FRLEOSIo/mBRO6wghQqFk6+ qz5AuQdLT9v+Qcqcl/uWTkulh7aLnKJzdf/BwC3J50ySkwA94OVTbCOF8w2rCx1CmrlS 81HU83utmkZhKJYnmO3Y3QjKufdIsLmeG12YyABGyLIcUwpQp2NQF14BIYOiTCm8XXAv VSdQ== X-Forwarded-Encrypted: i=1; AJvYcCX4gFQbw18AmrmGU49SUFffk3iYaAt0Mm6HRtLI1ucf8VpPkf9sPiWOaluNalEBwM2P/Z8NuvbcJNhsj1U=@vger.kernel.org X-Gm-Message-State: AOJu0YxzMCNJ7I9SdVM6b3TDoboiurbuH1GPCbXxKHNrI3upvcPTVKSY PnVFCECqyfszI5/eKimVi4pt5Wbt0wTrCpa91H9bXn04dTvO0zlrZjyL X-Gm-Gg: ASbGnctpxOBhFSjzFxqFapFIQ35gmObu/CwsFFw302P7C2WFQwLEG6dDANly4Vw081+ EOivRJ2/+IxVfANP/3ordot7PT9FTOT7V0q63xJAkKA7QC7pGnNvlhweibrNw0UkaM7qlECmUD7 aTbv9T+MHs4M1EuQRj6eYjX8c7t9qOVwu7UDbPcicAIaVzFsalLmceqqESCtuOVs+tc4ChFAXmL 
eO9ORonF9hDmaHgVpgeTQnGBsQoq5rsgCQY3VQa4uyzA0uTcIOVf7ybt1J1JdWoY/gs/Dzpy9tt jlSlKO5OE29I6pXuz4MLDeFcs5ftwt6W+t2XV7TCo/K6QU0o8Ba0JsMx5p4vCNPY8LcMA0T/n3J aD19cAokbrUb8VnLxZMn7R4GSm4KJR64WKPHbuy1tR2owiJAlwjGGma89GqCRpXQ/X4V056nZlh 34KGgrQgNOBQfNLRMGR8TjnfBsrMrR+jw66XeloSXFhmoZTEwVeG6sROECXvg= X-Google-Smtp-Source: AGHT+IECvy8wsfwYygzgHpvNBsk6OsB2Az3ifuap9gn5TViDmzLyWX3HaNQQ51t3DUydcesHNZXR3Q== X-Received: by 2002:a05:6a21:6d96:b0:334:a784:3046 with SMTP id adf61e73a8af0-3614ed971e2mr14909542637.38.1764011777067; Mon, 24 Nov 2025 11:16:17 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:16 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:51 +0800 Subject: [PATCH v3 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-8-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=7126; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=mVLrWCcOCF2opEwIpHmoO8gRgwudRDdzZKps7g3enEw=; b=gjX6UZ//yplqGvwKWzYk0/CGzU4i15dfWYfnqKLpnNStNKAwg1KK+h+8qq0JEo/rh7L7WJgEM ZE9ktkILD3nDsYG5yQ0QDKAumJbGDDjOdE35IHicpm9T9e0+t2jBweo X-Developer-Key: 
i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Nhat Pham The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry belongs to shmem during swapoff. However, swapoff has since been rewritten in the commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having swap count =3D=3D SWAP_MAP_SHMEM value is basically the same as having swap count =3D=3D 1, and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only difference of note is that swap_shmem_alloc() does not check for -ENOMEM returned from __swap_duplicate(), but it is OK because shmem never re-duplicates any swap entry it owns. This will still be safe if we use (batched) swap_duplicate() instead. This commit adds swap_duplicate_nr(), the batched variant of swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated swap_shmem_alloc() helper to simplify the state machine (both mentally and in terms of actual code). We will also have an extra state/special value that can be repurposed (for swap entries that never get re-duplicated).
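The batched duplication described above can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: toy_swap_map, toy_swap_duplicate_nr() and toy_swap_duplicate() are hypothetical names, and the model ignores SWAP_HAS_CACHE, cluster locking and swap count continuation (where the kernel would ask for a continuation, the toy simply fails).

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code: one count byte per swap slot. */
#define TOY_SWAP_MAP_MAX 0x3e	/* same numeric value as SWAP_MAP_MAX */
#define TOY_NR_SLOTS     16

static unsigned char toy_swap_map[TOY_NR_SLOTS];

/*
 * Increase the count of nr contiguous slots, starting at offset, by 1.
 * Returns 0 on success. Returns -1, leaving every slot untouched, if any
 * slot in the range is already at the maximum count (assumption: the toy
 * has no count continuation, so it just refuses).
 */
static int toy_swap_duplicate_nr(size_t offset, int nr)
{
	assert(nr > 0 && offset + (size_t)nr <= TOY_NR_SLOTS);

	/* First pass: validate the whole range before changing anything. */
	for (int i = 0; i < nr; i++) {
		if (toy_swap_map[offset + i] >= TOY_SWAP_MAP_MAX)
			return -1;
	}

	/* Second pass: commit the increment to every slot in the range. */
	for (int i = 0; i < nr; i++)
		toy_swap_map[offset + i]++;

	return 0;
}

/* The single-entry variant is just the batched one with nr == 1. */
static int toy_swap_duplicate(size_t offset)
{
	return toy_swap_duplicate_nr(offset, 1);
}
```

In this model a shmem-style caller would call toy_swap_duplicate_nr() once for a whole large folio, taking each slot from 0 to 1, which is the situation the commit message relies on when it says shmem never re-duplicates an entry it owns.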
Signed-off-by: Nhat Pham Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 15 +++++++-------- mm/shmem.c | 2 +- mm/swapfile.c | 42 +++++++++++++++++------------------------- 3 files changed, 25 insertions(+), 34 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 38ca3df68716..bf72b548a96d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -230,7 +230,6 @@ enum { /* Special value in first swap_map */ #define SWAP_MAP_MAX 0x3e /* Max count */ #define SWAP_MAP_BAD 0x3f /* Note page is bad */ -#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */ =20 /* Special value in each swap_map continuation */ #define SWAP_CONT_MAX 0x7f /* Max count */ @@ -458,8 +457,7 @@ bool folio_free_swap(struct folio *folio); void put_swap_folio(struct folio *folio, swp_entry_t entry); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern void swap_shmem_alloc(swp_entry_t, int); -extern int swap_duplicate(swp_entry_t); +extern int swap_duplicate_nr(swp_entry_t entry, int nr); extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); @@ -514,11 +512,7 @@ static inline int add_swap_count_continuation(swp_entr= y_t swp, gfp_t gfp_mask) return 0; } =20 -static inline void swap_shmem_alloc(swp_entry_t swp, int nr) -{ -} - -static inline int swap_duplicate(swp_entry_t swp) +static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages) { return 0; } @@ -569,6 +563,11 @@ static inline int add_swap_extent(struct swap_info_str= uct *sis, } #endif /* CONFIG_SWAP */ =20 +static inline int swap_duplicate(swp_entry_t entry) +{ + return swap_duplicate_nr(entry, 1); +} + static inline void free_swap_and_cache(swp_entry_t entry) { free_swap_and_cache_nr(entry, 1); diff --git a/mm/shmem.c b/mm/shmem.c index d08248fd67ff..eb9bd9241f99 100644 --- a/mm/shmem.c +++ 
b/mm/shmem.c @@ -1654,7 +1654,7 @@ int shmem_writeout(struct folio *folio, struct swap_i= ocb **plug, spin_unlock(&shmem_swaplist_lock); } =20 - swap_shmem_alloc(folio->swap, nr_pages); + swap_duplicate_nr(folio->swap, nr_pages); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); =20 BUG_ON(folio_mapped(folio)); diff --git a/mm/swapfile.c b/mm/swapfile.c index 5853db044031..cc07246985ef 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -201,7 +201,7 @@ static bool swap_is_last_map(struct swap_info_struct *s= i, unsigned char *map_end =3D map + nr_pages; unsigned char count =3D *map; =20 - if (swap_count(count) !=3D 1 && swap_count(count) !=3D SWAP_MAP_SHMEM) + if (swap_count(count) !=3D 1) return false; =20 while (++map < map_end) { @@ -1519,12 +1519,6 @@ static unsigned char swap_entry_put_locked(struct sw= ap_info_struct *si, if (usage =3D=3D SWAP_HAS_CACHE) { VM_BUG_ON(!has_cache); has_cache =3D 0; - } else if (count =3D=3D SWAP_MAP_SHMEM) { - /* - * Or we could insist on shmem.c using a special - * swap_shmem_free() and free_shmem_swap_and_cache()... - */ - count =3D 0; } else if ((count & ~COUNT_CONTINUED) <=3D SWAP_MAP_MAX) { if (count =3D=3D COUNT_CONTINUED) { if (swap_count_continued(si, offset, count)) @@ -1622,7 +1616,7 @@ static bool swap_entries_put_map(struct swap_info_str= uct *si, if (nr <=3D 1) goto fallback; count =3D swap_count(data_race(si->swap_map[offset])); - if (count !=3D 1 && count !=3D SWAP_MAP_SHMEM) + if (count !=3D 1) goto fallback; =20 ci =3D swap_cluster_lock(si, offset); @@ -1676,12 +1670,10 @@ static bool swap_entries_put_map_nr(struct swap_inf= o_struct *si, =20 /* * Check if it's the last ref of swap entry in the freeing path. - * Qualified value includes 1, SWAP_HAS_CACHE or SWAP_MAP_SHMEM. 
*/ static inline bool __maybe_unused swap_is_last_ref(unsigned char count) { - return (count =3D=3D SWAP_HAS_CACHE) || (count =3D=3D 1) || - (count =3D=3D SWAP_MAP_SHMEM); + return (count =3D=3D SWAP_HAS_CACHE) || (count =3D=3D 1); } =20 /* @@ -3674,7 +3666,6 @@ static int __swap_duplicate(swp_entry_t entry, unsign= ed char usage, int nr) =20 offset =3D swp_offset(entry); VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - VM_WARN_ON(usage =3D=3D 1 && nr > 1); ci =3D swap_cluster_lock(si, offset); =20 err =3D 0; @@ -3734,27 +3725,28 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) return err; } =20 -/* - * Help swapoff by noting that swap entry belongs to shmem/tmpfs - * (in which case its reference count is never incremented). - */ -void swap_shmem_alloc(swp_entry_t entry, int nr) -{ - __swap_duplicate(entry, SWAP_MAP_SHMEM, nr); -} - -/* - * Increase reference count of swap entry by 1. +/** + * swap_duplicate_nr() - Increase reference count of nr contiguous swap en= tries + * by 1. + * + * @entry: first swap entry from which we want to increase the refcount. + * @nr: Number of entries in range. + * * Returns 0 for success, or -ENOMEM if a swap_count_continuation is requi= red * but could not be atomically allocated. Returns 0, just as if it succee= ded, * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), wh= ich * might occur if a page table entry has got corrupted. + * + * Note that we are currently not handling the case where nr > 1 and we ne= ed to + * add swap count continuation. This is OK, because no such user exists - = shmem + * is the only user that can pass nr > 1, and it never re-duplicates any s= wap + * entry it owns. 
*/ -int swap_duplicate(swp_entry_t entry) +int swap_duplicate_nr(swp_entry_t entry, int nr) { int err =3D 0; =20 - while (!err && __swap_duplicate(entry, 1, 1) =3D=3D -ENOMEM) + while (!err && __swap_duplicate(entry, 1, nr) =3D=3D -ENOMEM) err =3D add_swap_count_continuation(entry, GFP_ATOMIC); return err; } --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f180.google.com (mail-pf1-f180.google.com [209.85.210.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0072F2D7DC2 for ; Mon, 24 Nov 2025 19:16:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.180 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011784; cv=none; b=TIyJRaHp3YdlpE0amfNsnPCHbD9/JixCvkUvrQNOq0V2OrqcuDvjLDwb/4pk5BzWaSj+U0u2982z4Bw6O3/YqlDlomSxMEGoCi5jtQWxXK7uQaxEO7COGMu9tJQCCwVU89j/uo1kKn8tQg3c2USv3af8jJXtJ94GoO2/zMd1LsE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011784; c=relaxed/simple; bh=Ottae+wfp4J+mOm2zx+2NKMOTXCf9uuaZZhSfcDDjuc=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=hbUXriXxyBWc2UxMf5wW2mQnuYjpyZe5thnA8k2quatfmbvfifNzzVVL08PqMzA3B4+HnPgaBFDYu4r0TizfFRDF/F+ysIPx6LUX1+g60Hh99L77psR6SiPWJbU9BXdEqI5MLF5o6nIZcn2EJ6fmmuTYY21apu9ST1mXk36jvMo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=YGxkaCHh; arc=none smtp.client-ip=209.85.210.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com 
header.i=@gmail.com header.b="YGxkaCHh" Received: by mail-pf1-f180.google.com with SMTP id d2e1a72fcca58-7aace33b75bso4335444b3a.1 for ; Mon, 24 Nov 2025 11:16:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011782; x=1764616582; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=FKeLqk+1/RWvLu8QYWhyCh1bLbzPzjz8T6cPe9gn1XY=; b=YGxkaCHhPp+8vXSnV4JRLY+DnfO7elnHQ6j14tWGLP8k0E7DITSyGouoC9pxK8GChI ooD/W1Cg2TN4DXi9nLXjoGqL0BnRObqZfexdTe9pcyoHsQKsp5Fus86ghGkoM2pjK9lX 3mMzVhOI+2ZVNkRHL9fpBSLZRz9LZVZ+FdCq90jIn2aklztlPhhJOma7XGN8NDutLkBt qG0dh1hIlCB8DYtF6adL6szABNFA5U+ThFWgPOEI0FLebJlrVYI4kvbfeFpILpZZVLg1 uHyohA9e2E9b33VndUlg4kVr2mJlbmgOiPNlLMLm3FPsL9tdLmjn/np4vl/115ww8fJZ 11Iw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011782; x=1764616582; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=FKeLqk+1/RWvLu8QYWhyCh1bLbzPzjz8T6cPe9gn1XY=; b=IwO+3ve0KJ9ljwm8b+h2EvAWqJFONK0Vbzna2bLkpz6egp4zGWk0cTenguWrR5hho/ 1eYgc0at36WbPhec1xIdSaflF1JcJztrp/XeSJQykfZKdAsCfeVEpHeTFSvqERBTZ8UR gXSF5BpaVv7F17O84IWDjrdOplHIcfVTHfwRmi/Fh3n0mZoNtyjRor6Jrezj3KbBRCT0 W347YuJCAtNOvTBS5AyoxZNUqM6QvPySx5E0U5aq7OZ7END6B2WDXgCCkVcCIERDBaMX Ndsmi1LfGu8OoeQUePTNa+oGmBFaAXoTcULv4t6UXbQputG8kDPntsdf0ch4i88841v2 cCeA== X-Forwarded-Encrypted: i=1; AJvYcCUN7EevjtRkvf91Ca4uBSz9awzBPF+vtFUSyq5DBrFSds9FY1yokrs65tHBo+wGn8a1xuQBKUcd5D+ZKdY=@vger.kernel.org X-Gm-Message-State: AOJu0YxzC52LNAZF4DHjs2SJHyv6Z4zVqZoMrDqE52/uNN3pi7YFGcwB 470ZBMS8dRxKmCrOfTWwrkU9Fxcgp11UPX9ISEFQurawpKNkmx1iSsG4 X-Gm-Gg: ASbGncsmHMXVq4QtvS1jnxM6vlpbf8undfY6DIlKOSZ0Rb4ERD6ICLocm24WzDN9kHF hgwfqw0vkJEP12Mu5z2rS5mL7K0qBLTnDEi3uvYlQ9XLdNRccQZvZTt75wl3hWcPUEABs4ANKkZ 
z9rLY6oJX6oJjduOQE3am07bIRfK9+IaoGSiPThRUow2o+T1KM+jATymb7uQBPt7QmdQhNVeHo2 0e6cUSFCh/wDPIkC7vmhFaSMGr5KQKxVkS0caOyK3ZhtXeHpXXLCjm9zqG0odbtQSu0+Z3AyiSp Q7UC5MNgSBZ8yLHeMjDibTflUcGiryha0LaEEx03+vL/ccxJsLc24QIgbAjPSmluLvq8nvq2i30 +pVGeYtmgzKW4zvi06yA1ua6APDqo4vyMoWH+fKSJ+Pl/0YPFIFxH4s4WXTQkH+paF+aIiigdth y/ETw091XuLlsIzAMUxSwSUmKsJmIBSxqCCTLio51OiaIgafDz X-Google-Smtp-Source: AGHT+IH4BgyhtjQ6aAK/eANfM70fuqbmBXp38dAAyIsWduO5W2Y5IejVtVYXYVWdCO6+aJeJ0iJSng== X-Received: by 2002:a05:6a20:6a27:b0:350:d523:808d with SMTP id adf61e73a8af0-36150e6b162mr14792076637.15.1764011782031; Mon, 24 Nov 2025 11:16:22 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:21 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:52 +0800 Subject: [PATCH v3 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-9-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=4721; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=zJsDgTvoRf5ue0A09hZMBoXms74WK9KtuFF78M2jiUM=; 
b=wEJdetMpxtMi4VDQmNp3IzzjeOOKJVStHD+BWwhHDiebKbwfmQiNS/Ue8io/7d460vuXnxeCl 8Yg+a6i0i2vD3t/fSngjap9MV78qj0T5rRMHgifocJ6AaXj1T4lydd0 X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song When checking if a swap entry is swapped out, we simply check if the bitwise result of the count value is larger than 0. But SWAP_MAP_BAD will also be considered as a swap count value larger than 0. SWAP_MAP_BAD being considered as a count value larger than 0 is useful for the swap allocator: bad slots are seen as used slots, so the allocator skips them. But for the swapped out check, this isn't correct. There is currently no observable issue. The swapped out check is only useful for readahead and the folio swapped-out status check. For readahead, the swap cache layer will abort upon checking and updating the swap map. For the folio swapped-out status check, the swap allocator will never allocate a bad slot to a folio, so that part is fine too. The worst that could happen now is redundant allocation/freeing of folios and wasted CPU time. This also makes it easier to get rid of swap map checking and update during folio insertion in the swap cache layer.
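For illustration only (not part of the patch): a minimal userspace sketch of the old and new swapped-out checks. The flag values are assumed to mirror include/linux/swap.h at the time of writing, and the model_* helpers are hypothetical names that exist only in this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match include/linux/swap.h; treat as assumptions. */
#define SWAP_HAS_CACHE	0x40	/* slot has a folio in the swap cache */
#define SWAP_MAP_BAD	0x3f	/* slot is marked bad and never allocated */

/* Strip the cache flag, leaving the count bits (and any continuation flag). */
static unsigned char model_swap_count(unsigned char ent)
{
	return ent & ~SWAP_HAS_CACHE;
}

/* Old check: any non-zero count is treated as swapped out. */
static bool model_swapped_old(unsigned char ent)
{
	return !!model_swap_count(ent);
}

/* New check: a bad slot is not considered swapped out. */
static bool model_swapped_new(unsigned char ent)
{
	unsigned char count = model_swap_count(ent);

	return count && count != SWAP_MAP_BAD;
}

/* Small self-check showing where the two checks differ. */
static void model_selfcheck(void)
{
	assert(model_swapped_old(SWAP_MAP_BAD));	/* old: looks swapped out */
	assert(!model_swapped_new(SWAP_MAP_BAD));	/* new: rejected */
	assert(model_swapped_new(1));			/* real reference still counts */
	assert(!model_swapped_new(SWAP_HAS_CACHE));	/* cache only, count is 0 */
}
```

The only input on which the two checks disagree is a bad slot, which is the case the allocator wants to see as "used" and the swapped-out check wants to see as "not swapped out".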
Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 6 ++++-- mm/swap_state.c | 4 ++-- mm/swapfile.c | 22 +++++++++++----------- 3 files changed, 17 insertions(+), 15 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index bf72b548a96d..936fa8f9e5f3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -466,7 +466,8 @@ int find_first_swap(dev_t *device); extern unsigned int count_swap_pages(int, int); extern sector_t swapdev_block(int, pgoff_t); extern int __swap_count(swp_entry_t entry); -extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t en= try); +extern bool swap_entry_swapped(struct swap_info_struct *si, + unsigned long offset); extern int swp_swapcount(swp_entry_t entry); struct backing_dev_info; extern struct swap_info_struct *get_swap_device(swp_entry_t entry); @@ -535,7 +536,8 @@ static inline int __swap_count(swp_entry_t entry) return 0; } =20 -static inline bool swap_entry_swapped(struct swap_info_struct *si, swp_ent= ry_t entry) +static inline bool swap_entry_swapped(struct swap_info_struct *si, + unsigned long offset) { return false; } diff --git a/mm/swap_state.c b/mm/swap_state.c index e3c01e5bc978..a99411b70c99 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -527,8 +527,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry,= gfp_t gfp_mask, if (folio) return folio; =20 - /* Skip allocation for unused swap slot for readahead path. */ - if (!swap_entry_swapped(si, entry)) + /* Skip allocation for unused and bad swap slot for readahead. */ + if (!swap_entry_swapped(si, swp_offset(entry))) return NULL; =20 /* Allocate a new folio to be added into the swap cache. */ diff --git a/mm/swapfile.c b/mm/swapfile.c index cc07246985ef..cb59930b6415 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1762,21 +1762,21 @@ int __swap_count(swp_entry_t entry) return swap_count(si->swap_map[offset]); } =20 -/* - * How many references to @entry are currently swapped out? 
- * This does not give an exact answer when swap count is continued, - * but does include the high COUNT_CONTINUED flag to allow for that. +/** + * swap_entry_swapped - Check if the swap entry at @offset is swapped. + * @si: the swap device. + * @offset: offset of the swap entry. */ -bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry) +bool swap_entry_swapped(struct swap_info_struct *si, unsigned long offset) { - pgoff_t offset =3D swp_offset(entry); struct swap_cluster_info *ci; int count; =20 ci =3D swap_cluster_lock(si, offset); count =3D swap_count(si->swap_map[offset]); swap_cluster_unlock(ci); - return !!count; + + return count && count !=3D SWAP_MAP_BAD; } =20 /* @@ -1862,7 +1862,7 @@ static bool folio_swapped(struct folio *folio) return false; =20 if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio))) - return swap_entry_swapped(si, entry); + return swap_entry_swapped(si, swp_offset(entry)); =20 return swap_page_trans_huge_swapped(si, entry, folio_order(folio)); } @@ -3673,10 +3673,10 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) count =3D si->swap_map[offset + i]; =20 /* - * swapin_readahead() doesn't check if a swap entry is valid, so the - * swap entry could be SWAP_MAP_BAD. Check here with lock held. + * Allocator never allocates bad slots, and readahead is guarded + * by swap_entry_swapped. 
*/ - if (unlikely(swap_count(count) =3D=3D SWAP_MAP_BAD)) { + if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) { err =3D -ENOENT; goto unlock_out; } --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f169.google.com (mail-pf1-f169.google.com [209.85.210.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D45062DE1E4 for ; Mon, 24 Nov 2025 19:16:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.169 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011789; cv=none; b=opM9wSpfrNJ0u2/HSpwz7RQVlozAcU5aQbcX2+ggmTdLbm26j6JUmxyYQXo7HS9xxe4atCA6vzXRMUqaEYnPqEd/6eOKbTR0BxDEkJDBsMb9c2zJK25iNb9twwiuBF6GjAazgNRPSQeo1b/qpcnX/Htlct/toxXrDp3R9p3Seqg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011789; c=relaxed/simple; bh=nwrUqbmghzFF0bV5vQ1bPGx6AcmszDgsinjIPzKXTKc=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=chw0oDFqIq98q4RTOL5uUJez5kvm9DKXbfbTDShcmHcKGRIMiMcu0HynJl9U/oKi9es1mFxme7s0fcc4+NKnbKz0fBaXqo2yJ3ic4vgf2IR/ixp7nHtKXty/AdEOYkBk/2jMtgUURxXLVB24It9zfVWtfR0xFwYQIY4vh57HA08= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=YH5u0mAM; arc=none smtp.client-ip=209.85.210.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="YH5u0mAM" Received: by mail-pf1-f169.google.com with SMTP id d2e1a72fcca58-7b22ffa2a88so4289103b3a.1 for ; Mon, 24 Nov 2025 11:16:27 -0800 (PST) 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011787; x=1764616587; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=eOxnIY0a99Y52eqXzKdRmg80jUOJW+tE1KL/X+5vX9M=; b=YH5u0mAMhN3W8TR2j6MKPuUJHG5LqSlAALq4kxweKdRLl/4TYkPY9OgxNjgzBwsHB+ HZFYKwAUk+8B6GMXw5zz6QsZxwI49AwO43HThqBG9jhCqaf6m+kiTNOjKwjwtnMSvMo6 4naU5w4JXLnqLqDFz/kZwo3ezGjH5uVlCvhfQ7xVip7Pn5S8t3nQCtZLzzEgQEq70AdD wuOsWaX2PAmofyHv92IeSoQ+LWdfTzSxGvXuVhktI+wZB9KmaqYD0TMUNLbOGRCOejEQ fB801V5gno4H8LGaAIU7zvoo/rDYDQjDkpdaEn+Dc6RqLoX/joz7hgwNWUMz7E3mYQfq JA1w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011787; x=1764616587; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=eOxnIY0a99Y52eqXzKdRmg80jUOJW+tE1KL/X+5vX9M=; b=DffxJ2GG9vhNztN8VXakD7Y/ELPzgq5RsMy809kf+PX9N3I75VgPh3yXp71FZ6EQ+q fo2vPYfWZrF1lTWNGNJx/uzZTRJB7bZ84Ii+anfuO0+iQIrAUZeqhe7F/5dA0kvkTD2q RuhtPsJTyyzQuM8b7zfAVSScRCX7VRijrMjlwQrolDGUsSmVnBgUkTaVX2oYQRQ8OKM9 IskEkR3fesLRnjj8N9W12kkN2dfxxwSTgNO/9Z8jZ4fgn2VILkC+DnWImBptL93/WZRh BJysfX+0y251BGpW9luCgU53Z1qdIdUgXAVuNGdty1iaVx3/SjrKCElaICJ0SF+a7HMF B9TA== X-Forwarded-Encrypted: i=1; AJvYcCX84pAzWwjc1Xvd1dzjzvRDcuol1Ywu/+L8XvEtoGCVWaTJMttkZWklwjFDGlf+BaXxPOownh9Cnz/DT9k=@vger.kernel.org X-Gm-Message-State: AOJu0YznRXtbHY+FqDbQpy+7AE4YfVzke+G6Sh5gY7ivRsjCxI8YrJcT kiKUfDQOkg4aYQNuui4mAmcWbkDZhws0OYkBCLen/tgCAiBdN/Uisx7p X-Gm-Gg: ASbGnct0FqxLDT8/OYhHIxwFZO0mLAZjIn9v8mMhSlCWLNx8SM5qCTL9a2fNdkTx9ag tCuVghjYSHLJagHoDpOxJtSslo414JSfNChzoKnHVLFoYnZjic2M9LS8Pk/E8+SHRznAmlL2hjx 0XGHgBEmF0cRIs81iSI+uq91UwpC5cMuCIvt5qstl6UkIOziHp2paT4iTPB89KXxQ1jZXhRLehk 30p5BGiB7SkXx0iCmuFIN3vk4+caZV4Gpa0OMsR2hV9L6g7OZfZhpeBeGy/ZrmUKUClHHP4neKN 
g1t6UheefxGq8+jPJ8lXRuWaMrjkFoxo0lSIH6CgyrHKnioKIrmnYHW31wyXZ/s3fdAmfCoB0hA 8HqUQyIpwqgsQAcTF/nVSq1UKC9YVzsB4aummNjArqi4QTQ2LXiqtqaJID5z0rA08Q1CbE7yAAa cV1uLXss+E93rzeAxVLcxdZU9edWRh5FCdF9JLvC34CjPhTxIv X-Google-Smtp-Source: AGHT+IH2rOQ7eWjFBbbc/91093MxswZ+bcVKJjX3xC+3hLOctGc8E18g829dn1xUDBoO5jNLcrGkVQ== X-Received: by 2002:a05:6a20:7f8f:b0:35f:aa1b:bc02 with SMTP id adf61e73a8af0-36150ead3e1mr13261971637.26.1764011787031; Mon, 24 Nov 2025 11:16:27 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:26 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:53 +0800 Subject: [PATCH v3 10/19] mm, swap: consolidate cluster reclaim and usability check Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-10-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=4270; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=aUyVC2as+Gs9TiZQkBGnriVQMvOjS3mdGW8SaiNYelI=; b=6jf8DU9Y71Vc6lenDn92r1r5/qxPu/KMRxJuU4xK4VDA/xkkD2eazk3yHm0sPCCjy3iWn+QOD XI5kgQyF/tUDszPynUPDa5uCNxVhoVjJzHIDCiOXwAJ2FLfISQSAbFW X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= 
From: Kairui Song Swap cluster cache reclaim requires releasing the lock, so the cluster may become unusable after the reclaim. To prepare for checking swap cache using the swap table directly, consolidate the swap cluster reclaim and the check logic. We will want to avoid touching the cluster's data completely with the swap table, to avoid RCU overhead here. And by moving the cluster usable check into the reclaim helper, it will also help avoid a redundant scan of the slots if the cluster is no longer usable, and we will want to avoid touching the cluster. Also, adjust it very slightly while at it: always scan the whole region during reclaim, don't skip slots covered by a reclaimed folio. Because the reclaim is lockless, it's possible that new cache lands at any time. And for allocation, we want all caches to be reclaimed to avoid fragmentation. Besides, if the scan offset is not aligned with the size of the reclaimed folio, we might skip some existing cache and fail the reclaim unexpectedly. There should be no observable behavior change. It might slightly improve the fragmentation issue or performance. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 45 +++++++++++++++++++++++++++++---------------- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index cb59930b6415..bdbdb4a4c452 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -777,33 +777,51 @@ static int swap_cluster_setup_bad_slot(struct swap_cl= uster_info *cluster_info, return 0; } =20 +/* + * Reclaim drops the ci lock, so the cluster may become unusable (freed or + * stolen by a lower order). @usable will be set to false if that happens. 
+ */ static bool cluster_reclaim_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long start, unsigned long end) + unsigned long start, unsigned int order, + bool *usable) { + unsigned int nr_pages =3D 1 << order; + unsigned long offset =3D start, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - unsigned long offset =3D start; int nr_reclaim; =20 spin_unlock(&ci->lock); do { switch (READ_ONCE(map[offset])) { case 0: - offset++; break; case SWAP_HAS_CACHE: nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); - if (nr_reclaim > 0) - offset +=3D nr_reclaim; - else + if (nr_reclaim < 0) goto out; break; default: goto out; } - } while (offset < end); + } while (++offset < end); out: spin_lock(&ci->lock); + + /* + * We just dropped ci->lock so cluster could be used by another + * order or got freed, check if it's still usable or empty. + */ + if (!cluster_is_usable(ci, order)) { + *usable =3D false; + return false; + } + *usable =3D true; + + /* Fast path, no need to scan if the whole cluster is empty */ + if (cluster_is_empty(ci)) + return true; + /* * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. 
@@ -900,9 +918,10 @@ static unsigned int alloc_swap_scan_cluster(struct swa= p_info_struct *si, unsigned long start =3D ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages =3D 1 << order; - bool need_reclaim, ret; + bool need_reclaim, ret, usable; =20 lockdep_assert_held(&ci->lock); + VM_WARN_ON(!cluster_is_usable(ci, order)); =20 if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) goto out; @@ -912,14 +931,8 @@ static unsigned int alloc_swap_scan_cluster(struct swa= p_info_struct *si, if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim)) continue; if (need_reclaim) { - ret =3D cluster_reclaim_range(si, ci, offset, offset + nr_pages); - /* - * Reclaim drops ci->lock and cluster could be used - * by another order. Not checking flag as off-list - * cluster has no flag set, and change of list - * won't cause fragmentation. - */ - if (!cluster_is_usable(ci, order)) + ret =3D cluster_reclaim_range(si, ci, offset, order, &usable); + if (!usable) goto out; if (cluster_is_empty(ci)) offset =3D start; --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f176.google.com (mail-pf1-f176.google.com [209.85.210.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 013412D877F for ; Mon, 24 Nov 2025 19:16:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.176 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011796; cv=none; b=gJfIc25oVcKU5T5Xc7hn4Bj0NJqDmwWz7XjaufVsrsAK1vOdNJmWAcbSTA2iVd5VsFm/rlTROaC0OxfD+f6haxLjJKg5FHQiUVG3Y0mdM3bXMV9JuRNAB6c/BDkVvjrT4MRVgcuT8skY1MN+udWKb+9IZBs3zsaKoPVKqMQKYJE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011796; c=relaxed/simple; bh=ydyG7nRsB3shTBRPDVTg5QINk3fb6Ark0u5sZzeHh/M=; 
h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=jCMdNxB1dBtjZxjHjLQ5PRdGmEaArH8uLaLgnZlauEX/3YbBhRXdlQgcwdoN4DbFDJ8y2IvRz7jwDw0Ppt4euNsW4ApE1CjzrwzKB1x3nLeuM/3uxNunfwvWC+KJp7gd3mkI+2aqW6eaSN8E3zn5bCWZQHW8gwIdAGEBEVX+C9s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=cyvZdnB3; arc=none smtp.client-ip=209.85.210.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="cyvZdnB3" Received: by mail-pf1-f176.google.com with SMTP id d2e1a72fcca58-7ad1cd0db3bso3973306b3a.1 for ; Mon, 24 Nov 2025 11:16:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011792; x=1764616592; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=aApchueQuc9NX7QoG5SpFOXLdsh6omcVGTvQgPwgUXI=; b=cyvZdnB3DVwzs2DfOQ5XUQspXrG8T2yH8vh8cV3h1EvyE8L/sBsybYeCNOP783Qw0z BkzmzSRZNtqbyTBBT2f6szUctb8X36AygG0Jn0v+nU4qJpS7A6G+84CiwfoXvs8G2lyI bOhXUkJz4k/e2Yy6osp3DAObzYvJHpPmF8TCdYNuEYISG46ekiAefkHw3Plc+SIYNivx M6GlOhdbRc2zVZ/WzaePsk4gkeUqgRFW1IYNq+thRnpl8osClCAfHr1TLXlkUACr0dxI sEyPg8Wj0o2WEZDfWyy4Qq/dJG7OT6jEs8bcKOBKrnwmQUORn4sVsrJ4+8JT+IKwQMbe jpCA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011792; x=1764616592; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=aApchueQuc9NX7QoG5SpFOXLdsh6omcVGTvQgPwgUXI=; 
b=BBjNFFqamOimXNr1P2IfBnSZHLUhqTthyEmh23S+dbuijin7Uhqb2mHEOhkYMj0B6r lqxLIloyjA0F3gjlPSE67lOYIH0ci8YyUtZfsNwuivwUuKxk/VDBKeQ/GPFYTFCcRorF HrJ8GaBVMVbRGlPur2UNRtqeyWZGp2eMgj2FOOcccz9Bq9wMuoH1SAXhCKYgN6nD7WLV iPXCWNyJnHy89tOTzdyC1+H6pA++FVddlCzGf7elNZxx5k8DLq1WcfdnC23lAF4z7YSy uAshSEkX9ZmN8R8UHnH9Sm5ggyyrybPIPmTZwPwqOtxviKuMeyWzj34ZGxh3B4DYZJ3O ix5g== X-Forwarded-Encrypted: i=1; AJvYcCX6QZ2Qr0Uyb9J2vpWGKJVEQatmOHRwTfTeRxr+Ld9MeP6JBFfY+fn4/5KqMXVcLzpykxepNJwhdplKrME=@vger.kernel.org X-Gm-Message-State: AOJu0Yw5xzzoI0UCGmsEX61oa2WBCKgL2I6EjcWN/bB6gm9fFeOKDH22 czD6F18aaYKjLYJhYszPqGXrbmbc+ZUqc4jjSrKvx2ATSnIxLj1p2H/Z X-Gm-Gg: ASbGncsvb+HqfQuOQvc18nFwNhRGfyc0hCILgHjTDlkzBXbYyiCBF+FD1gx2Idz5eox txVQFm28dtbbT4373JTpb310dG4otqNOkK7fSTVExGpaKVAqiYG3Er3Jq3tEuaakZnBCb1z6DWY 4mcZf+w6++gT/unvVTr0batiYzEgLXN8eb8r0UVcBIrhYRtc5BP2Eyox5sbPeOeGUMTkOwsOD8D JoXIVmRTtD6emoSSLPpzTSafgcPrExdzyC8g5rZitejyj521zbctX08HcHn05SdH6XyfMbyn/BB kDP28z5RzAozlnJ/6wYO7PBLmMkjcaU2nnwTsA0ukTWrG90CXxEpSZcHOnJg3p9/MhT3MyEmOBO SqOrm8tJO98ewv7RyfeoFsOByhdBIOz89I/WpNn46b/kQVHjjKcuYje+w/oH4qIH1ezhhFIwL+M fq3Rgmduj/N19+n6MdNKBTRx18Sb45YIEGbmXR5GgTu7gHdxxw X-Google-Smtp-Source: AGHT+IEQsH3iGKnaf18AIE9fwY37Xal5hjhuN438/kkbKsBXRp1SnnyRRphbQAhP2oB7wIDvxgCBaA== X-Received: by 2002:a05:6300:8b0f:b0:35d:d477:a7e0 with SMTP id adf61e73a8af0-3614ebad215mr14443143637.15.1764011792157; Mon, 24 Nov 2025 11:16:32 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.27 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:31 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:54 +0800 Subject: [PATCH v3 11/19] mm, swap: split locked entry duplicating into a standalone helper Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: 
<20251125-swap-table-p2-v3-11-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=3213; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=pQmOxy2a4hlTElz+FVuPM6pQWOBbEh87j1+qRKen2wM=; b=dxdfFMK7ckPqm/0zEPcQesdz8rcrvfbMXOndagpSD5xrTTwYVIz97wkW/J0xLiWGxa7LtAP6m LGJNIkCyL6kAtqXcU4+Gvlqi3r+BoeXNgoq51uNZ9JbDPqJHdR56Okm X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song No feature change, split the common logic into a stand alone helper to be reused later. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 62 +++++++++++++++++++++++++++++--------------------------= ---- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index bdbdb4a4c452..5b173ef0bc74 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3662,26 +3662,14 @@ void si_swapinfo(struct sysinfo *val) * - swap-cache reference is requested but the entry is not used. -> ENOENT * - swap-mapped reference requested but needs continued swap count. 
-> EN= OMEM */ -static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) +static int swap_dup_entries(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage, int nr) { - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long offset; - unsigned char count; - unsigned char has_cache; - int err, i; - - si =3D swap_entry_to_info(entry); - if (WARN_ON_ONCE(!si)) { - pr_err("%s%08lx\n", Bad_file, entry.val); - return -EINVAL; - } - - offset =3D swp_offset(entry); - VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - ci =3D swap_cluster_lock(si, offset); + int i; + unsigned char count, has_cache; =20 - err =3D 0; for (i =3D 0; i < nr; i++) { count =3D si->swap_map[offset + i]; =20 @@ -3689,25 +3677,20 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) * Allocator never allocates bad slots, and readahead is guarded * by swap_entry_swapped. */ - if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) { - err =3D -ENOENT; - goto unlock_out; - } + if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) + return -ENOENT; =20 has_cache =3D count & SWAP_HAS_CACHE; count &=3D ~SWAP_HAS_CACHE; =20 if (!count && !has_cache) { - err =3D -ENOENT; + return -ENOENT; } else if (usage =3D=3D SWAP_HAS_CACHE) { if (has_cache) - err =3D -EEXIST; + return -EEXIST; } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { - err =3D -EINVAL; + return -EINVAL; } - - if (err) - goto unlock_out; } =20 for (i =3D 0; i < nr; i++) { @@ -3726,14 +3709,31 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) * Don't need to rollback changes, because if * usage =3D=3D 1, there must be nr =3D=3D 1. 
*/ - err =3D -ENOMEM; - goto unlock_out; + return -ENOMEM; } =20 WRITE_ONCE(si->swap_map[offset + i], count | has_cache); } =20 -unlock_out: + return 0; +} + +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) +{ + int err; + struct swap_info_struct *si; + struct swap_cluster_info *ci; + unsigned long offset =3D swp_offset(entry); + + si =3D swap_entry_to_info(entry); + if (WARN_ON_ONCE(!si)) { + pr_err("%s%08lx\n", Bad_file, entry.val); + return -EINVAL; + } + + VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); + ci =3D swap_cluster_lock(si, offset); + err =3D swap_dup_entries(si, ci, offset, usage, nr); swap_cluster_unlock(ci); return err; } --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f180.google.com (mail-pf1-f180.google.com [209.85.210.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2E5B82DEA67 for ; Mon, 24 Nov 2025 19:16:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.180 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011800; cv=none; b=kj22SG4/hjBRklQcSPLxQzs6ktW8WymUe/pB/fDnTrRiD9UTrBQvufHEueh4JJC20pghQg6u57Hnjbo/n3EINtkKV1CNbdwgGz7LABozRhXtZks82FIfd+EcNgARzbYm4G35UwvOz1vP1ERMbtaq21NrDr8vE5GkF1/WynT6XEQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011800; c=relaxed/simple; bh=El0sVKtCwdiJ/Vp4EbjgL9pB0srZuzKTapkdEl9M+48=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=QJkWh+a5ciXhVRROBgb+U0WYZ2OhZ0xtssuc4gvMF5X5+kQUmESvnx6sDfr03z+yaeyzgq+42L2WrXbXUs/mS3Y7xXe2z+rjuqRhr2mkv5sWikOuLrD7pgiRp+cKuluZewy2VjbfMdiMnpcfS82T6PTmeBrk1fN1qrrRo1tNzo0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) 
header.d=gmail.com header.i=@gmail.com header.b=H27+xkOw; arc=none smtp.client-ip=209.85.210.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="H27+xkOw" Received: by mail-pf1-f180.google.com with SMTP id d2e1a72fcca58-7bc248dc16aso3494619b3a.0 for ; Mon, 24 Nov 2025 11:16:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011797; x=1764616597; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=pf/cbbwuQPv4so2HbL/+KOwhVvFF2m13qb5GevDv0r0=; b=H27+xkOw8tmMz0syhrSNvVMVUTV9Gdg8WpJb/J2I8bqLd6wJfKORrPm9RtsDViS5uj sx363zsXPWoh9IRnraQAxI2JccKpsh597sFJwG6pIhxBw+RkXQ5qwrxIeV1z43KgtKfw /WqZznTg2RIFRFEdkKpkMBoAFPMN3X5TQsAy7QOeK6+iauqQB0a1nyMdY5MBMhrZiUS0 NlT6cd4xzi5Sot1wIeew0z/iY7TmSQBTSqmYBdfyEx8MsehLwA9TIij/g4J+ftQ79UTV IZKxRpgPUD7nFVpp+iaZwPiGHZRnu/ulTQs0erBrWas+CIWBfb4UyazEtlrtQGiZT6ms LJRQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011797; x=1764616597; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=pf/cbbwuQPv4so2HbL/+KOwhVvFF2m13qb5GevDv0r0=; b=XoN8efKZ/bmtmZ8WT24K0V1KzLMMB+U4IfuZm9Unz9BDkWAzKqCJFOkhYbtPBiPhWm JccxuFqGhCKsWKUuIQcZ6foPtmAVz2KBjyifyerHdyHzP9KAMjlQ8GbM6Nl7Th2FmB2y Q1CTThlQOuMAnwqSsVBehX+zLAJ7ohu4eqe/cmtZvF0KmqzAAzPFBMs+/dwPVId98G1x V/x809a5pr5qMC/sIohB1lwn7YYIbIKHPmiiuav8M8aL9FbJR/K4RXcHenvI675FCTdq 5wYUcHF3dYYg0xO6XyKoG2e0MjNC8I73BqVBflyfWUhf3ua5aUi7B8vPVpPZ7i2HMQJ4 3kfQ== X-Forwarded-Encrypted: i=1; 
AJvYcCU+G+/zJOrFEKmFmLP0JaKe/hZlA7lb/XWd636gnWi1+njb/FZe/hoHVARw3D1UbNCmAhU1Zw9y3v2AfS4=@vger.kernel.org X-Gm-Message-State: AOJu0YwVOefKV+R4ME3me0et183l4Ldw+Iqu7tH9bjPvSVJfMbs1C3Jq t0IGj7aUA7if8kYKhfyegw8zYM3b9nnpYV9pSUQvo79xKpb7YWmoJH6z X-Gm-Gg: ASbGncvfkBxN1ua+S8kScT87ePeMkjEXL7r5+1w9n9QUhaQfKHSFnWrpSfeaX4rR8Ul 5FiXYZwRZfwKYPtX8FOQ8ClDP3MP8uJVQSZrTAju43wuwAetq6wN9VfWvwB0qh+ru0jVhjQFppj 4uE7IkJ6dGpA4SXfYZz/X8u0OcrTMpwf7iMKGKbkfT2/nVK+vLgZfTAPPPrk3yQXLwDolUPlHcw GdgcZMy9ziwKJuwShOwoVtiksDDfhksdZyYBxqhg73LEv7lNxkCdCt2muwrRzICqKOplsnXyoN2 10ndBnXl4Ye39Mfo9rukHrUjNgvSblk7qVz5ktyZm9ye0kEWf5gcEp6sQnjeXJ9UDJ1+DdKjrJZ EMcrnFqj9m1ojivRoVa4eXd4ykYDmBwAWYoCo+bwAE3y0eVkyKWttCJGevR3SNDdcgwduVMIQob TfDZQAKg1k+V9OzdmnzNXslKc8pz+KxD/3WspFFpJopE/6qM65 X-Google-Smtp-Source: AGHT+IHQV5VkZCtvC5zBu86MTJiizn8IrXJuvUIUA4DHSYAWVjquEPbsNdHmE6Oo4mnT3yQLmT3QEA== X-Received: by 2002:a05:6a20:e212:b0:359:8dd3:9121 with SMTP id adf61e73a8af0-3614ede036fmr14823031637.33.1764011797255; Mon, 24 Nov 2025 11:16:37 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:36 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:55 +0800 Subject: [PATCH v3 12/19] mm, swap: use swap cache as the swap in synchronize layer Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-12-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying 
Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=14399; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=0RMabqX1iKeFsfqlJddyVwD0HFh8SlI/mslw+tg5GjI=; b=Qnpu2MqwZJyyOQ1+BJr7bXDdnAvlaqDU0CYMvYp80uqoM1+Jq+mJh5u7Yyr4Raq+HkwbbKhOh jL6EfXLylVEBBqUwcy4TAQhZNUq/kgWZm1xCAGvBvoh4JT5d4h8h4Pn X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song

Current swap-in synchronization mostly uses the swap_map's SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual work to swap in a folio.

This has been causing many issues as it's just a poor implementation of a bit lock. Raced users have no idea what is pinning a slot, so they have to loop with a schedule_timeout_uninterruptible(1), which is ugly and causes long tail latency or other performance issues. Besides, the abuse of SWAP_HAS_CACHE has been causing many other troubles for synchronization or maintenance.

This is the first step to remove this bit completely.

We have just removed all swap-in paths that bypass the swap cache, and now both the swap cache and swap map are protected by the cluster lock. So now we can resolve the swap synchronization with the swap cache layer directly using the cluster lock. Whoever inserts a folio in the swap cache first does the swap-in work. And because folios are locked during swap operations, other raced users will just wait on the folio lock.

SWAP_HAS_CACHE will be removed in a later commit. For now, we still set it for some remaining users. But now we do the bit setting and swap cache folio adding in the same critical section, after the swap cache is ready. No one will have to spin on the SWAP_HAS_CACHE bit anymore.
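
As a rough illustration of the "whoever inserts a folio first does the work" rule, here is a small single-threaded userspace model in C. It is only a sketch under stated assumptions: `model_slot`, `model_folio` and `model_cache_add()` are hypothetical names, not kernel APIs, and the cluster lock is represented by comments only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_folio {
	int id;
};

struct model_slot {
	unsigned char swap_count;    /* references held by page tables etc. */
	struct model_folio *cached;  /* NULL while nothing is in the cache */
};

/*
 * Try to become the owner of the swap-in work for one slot.
 * The kernel would do this under the cluster lock; this model is
 * single threaded, so the lock only appears as comments.
 */
static int model_cache_add(struct model_slot *slot, struct model_folio *folio,
			   struct model_folio **existing)
{
	*existing = NULL;

	/* cluster lock would be taken here */
	if (slot->cached) {
		/* Lost the race: the caller can wait on this folio instead. */
		*existing = slot->cached;
		return -EEXIST;
	}
	if (slot->swap_count == 0)
		return -ENOENT;      /* slot is no longer in use */

	slot->cached = folio;        /* won the race: caller does the read */
	/* cluster lock would be released here */
	return 0;
}
```

In this model a loser never has to poll: it gets a pointer to the winner's folio and can block on that folio's lock, which is the behavior the text above describes.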
This both simplifies the logic and should improve the performance, eliminating issues like the one solved in commit 01626a1823024 ("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"), or the "skip_if_exists" from commit a65b0e7607ccb ("zswap: make shrinking memcg-aware"), which will be removed very soon. Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 6 --- mm/swap.h | 15 +++++++- mm/swap_state.c | 105 ++++++++++++++++++++++++++++-------------------= ---- mm/swapfile.c | 39 ++++++++++++------- mm/vmscan.c | 1 - 5 files changed, 96 insertions(+), 70 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 936fa8f9e5f3..69025b473472 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -458,7 +458,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t en= try); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern int swap_duplicate_nr(swp_entry_t entry, int nr); -extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); @@ -518,11 +517,6 @@ static inline int swap_duplicate_nr(swp_entry_t swp, i= nt nr_pages) return 0; } =20 -static inline int swapcache_prepare(swp_entry_t swp, int nr) -{ - return 0; -} - static inline void swap_free_nr(swp_entry_t entry, int nr_pages) { } diff --git a/mm/swap.h b/mm/swap.h index e0f05babe13a..b5075a1aee04 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -234,6 +234,14 @@ static inline bool folio_matches_swap_entry(const stru= ct folio *folio, return folio_entry.val =3D=3D round_down(entry.val, nr_pages); } =20 +/* Temporary internal helpers */ +void __swapcache_set_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry); +void __swapcache_clear_cached(struct swap_info_struct *si, + struct 
swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr); + /* * All swap cache helpers below require the caller to ensure the swap entr= ies * used are valid and stablize the device by any of the following ways: @@ -247,7 +255,8 @@ static inline bool folio_matches_swap_entry(const struc= t folio *folio, */ struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); -void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **s= hadow); +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadow, bool alloc); void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, @@ -413,8 +422,10 @@ static inline void *swap_cache_get_shadow(swp_entry_t = entry) return NULL; } =20 -static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t e= ntry, void **shadow) +static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t en= try, + void **shadow, bool alloc) { + return -ENOENT; } =20 static inline void swap_cache_del_folio(struct folio *folio) diff --git a/mm/swap_state.c b/mm/swap_state.c index a99411b70c99..847763c6dd4a 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -128,34 +128,64 @@ void *swap_cache_get_shadow(swp_entry_t entry) * @entry: The swap entry corresponding to the folio. * @gfp: gfp_mask for XArray node allocation. * @shadowp: If a shadow is found, return the shadow. + * @alloc: If it's the allocator that is trying to insert a folio. Allocat= or + * sets SWAP_HAS_CACHE to pin slots before insert so skip map upda= te. * * Context: Caller must ensure @entry is valid and protect the swap device * with reference count or locks. - * The caller also needs to update the corresponding swap_map slots with - * SWAP_HAS_CACHE bit to avoid race or conflict. 
*/ -void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **s= hadowp) +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadowp, bool alloc) { + int err; void *shadow =3D NULL; + struct swap_info_struct *si; unsigned long old_tb, new_tb; struct swap_cluster_info *ci; - unsigned int ci_start, ci_off, ci_end; + unsigned int ci_start, ci_off, ci_end, offset; unsigned long nr_pages =3D folio_nr_pages(folio); =20 VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); =20 + si =3D __swap_entry_to_info(entry); new_tb =3D folio_to_swp_tb(folio); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; ci_off =3D ci_start; - ci =3D swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry)); + offset =3D swp_offset(entry); + ci =3D swap_cluster_lock(si, swp_offset(entry)); + if (unlikely(!ci->table)) { + err =3D -ENOENT; + goto failed; + } do { - old_tb =3D __swap_table_xchg(ci, ci_off, new_tb); - WARN_ON_ONCE(swp_tb_is_folio(old_tb)); + old_tb =3D __swap_table_get(ci, ci_off); + if (unlikely(swp_tb_is_folio(old_tb))) { + err =3D -EEXIST; + goto failed; + } + if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset))= )) { + err =3D -ENOENT; + goto failed; + } if (swp_tb_is_shadow(old_tb)) shadow =3D swp_tb_to_shadow(old_tb); + offset++; + } while (++ci_off < ci_end); + + ci_off =3D ci_start; + offset =3D swp_offset(entry); + do { + /* + * Still need to pin the slots with SWAP_HAS_CACHE since + * swap allocator depends on that. 
+ */ + if (!alloc) + __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); + __swap_table_set(ci, ci_off, new_tb); + offset++; } while (++ci_off < ci_end); =20 folio_ref_add(folio, nr_pages); @@ -168,6 +198,11 @@ void swap_cache_add_folio(struct folio *folio, swp_ent= ry_t entry, void **shadowp =20 if (shadowp) *shadowp =3D shadow; + return 0; + +failed: + swap_cluster_unlock(ci); + return err; } =20 /** @@ -186,6 +221,7 @@ void swap_cache_add_folio(struct folio *folio, swp_entr= y_t entry, void **shadowp void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *fo= lio, swp_entry_t entry, void *shadow) { + struct swap_info_struct *si; unsigned long old_tb, new_tb; unsigned int ci_start, ci_off, ci_end; unsigned long nr_pages =3D folio_nr_pages(folio); @@ -195,6 +231,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *c= i, struct folio *folio, VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio); =20 + si =3D __swap_entry_to_info(entry); new_tb =3D shadow_swp_to_tb(shadow); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; @@ -210,6 +247,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *c= i, struct folio *folio, folio_clear_swapcache(folio); node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages); lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); + __swapcache_clear_cached(si, ci, entry, nr_pages); } =20 /** @@ -231,7 +269,6 @@ void swap_cache_del_folio(struct folio *folio) __swap_cache_del_folio(ci, folio, entry, NULL); swap_cluster_unlock(ci); =20 - put_swap_folio(folio, entry); folio_ref_sub(folio, folio_nr_pages(folio)); } =20 @@ -423,67 +460,37 @@ static struct folio *__swap_cache_prepare_and_add(swp= _entry_t entry, gfp_t gfp, bool charged, bool skip_if_exists) { - struct folio *swapcache; + struct folio *swapcache =3D NULL; void *shadow; int ret; =20 - /* - * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio 
- * into the swap cache. Loop with a schedule delay if raced with - * another process setting SWAP_HAS_CACHE. This hackish loop will - * be fixed very soon. - */ + __folio_set_locked(folio); + __folio_set_swapbacked(folio); for (;;) { - ret =3D swapcache_prepare(entry, folio_nr_pages(folio)); + ret =3D swap_cache_add_folio(folio, entry, &shadow, false); if (!ret) break; =20 /* - * The skip_if_exists is for protecting against a recursive - * call to this helper on the same entry waiting forever - * here because SWAP_HAS_CACHE is set but the folio is not - * in the swap cache yet. This can happen today if - * mem_cgroup_swapin_charge_folio() below triggers reclaim - * through zswap, which may call this helper again in the - * writeback path. - * - * Large order allocation also needs special handling on + * Large order allocation needs special handling on * race: if a smaller folio exists in cache, swapin needs * to fallback to order 0, and doing a swap cache lookup * might return a folio that is irrelevant to the faulting * entry because @entry is aligned down. Just return NULL. */ if (ret !=3D -EEXIST || skip_if_exists || folio_test_large(folio)) - return NULL; + goto failed; =20 - /* - * Check the swap cache again, we can only arrive - * here because swapcache_prepare returns -EEXIST. - */ swapcache =3D swap_cache_get_folio(entry); if (swapcache) - return swapcache; - - /* - * We might race against __swap_cache_del_folio(), and - * stumble across a swap_map entry whose SWAP_HAS_CACHE - * has not yet been cleared. Or race against another - * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE - * in swap_map, but not yet added its folio to swap cache. 
- */ - schedule_timeout_uninterruptible(1); + goto failed; } =20 - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) { - put_swap_folio(folio, entry); - folio_unlock(folio); - return NULL; + swap_cache_del_folio(folio); + goto failed; } =20 - swap_cache_add_folio(folio, entry, &shadow); memcg1_swapin(entry, folio_nr_pages(folio)); if (shadow) workingset_refault(folio, shadow); @@ -491,6 +498,10 @@ static struct folio *__swap_cache_prepare_and_add(swp_= entry_t entry, /* Caller will initiate read into locked folio */ folio_add_lru(folio); return folio; + +failed: + folio_unlock(folio); + return swapcache; } =20 /** diff --git a/mm/swapfile.c b/mm/swapfile.c index 5b173ef0bc74..567aea6f1cd4 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1472,7 +1472,11 @@ int folio_alloc_swap(struct folio *folio) if (!entry.val) return -ENOMEM; =20 - swap_cache_add_folio(folio, entry, NULL); + /* + * Allocator has pinned the slots with SWAP_HAS_CACHE + * so it should never fail + */ + WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); =20 return 0; =20 @@ -1578,9 +1582,8 @@ static unsigned char swap_entry_put_locked(struct swa= p_info_struct *si, * do_swap_page() * ... swapoff+swapon * swap_cache_alloc_folio() - * swapcache_prepare() - * __swap_duplicate() - * // check swap_map + * swap_cache_add_folio() + * // check swap_map * // verify PTE not changed * * In __swap_duplicate(), the swap_map need to be checked before @@ -3764,17 +3767,25 @@ int swap_duplicate_nr(swp_entry_t entry, int nr) return err; } =20 -/* - * @entry: first swap entry from which we allocate nr swap cache. - * - * Called when allocating swap cache for existing swap entries, - * This can return error codes. Returns 0 at success. - * -EEXIST means there is a swap cache. - * Note: return code is different from swap_duplicate(). 
- */ -int swapcache_prepare(swp_entry_t entry, int nr) +/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */ +void __swapcache_set_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry) +{ + WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1)); +} + +/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock = */ +void __swapcache_clear_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr) { - return __swap_duplicate(entry, SWAP_HAS_CACHE, nr); + if (swap_only_has_cache(si, swp_offset(entry), nr)) { + swap_entries_free(si, ci, entry, nr); + } else { + for (int i =3D 0; i < nr; i++, entry.val++) + swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + } } =20 /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 3b85652a42b9..9483267ebf70 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -761,7 +761,6 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, __swap_cache_del_folio(ci, folio, swap, shadow); memcg1_swapout(folio, swap); swap_cluster_unlock_irq(ci); - put_swap_folio(folio, swap); } else { void (*free_folio)(struct folio *); =20 --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f179.google.com (mail-pf1-f179.google.com [209.85.210.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 082F12DEA94 for ; Mon, 24 Nov 2025 19:16:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011804; cv=none; b=oZSesRqjcqk4vIK32FKr3svk02mlNWjRC+Wc33MWM7rHO6WZfyFJ7J1aW5BBhDIQrm+CU29jmtzTp69FDphBGYDsNPeYFIvBx86De6Tv6e5F8XPM+nuCT4wEqQ7m1bPUkxxTfvzTopGLzbH+YcRO5t5JGFGGGcjXKU0J+e7M0sI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; 
t=1764011804; c=relaxed/simple; bh=3fIBJX96gJ3NGGMqm/+7n50xJ1aualCQQmrO0cYmagU=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=MBtpXrfZU4fXbSJJEnKpeuugGvbkdMiCyOzmTNundIcsCMcLVITM3yfP1HIGKtzbL2OitSGE/SsF8N3voFORehjVSL8hQQ6yhmNrp9hXoKUEN4js8+1a4gOpOTTYkMJPju3kKwCRsEftyLtZ8QJEQkfWbJS0Ik/9RvsHqcNzQO4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=HZnyjlSX; arc=none smtp.client-ip=209.85.210.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="HZnyjlSX" Received: by mail-pf1-f179.google.com with SMTP id d2e1a72fcca58-7ba49f92362so2731849b3a.1 for ; Mon, 24 Nov 2025 11:16:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011802; x=1764616602; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=yWRzV1WRkV60aRgwV/nuINMmHJf95UsBR54eymIys8E=; b=HZnyjlSXeocUeFekOpEprFfOLyBO7Rq4LhGPV1A54kbfV+4fseYOrs+Ky7Af0q+/lg Y2NJgGdREFIsre0m/uhqQo3znnilU6i8lmfKFo5i9Oo9gjbXuim9Sj7stopYrnzus9n6 FrDqSNBzoA4RNW7rF7COpuyFfffeNiknpXVcw0OdQz0dXDKa0uR0HhIBfAg9nVJkHKNs KHXBPlDd3ewn2v3MD/TP+pYsyPuv461lm3vTeyf6SgjDHz8+azYNOX8f5QBoPPhtCkIP Iz0cDiO2941LjwkqtkJvpqMBpHdPNjuRP41SYRJta/9ad5Hy5xk6AplS+1CNKvQDuktK 94Og== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011802; x=1764616602; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to 
:cc:subject:date:message-id:reply-to; bh=yWRzV1WRkV60aRgwV/nuINMmHJf95UsBR54eymIys8E=; b=SRnXNFtBHjFq4g7HeF3cKGyUttAsmTAIq7lJXlxitFh17sLPAqj5KVrkMIEzMURBJq R3XfuTadSSzhLar7niJ8ZXRwSXIw2EZ5Jj1wsF3vWaTIcqvm4Kk3+90zHfRlHgGgEjmV TrlsC3AZh74O24MJDcCddBVb+yIE/jBAt+74yo0FCPRKf7WoEdenjA2IB3wE+sbzWjdD 4GA9vo7XyVaeS/BO0UHKqdU4wD79B8GL2XD37Vk9tQwFcI6DlZsAs80IcoIluBzmgHQr DaBozvPebt1ps7zngDW/FoyR17WEadeBNoFarF+aP9OCbtpARkQwXerqhfsVYX6PIKCP IGeQ== X-Forwarded-Encrypted: i=1; AJvYcCWOnfZz+z16Q5pYa74/X6O1FSh0YCW+AsjlUsAtfUT4resGB/DyGcnsBSx5+XZrTWXacvCFyjpg2JBAPIg=@vger.kernel.org X-Gm-Message-State: AOJu0YyTlIuCeVhR39gXICEVCjksIpp5r7KY1zAVIpb2OjPXxhO2kgGW j3cgoBPMVVKadtGFmVwdNDjxWdzV47aiUO6h94BD02uXH99DXsdPb3Af X-Gm-Gg: ASbGncuwgLT99vH8FFNre6QSLWqEbVul1gn5eTs0lA8nphjm9TObZ1UDQxnKIX8/DCa bHiDPFCdJTpyNSrpPvJ+eQC6/VwzHyXCSRZnMoBs+7oGmuux4zNxVF56HJnCxFGaxv3GkmNDcjG 2QUVwa1Q0sm/Uj4YAf1FwXOJ8UH1asPDHypZT1R8bwrQxUIYGZzsfOsWQ2Q2iMi3AKAYN9TcXSn VeTobewZfd1kDJYTrZzK5T4cEYQfSawMiNOMM/iaDky9iG4DX6VKd18msaqFEpjdH5iO+yhAicI TPUepBizMw87NlLmC3ci4qeE/Qa4xe+6si+hz9rcr1lhhwzTu1aE/CHFr/IjnOeZwFo8GzjfYAT G5DG0BZte9I9O62z1XvSUR71eRJhSsnTJljCd5Un6bjPJ2UaM+B4uBVchq712VqW0uVf9hir34S E3KkZjq2zVIYkm/RhrvXJA5zfL/ELJ0liwiGfyR/v5hixhpzJz X-Google-Smtp-Source: AGHT+IF6oDkF29fP7Zgad5PE+BeKohxldhqsbyS6qNXAnkrppKlg9jNj64G56E7F3aiZ6PnkAZD+OQ== X-Received: by 2002:a05:6a21:3290:b0:2ab:a456:9b09 with SMTP id adf61e73a8af0-3614f4a0c7fmr15516656637.15.1764011802296; Mon, 24 Nov 2025 11:16:42 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.37 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:41 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:56 +0800 Subject: [PATCH v3 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 
Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-13-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=7032; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=EqyErHrG43rTclh/Dj/hWJSWApsejLOsmiO76lsWvo8=; b=/oih483LAH2Cg8yPDBqVttvrPiriRWEQTCTk58Q6x8+VMqd3wNHzP3ITe0bAg2xv7661yfmZf ZEEkXZkG255BX/sg5BxhXbn68OYafTCP3y4/s7htW+ddB/tWFmz4Q9O X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song

Remove the "skip if exists" check from commit a65b0e7607ccb ("zswap: make shrinking memcg-aware"). It was needed because there was a tiny time window between setting the SWAP_HAS_CACHE bit and actually adding the folio to the swap cache. If a user tries to add a folio to the swap cache while another user has set SWAP_HAS_CACHE but was interrupted before adding its folio to the swap cache, it might lead to a deadlock.

We have moved the bit setting to the same critical section as adding the folio, so this is no longer needed. Remove it and clean it up.
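
To make the "same critical section" point concrete, below is a minimal userspace sketch in C of a validate-then-commit insert over a range of slots. All names and values (`model_entry`, `model_add_range()`, `MODEL_HAS_CACHE`, `MODEL_COUNT_MASK`) are assumptions made for illustration, not the kernel's definitions, and the cluster lock is only indicated in comments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model values and names; not the kernel's definitions. */
#define MODEL_HAS_CACHE  0x40u
#define MODEL_COUNT_MASK 0x3fu

struct model_entry {
	unsigned char map;   /* swap count in the low bits, plus the pin flag */
	const void *folio;   /* NULL while nothing is cached for this slot */
};

static int model_add_range(struct model_entry *tbl, size_t start, size_t nr,
			   const void *folio)
{
	size_t i;

	/* The cluster lock would be held from here until the return. */

	/* Pass 1: validate every slot and modify nothing. */
	for (i = start; i < start + nr; i++) {
		if (tbl[i].folio)
			return -EEXIST;
		if ((tbl[i].map & MODEL_COUNT_MASK) == 0)
			return -ENOENT;
	}

	/* Pass 2: set the pin flag and the folio together. */
	for (i = start; i < start + nr; i++) {
		tbl[i].map |= MODEL_HAS_CACHE;
		tbl[i].folio = folio;
	}
	return 0;
}

/* The property the model keeps: the flag is set iff a folio is present. */
static bool model_consistent(const struct model_entry *tbl, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		bool pinned = (tbl[i].map & MODEL_HAS_CACHE) != 0;

		if (pinned != (tbl[i].folio != NULL))
			return false;
	}
	return true;
}
```

Because a failed call returns before pass 2, no observer in this model can see a slot that is flagged as cached but has no folio, which is the window the removed workaround existed for.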
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swap.h | 2 +- mm/swap_state.c | 27 ++++++++++----------------- mm/zswap.c | 2 +- 3 files changed, 12 insertions(+), 19 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index b5075a1aee04..6777b2ab9d92 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -260,7 +260,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry= _t entry, void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, - bool *alloced, bool skip_if_exists); + bool *alloced); /* Below helpers require the caller to lock and pass in the swap cluster. = */ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow); diff --git a/mm/swap_state.c b/mm/swap_state.c index 847763c6dd4a..c29b7e386a7c 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -445,8 +445,6 @@ void swap_update_readahead(struct folio *folio, struct = vm_area_struct *vma, * @folio: folio to be added. * @gfp: memory allocation flags for charge, can be 0 if @charged if true. * @charged: if the folio is already charged. - * @skip_if_exists: if the slot is in a cached state, return NULL. - * This is an old workaround that will be removed shortly. * * Update the swap_map and add folio as swap cache, typically before swapi= n. * All swap slots covered by the folio must have a non-zero swap count. @@ -457,8 +455,7 @@ void swap_update_readahead(struct folio *folio, struct = vm_area_struct *vma, */ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry, struct folio *folio, - gfp_t gfp, bool charged, - bool skip_if_exists) + gfp_t gfp, bool charged) { struct folio *swapcache =3D NULL; void *shadow; @@ -478,7 +475,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, * might return a folio that is irrelevant to the faulting * entry because @entry is aligned down. Just return NULL. 
*/ - if (ret !=3D -EEXIST || skip_if_exists || folio_test_large(folio)) + if (ret !=3D -EEXIST || folio_test_large(folio)) goto failed; =20 swapcache =3D swap_cache_get_folio(entry); @@ -511,8 +508,6 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, * @mpol: NUMA memory allocation policy to be applied * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE * @new_page_allocated: sets true if allocation happened, false otherwise - * @skip_if_exists: if the slot is a partially cached state, return NULL. - * This is a workaround that would be removed shortly. * * Allocate a folio in the swap cache for one swap slot, typically before * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by @@ -525,8 +520,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, */ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx, - bool *new_page_allocated, - bool skip_if_exists) + bool *new_page_allocated) { struct swap_info_struct *si =3D __swap_entry_to_info(entry); struct folio *folio; @@ -547,8 +541,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry,= gfp_t gfp_mask, if (!folio) return NULL; /* Try add the new folio, returns existing folio or NULL on failure. 
*/ - result =3D __swap_cache_prepare_and_add(entry, folio, gfp_mask, - false, skip_if_exists); + result =3D __swap_cache_prepare_and_add(entry, folio, gfp_mask, false); if (result =3D=3D folio) *new_page_allocated =3D true; else @@ -577,7 +570,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct fo= lio *folio) unsigned long nr_pages =3D folio_nr_pages(folio); =20 entry =3D swp_entry(swp_type(entry), round_down(offset, nr_pages)); - swapcache =3D __swap_cache_prepare_and_add(entry, folio, 0, true, false); + swapcache =3D __swap_cache_prepare_and_add(entry, folio, 0, true); if (swapcache =3D=3D folio) swap_read_folio(folio, NULL); return swapcache; @@ -605,7 +598,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, = gfp_t gfp_mask, =20 mpol =3D get_vma_policy(vma, addr, 0, &ilx); folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); mpol_cond_put(mpol); =20 if (page_allocated) @@ -724,7 +717,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, /* Ok, do the async read-ahead now */ folio =3D swap_cache_alloc_folio( swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); if (!folio) continue; if (page_allocated) { @@ -742,7 +735,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, skip: /* The page was likely read above, so no need for plugging here */ folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); if (unlikely(page_allocated)) swap_read_folio(folio, NULL); return folio; @@ -847,7 +840,7 @@ static struct folio *swap_vma_readahead(swp_entry_t tar= g_entry, gfp_t gfp_mask, continue; } folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); if (si) put_swap_device(si); if (!folio) @@ -869,7 +862,7 @@ static struct folio *swap_vma_readahead(swp_entry_t tar= g_entry, gfp_t gfp_mask, skip: /* The 
folio was likely read above, so no need for plugging here */ folio =3D swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx, - &page_allocated, false); + &page_allocated); if (unlikely(page_allocated)) swap_read_folio(folio, NULL); return folio; diff --git a/mm/zswap.c b/mm/zswap.c index a7a2443912f4..d8a33db9d3cc 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1015,7 +1015,7 @@ static int zswap_writeback_entry(struct zswap_entry *= entry, =20 mpol =3D get_task_policy(current); folio =3D swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol, - NO_INTERLEAVE_INDEX, &folio_was_allocated, true); + NO_INTERLEAVE_INDEX, &folio_was_allocated); put_swap_device(si); if (!folio) return -ENOMEM; --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f179.google.com (mail-pf1-f179.google.com [209.85.210.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B37F22DF13E for ; Mon, 24 Nov 2025 19:16:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011811; cv=none; b=CK0P6kHq9xh3BqQwaVLeuzDeAv0Zm2VhP4ibr/QDGKlV0drGhBChXDQNrCxzpPlW1FvoIckaWTsASdIq17smZspSat8Nbbi33uQ+D2R8cmWCK9cKemYbOEAFoo/QobShM/EVa9MB72lkmhhBAW+74LDs3l33e3eZHazT3T+9EZQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011811; c=relaxed/simple; bh=lTh6B7Ex9x0tOzF2ozOxyUNMNChwTEZFKcZw1n835nU=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=qb5tSm6UjUHpu6Uq0TU09zW6cc2nDw7iCpUvIV3qb9xgV0gZOH1AefzA9wOU+VPK6QPCsB2BBFMKzgX+BBESp9H1WLO0DTVmsut3usjshVxQqOXR6pOImU9VnGafibVkXacIWFNP3tue+X+YtzdXVtnPjDVf76yb6XI3ARIeELo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) 
header.d=gmail.com header.i=@gmail.com header.b=hzDMFNx/; arc=none smtp.client-ip=209.85.210.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="hzDMFNx/" Received: by mail-pf1-f179.google.com with SMTP id d2e1a72fcca58-7acd9a03ba9so5267661b3a.1 for ; Mon, 24 Nov 2025 11:16:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011808; x=1764616608; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=5f5xjA1yGhalrWBJZWk0baG5zYyCIMSCKyCOd4PyvhQ=; b=hzDMFNx/4C3yZdZP33YrFmNANLdrUw1pyop0yO8GHDXWVpLrM8JTgjxn6iU3uRyNwu m4ud03ginG7CNrZSg0IeE1b1OzPOnOrU+Pq9fw9rXTMXXRavhKvKcl4HmSmo0zrxgtRY ow3GajG6LBgvxvSBmDSMmCN3pJYRU2qm3+3ldYovkR4BpGt9Jf6E65xyBxVBQyZ93UEP wAv7Jj5zubRc0ikWVj/PiJl2q4Y7UZFxWgcjF6R0NSk4B2XXjPqjp2c5Jb8WEUpIVVQq TDfxDonFas5TinHn5RWAXRLJlPEKzYgQJbBhJsfqDIgFXcEDf0GqfZhsBrMj19xVYdjT f81g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011808; x=1764616608; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=5f5xjA1yGhalrWBJZWk0baG5zYyCIMSCKyCOd4PyvhQ=; b=SbgaI3UxYepdsTCLSAPDpCFDSdKtguBYnfFgrOTGoEIVr7D/0FFOq4HRaO2Ksd4hJv NBmDX31bhKb6E7VQmo81Jlw/iwU5mGwwZoj7Y/ygIfkpTsbDo368sKi/y4D6GeVnYsG9 evA41BUdfgvj5gBTqdAzxAcs5lxezrZL6ilao+wktDPssxc82ZaueM7nwSJAuV9tlkdg zm55ZbHFi5TfwuWYbYinGOumaL3Nb0RRrXbbgTJg9fM2LED+JYxQ90EEbstOQxkIop10 XyeHitQu9ChPbxg9kh/++DAfJPGda0OeKXNjNu+0uenAltBd0QSs0Yvg/kpjO2N9cCJK eOXw== X-Forwarded-Encrypted: i=1; 
AJvYcCU1PBkVd56A2QxFarTbjbbXxR2uakRPqY8yBIs36DhnaNFkS08739D6C2bmMOv9ZsUuAbU5OzIFsjPo5SY=@vger.kernel.org X-Gm-Message-State: AOJu0YxCyKPqyDGehtWnFlCWdZ+UsId7ykE0Ldgvmiu5LDWwwmE+zZbA lUUW4CVJpAFEEWbuulBM14OYgR6zQLeZMeDthF680mfcE+udlQZOTB/l X-Gm-Gg: ASbGnct9wlTGRkXbFG5cUq1ac7IBDsHrxfG4Odjk8fHM4+MpN2goC5yxkweZVZogK0e TSMbgwDhTcE9HBxf2YUnTGxUnyBEBx8h4k7GkA8n65AL3v/fDLUng5M4UDfSkPPQmh+xQJ4k4uu kNZA3qe5Mtrqj4GCV3yul9qFXq1bhRWqO4x4VNnPZv2xyn/1neESYVVu1SrfZ7Idk6RD17BEPGw QbFykrGgWy8WCZa0Z/ttR23NAwLc51aGgNOcOqtieb0CF5T/TZsNBBJhFMNGaBLemBUDtCBokN7 jG2p2NqvgMUYqxQookk3I7oxgIQUB2wV6I3kVnU00ozisTkaHiB/IFKiFAN/xnnvb94PQy+bBqW d9aBiU+yviCsAwcDaOi6yIVwcWmbsyupLqdoj2bSGyGV+sMmi9VRYfKej49XsMQIKcXKT6wlwjd 5a7C82YNmP1Jep5BkAGn5tz8eE568RTKks4KG8rtD9gmlnQfHvNxtNdpdhqSA= X-Google-Smtp-Source: AGHT+IH6nlAIdovoOO9JPBVa24Ed1feEH7IXLB+tqnSEMSNpkIEfoQZF0oNcldwxw4a5gXb6MOoxqw== X-Received: by 2002:a05:6a20:7d8b:b0:35d:b5a1:a61d with SMTP id adf61e73a8af0-3614eb58328mr14450508637.26.1764011807627; Mon, 24 Nov 2025 11:16:47 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:47 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:57 +0800 Subject: [PATCH v3 14/19] mm, swap: cleanup swap entry management workflow Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-14-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying 
Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song , linux-pm@vger.kernel.org X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=27851; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=TNvJgTmp4p+U1cGLuby5av+dCpBDzm0thwIGxegn2XA=; b=BwDa/rDFxBt/46/mi3X17p+r96GUJcDxA4DMsqb4RLNnn9HTwIN4Qfwj0wy4ZSYSxPNBJo5zF GTUgQaGMdbNDlYy+CVtAUmX4r+SiTi15Tsq4bzjc07Xe1n4U2zNXy7D X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song

The current swap entry allocation/freeing workflow has never had a clear definition. This makes it hard to debug or add new optimizations.

This commit introduces a proper definition of how swap entries are allocated and freed. Most operations are now folio based, so they never exceed one swap cluster, and there is a cleaner border between swap and the rest of mm. That makes the code much easier to follow and debug, especially with the newly added sanity checks, and makes more optimizations possible.

Swap entries will mostly be allocated and freed with a folio bound. The folio lock will be useful for resolving many swap-related races.

Now swap allocation (except hibernation) always starts with a folio in the swap cache, and gets duped/freed protected by the folio lock:

- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.
  This allocates one or a set of contiguous swap slots for a folio and binds them to the folio by adding the folio to the swap cache. The swap slots' swap count starts at zero.

- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL).
  This increases the ref count of swap entries allocated to a folio.
Newly allocated swap slots' count has to be increased by this helper as the folio gets unmapped (and swap entries get installed).

- folio_put_swap() - Decrease the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This decreases the ref count of swap entries allocated to a folio. Typically, swapin will decrease the swap count as the folio gets installed back and the swap entry gets uninstalled. This won't remove the folio from the swap cache or free the slot. Lazy freeing of swap cache is helpful for reducing IO. There is already a folio_free_swap() for immediate cache reclaim. This part could be further optimized later.

The above locking constraints could be further relaxed when the swap table is fully implemented. Currently dup still needs the caller to lock the swap entry container (e.g., PTL), or a concurrent zap may underflow the swap count.

Some swap users need to interact with the swap count without involving a folio (e.g., forking/zapping the page table or mapping truncate without swapin). In such cases, the caller has to ensure there is no race condition on whatever owns the swap count and use the helpers below:

- swap_put_entries_direct() - Decrease the swap count directly. Context: The caller must lock whatever is referencing the slots to avoid a race. Typically the page table zapping or shmem mapping truncate will need to free swap slots directly. If a slot is cached (has a folio bound), this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly. Context: The caller must lock whatever is referencing the entries to avoid a race, and the entries must already have a swap count > 1. Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table.
The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed. Hibernation subsystem is a bit different, so two special wrappers are here: - swap_alloc_hibernation_slot() - Allocate one entry from one device. - swap_free_hibernation_slot() - Free one entry allocated by the above helper. All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines. By separating the workflows, it will be possible to bind folio more tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary pin. This commit should not introduce any behavior change Cc: linux-pm@vger.kernel.org Signed-off-by: Kairui Song Acked-by: Rafael J. Wysocki (Intel) Suggested-by: Chris Li --- arch/s390/mm/gmap_helpers.c | 2 +- arch/s390/mm/pgtable.c | 2 +- include/linux/swap.h | 58 ++++++++--------- kernel/power/swap.c | 10 +-- mm/madvise.c | 2 +- mm/memory.c | 15 +++-- mm/rmap.c | 7 ++- mm/shmem.c | 10 +-- mm/swap.h | 37 +++++++++++ mm/swapfile.c | 148 +++++++++++++++++++++++++++++++---------= ---- 10 files changed, 193 insertions(+), 98 deletions(-) diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c index 549f14ad08af..c3f56a096e8c 100644 --- a/arch/s390/mm/gmap_helpers.c +++ b/arch/s390/mm/gmap_helpers.c @@ -32,7 +32,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm,= softleaf_t entry) dec_mm_counter(mm, MM_SWAPENTS); else if (softleaf_is_migration(entry)) dec_mm_counter(mm, mm_counter(softleaf_to_folio(entry))); - free_swap_and_cache(entry); + swap_put_entries_direct(entry, 1); } =20 /** diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index d670bfb47d9b..c3fa94a6ec15 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -692,7 +692,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *m= m, softleaf_t entry) =20 dec_mm_counter(mm, mm_counter(folio)); } - free_swap_and_cache(entry); + swap_put_entries_direct(entry, 
1); } =20 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr, diff --git a/include/linux/swap.h b/include/linux/swap.h index 69025b473472..ac3caa4c6999 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void) } =20 extern void si_swapinfo(struct sysinfo *); -int folio_alloc_swap(struct folio *folio); -bool folio_free_swap(struct folio *folio); void put_swap_folio(struct folio *folio, swp_entry_t entry); -extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern int swap_duplicate_nr(swp_entry_t entry, int nr); -extern void swap_free_nr(swp_entry_t entry, int nr_pages); -extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); extern unsigned int count_swap_pages(int, int); @@ -472,6 +466,29 @@ struct backing_dev_info; extern struct swap_info_struct *get_swap_device(swp_entry_t entry); sector_t swap_folio_sector(struct folio *folio); =20 +/* + * If there is an existing swap slot reference (swap entry) and the caller + * guarantees that there is no race modification of it (e.g., PTL + * protecting the swap entry in page table; shmem's cmpxchg protects t + * he swap entry in shmem mapping), these two helpers below can be used + * to put/dup the entries directly. + * + * All entries must be allocated by folio_alloc_swap(). And they must have + * a swap count > 1. See comments of folio_*_swap helpers for more info. + */ +int swap_dup_entry_direct(swp_entry_t entry); +void swap_put_entries_direct(swp_entry_t entry, int nr); + +/* + * folio_free_swap tries to free the swap entries pinned by a swap cache + * folio, it has to be here to be called by other components. 
+ */ +bool folio_free_swap(struct folio *folio); + +/* Allocate / free (hibernation) exclusive entries */ +swp_entry_t swap_alloc_hibernation_slot(int type); +void swap_free_hibernation_slot(swp_entry_t entry); + static inline void put_swap_device(struct swap_info_struct *si) { percpu_ref_put(&si->users); @@ -499,10 +516,6 @@ static inline void put_swap_device(struct swap_info_st= ruct *si) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr)); =20 -static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr) -{ -} - static inline void free_swap_cache(struct folio *folio) { } @@ -512,12 +525,12 @@ static inline int add_swap_count_continuation(swp_ent= ry_t swp, gfp_t gfp_mask) return 0; } =20 -static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages) +static inline int swap_dup_entry_direct(swp_entry_t ent) { return 0; } =20 -static inline void swap_free_nr(swp_entry_t entry, int nr_pages) +static inline void swap_put_entries_direct(swp_entry_t ent, int nr) { } =20 @@ -541,11 +554,6 @@ static inline int swp_swapcount(swp_entry_t entry) return 0; } =20 -static inline int folio_alloc_swap(struct folio *folio) -{ - return -EINVAL; -} - static inline bool folio_free_swap(struct folio *folio) { return false; @@ -558,22 +566,6 @@ static inline int add_swap_extent(struct swap_info_str= uct *sis, return -EINVAL; } #endif /* CONFIG_SWAP */ - -static inline int swap_duplicate(swp_entry_t entry) -{ - return swap_duplicate_nr(entry, 1); -} - -static inline void free_swap_and_cache(swp_entry_t entry) -{ - free_swap_and_cache_nr(entry, 1); -} - -static inline void swap_free(swp_entry_t entry) -{ - swap_free_nr(entry, 1); -} - #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/kernel/power/swap.c b/kernel/power/swap.c index 0beff7eeaaba..546a0c701970 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -179,10 +179,10 @@ sector_t alloc_swapdev_block(int swap) { unsigned long offset; 
=20 - offset =3D swp_offset(get_swap_page_of_type(swap)); + offset =3D swp_offset(swap_alloc_hibernation_slot(swap)); if (offset) { if (swsusp_extents_insert(offset)) - swap_free(swp_entry(swap, offset)); + swap_free_hibernation_slot(swp_entry(swap, offset)); else return swapdev_block(swap, offset); } @@ -197,6 +197,7 @@ sector_t alloc_swapdev_block(int swap) =20 void free_all_swap_pages(int swap) { + unsigned long offset; struct rb_node *node; =20 while ((node =3D swsusp_extents.rb_node)) { @@ -204,8 +205,9 @@ void free_all_swap_pages(int swap) =20 ext =3D rb_entry(node, struct swsusp_extent, node); rb_erase(node, &swsusp_extents); - swap_free_nr(swp_entry(swap, ext->start), - ext->end - ext->start + 1); + + for (offset =3D ext->start; offset < ext->end; offset++) + swap_free_hibernation_slot(swp_entry(swap, offset)); =20 kfree(ext); } diff --git a/mm/madvise.c b/mm/madvise.c index b617b1be0f53..7cd69a02ce84 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -694,7 +694,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned = long addr, max_nr =3D (end - addr) / PAGE_SIZE; nr =3D swap_pte_batch(pte, max_nr, ptent); nr_swap -=3D nr; - free_swap_and_cache_nr(entry, nr); + swap_put_entries_direct(entry, nr); clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); } else if (softleaf_is_hwpoison(entry) || softleaf_is_poison_marker(entry)) { diff --git a/mm/memory.c b/mm/memory.c index ce9f56f77ae5..d89946ad63ec 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -934,7 +934,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm= _struct *src_mm, struct page *page; =20 if (likely(softleaf_is_swap(entry))) { - if (swap_duplicate(entry) < 0) + if (swap_dup_entry_direct(entry) < 0) return -EIO; =20 /* make sure dst_mm is on swapoff's mmlist. 
*/ @@ -1744,7 +1744,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gath= er *tlb, =20 nr =3D swap_pte_batch(pte, max_nr, ptent); rss[MM_SWAPENTS] -=3D nr; - free_swap_and_cache_nr(entry, nr); + swap_put_entries_direct(entry, nr); } else if (softleaf_is_migration(entry)) { struct folio *folio =3D softleaf_to_folio(entry); =20 @@ -4933,7 +4933,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* * Some architectures may have to restore extra metadata to the page * when reading from swap. This metadata may be indexed by swap entry - * so this must be called before swap_free(). + * so this must be called before folio_put_swap(). */ arch_swap_restore(folio_swap(entry, folio), folio); =20 @@ -4971,6 +4971,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (unlikely(folio !=3D swapcache)) { folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); + folio_put_swap(swapcache, NULL); } else if (!folio_test_anon(folio)) { /* * We currently only expect !anon folios that are fully @@ -4979,9 +4980,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) !=3D nr_pages, folio); VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); + folio_put_swap(folio, NULL); } else { + VM_WARN_ON_ONCE(nr_pages !=3D 1 && nr_pages !=3D folio_nr_pages(folio)); folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, - rmap_flags); + rmap_flags); + folio_put_swap(folio, nr_pages =3D=3D 1 ? page : NULL); } =20 VM_BUG_ON(!folio_test_anon(folio) || @@ -4995,7 +4999,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * Do it after mapping, so raced page faults will likely see the folio * in swap cache and wait on the folio lock. 
*/ - swap_free_nr(entry, nr_pages); if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags)) folio_free_swap(folio); =20 @@ -5005,7 +5008,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * Hold the lock to avoid the swap entry to be reused * until we take the PT lock for the pte_same() check * (to avoid false positives from pte_same). For - * further safety release the lock after the swap_free + * further safety release the lock after the folio_put_swap * so that the swap count won't change under a * parallel locked swapcache. */ diff --git a/mm/rmap.c b/mm/rmap.c index f955f02d570e..f92c94954049 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -82,6 +82,7 @@ #include =20 #include "internal.h" +#include "swap.h" =20 static struct kmem_cache *anon_vma_cachep; static struct kmem_cache *anon_vma_chain_cachep; @@ -2148,7 +2149,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, goto discard; } =20 - if (swap_duplicate(entry) < 0) { + if (folio_dup_swap(folio, subpage) < 0) { set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } @@ -2159,7 +2160,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, * so we'll not check/care. */ if (arch_unmap_one(mm, vma, address, pteval) < 0) { - swap_free(entry); + folio_put_swap(folio, subpage); set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } @@ -2167,7 +2168,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, /* See folio_try_share_anon_rmap(): clear PTE first. 
*/ if (anon_exclusive && folio_try_share_anon_rmap_pte(folio, subpage)) { - swap_free(entry); + folio_put_swap(folio, subpage); set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } diff --git a/mm/shmem.c b/mm/shmem.c index eb9bd9241f99..56a690e93cc2 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -971,7 +971,7 @@ static long shmem_free_swap(struct address_space *mappi= ng, old =3D xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0); if (old !=3D radswap) return 0; - free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order); + swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order); =20 return 1 << order; } @@ -1654,7 +1654,7 @@ int shmem_writeout(struct folio *folio, struct swap_i= ocb **plug, spin_unlock(&shmem_swaplist_lock); } =20 - swap_duplicate_nr(folio->swap, nr_pages); + folio_dup_swap(folio, NULL); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); =20 BUG_ON(folio_mapped(folio)); @@ -1675,7 +1675,7 @@ int shmem_writeout(struct folio *folio, struct swap_i= ocb **plug, /* Swap entry might be erased by racing shmem_free_swap() */ if (!error) { shmem_recalc_inode(inode, 0, -nr_pages); - swap_free_nr(folio->swap, nr_pages); + folio_put_swap(folio, NULL); } =20 /* @@ -2161,6 +2161,7 @@ static void shmem_set_folio_swapin_error(struct inode= *inode, pgoff_t index, =20 nr_pages =3D folio_nr_pages(folio); folio_wait_writeback(folio); + folio_put_swap(folio, NULL); swap_cache_del_folio(folio); /* * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks @@ -2168,7 +2169,6 @@ static void shmem_set_folio_swapin_error(struct inode= *inode, pgoff_t index, * in shmem_evict_inode(). 
*/ shmem_recalc_inode(inode, -nr_pages, -nr_pages); - swap_free_nr(swap, nr_pages); } =20 static int shmem_split_large_entry(struct inode *inode, pgoff_t index, @@ -2391,9 +2391,9 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, if (sgp =3D=3D SGP_WRITE) folio_mark_accessed(folio); =20 + folio_put_swap(folio, NULL); swap_cache_del_folio(folio); folio_mark_dirty(folio); - swap_free_nr(swap, nr_pages); put_swap_device(si); =20 *foliop =3D folio; diff --git a/mm/swap.h b/mm/swap.h index 6777b2ab9d92..9ed12936b889 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -183,6 +183,28 @@ static inline void swap_cluster_unlock_irq(struct swap= _cluster_info *ci) spin_unlock_irq(&ci->lock); } =20 +/* + * Below are the core routines for doing swap for a folio. + * All helpers requires the folio to be locked, and a locked folio + * in the swap cache pins the swap entries / slots allocated to the + * folio, swap relies heavily on the swap cache and folio lock for + * synchronization. + * + * folio_alloc_swap(): the entry point for a folio to be swapped + * out. It allocates swap slots and pins the slots with swap cache. + * The slots start with a swap count of zero. + * + * folio_dup_swap(): increases the swap count of a folio, usually + * during it gets unmapped and a swap entry is installed to replace + * it (e.g., swap entry in page table). A swap slot with swap + * count =3D=3D 0 should only be increasd by this helper. + * + * folio_put_swap(): does the opposite thing of folio_dup_swap(). 
+ */ +int folio_alloc_swap(struct folio *folio); +int folio_dup_swap(struct folio *folio, struct page *subpage); +void folio_put_swap(struct folio *folio, struct page *subpage); + /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; @@ -363,9 +385,24 @@ static inline struct swap_info_struct *__swap_entry_to= _info(swp_entry_t entry) return NULL; } =20 +static inline int folio_alloc_swap(struct folio *folio) +{ + return -EINVAL; +} + +static inline int folio_dup_swap(struct folio *folio, struct page *page) +{ + return -EINVAL; +} + +static inline void folio_put_swap(struct folio *folio, struct page *page) +{ +} + static inline void swap_read_folio(struct folio *folio, struct swap_iocb *= *plug) { } + static inline void swap_write_unplug(struct swap_iocb *sio) { } diff --git a/mm/swapfile.c b/mm/swapfile.c index 567aea6f1cd4..7890039d2f65 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -58,6 +58,9 @@ static void swap_entries_free(struct swap_info_struct *si, swp_entry_t entry, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); +static bool swap_entries_put_map(struct swap_info_struct *si, + swp_entry_t entry, int nr); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -1478,6 +1481,12 @@ int folio_alloc_swap(struct folio *folio) */ WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); =20 + /* + * Allocator should always allocate aligned entries so folio based + * operations never crossed more than one cluster. + */ + VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio); + return 0; =20 out_free: @@ -1485,6 +1494,62 @@ int folio_alloc_swap(struct folio *folio) return -ENOMEM; } =20 +/** + * folio_dup_swap() - Increase swap count of swap entries of a folio. 
+ * @folio: folio with swap entries bounded. + * @subpage: if not NULL, only increase the swap count of this subpage. + * + * Context: Caller must ensure the folio is locked and in the swap cache. + * The caller also has to ensure there is no raced call to + * swap_put_entries_direct before this helper returns, or the swap + * map may underflow (TODO: maybe we should allow or avoid underflow to + * make swap refcount lockless). + */ +int folio_dup_swap(struct folio *folio, struct page *subpage) +{ + int err =3D 0; + swp_entry_t entry =3D folio->swap; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); + + if (subpage) { + entry.val +=3D folio_page_idx(folio, subpage); + nr_pages =3D 1; + } + + while (!err && __swap_duplicate(entry, 1, nr_pages) =3D=3D -ENOMEM) + err =3D add_swap_count_continuation(entry, GFP_ATOMIC); + + return err; +} + +/** + * folio_put_swap() - Decrease swap count of swap entries of a folio. + * @folio: folio with swap entries bounded, must be in swap cache and lock= ed. + * @subpage: if not NULL, only decrease the swap count of this subpage. + * + * This won't free the swap slots even if swap count drops to zero, they a= re + * still pinned by the swap cache. User may call folio_free_swap to free t= hem. + * Context: Caller must ensure the folio is locked and in the swap cache. 
+ */ +void folio_put_swap(struct folio *folio, struct page *subpage) +{ + swp_entry_t entry =3D folio->swap; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); + + if (subpage) { + entry.val +=3D folio_page_idx(folio, subpage); + nr_pages =3D 1; + } + + swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages); +} + static struct swap_info_struct *_swap_info_get(swp_entry_t entry) { struct swap_info_struct *si; @@ -1725,28 +1790,6 @@ static void swap_entries_free(struct swap_info_struc= t *si, partial_free_cluster(si, ci); } =20 -/* - * Caller has made sure that the swap device corresponding to entry - * is still around or has not been recycled. - */ -void swap_free_nr(swp_entry_t entry, int nr_pages) -{ - int nr; - struct swap_info_struct *sis; - unsigned long offset =3D swp_offset(entry); - - sis =3D _swap_info_get(entry); - if (!sis) - return; - - while (nr_pages) { - nr =3D min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER= ); - swap_entries_put_map(sis, swp_entry(sis->type, offset), nr); - offset +=3D nr; - nr_pages -=3D nr; - } -} - /* * Called after dropping swapcache to decrease refcnt to swap entries. */ @@ -1935,16 +1978,19 @@ bool folio_free_swap(struct folio *folio) } =20 /** - * free_swap_and_cache_nr() - Release reference on range of swap entries a= nd - * reclaim their cache if no more references re= main. + * swap_put_entries_direct() - Release reference on range of swap entries = and + * reclaim their cache if no more references r= emain. * @entry: First entry of range. * @nr: Number of entries in range. * * For each swap entry in the contiguous range, release a reference. If an= y swap * entries become free, try to reclaim their underlying folios, if present= . The * offset range is defined by [entry.offset, entry.offset + nr). 
+ * + * Context: Caller must ensure there is no race condition on the reference + * owner. e.g., locking the PTL of a PTE containing the entry being releas= ed. */ -void free_swap_and_cache_nr(swp_entry_t entry, int nr) +void swap_put_entries_direct(swp_entry_t entry, int nr) { const unsigned long start_offset =3D swp_offset(entry); const unsigned long end_offset =3D start_offset + nr; @@ -1953,10 +1999,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int n= r) unsigned long offset; =20 si =3D get_swap_device(entry); - if (!si) + if (WARN_ON_ONCE(!si)) return; - - if (WARN_ON(end_offset > si->max)) + if (WARN_ON_ONCE(end_offset > si->max)) goto out; =20 /* @@ -2000,8 +2045,8 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) } =20 #ifdef CONFIG_HIBERNATION - -swp_entry_t get_swap_page_of_type(int type) +/* Allocate a slot for hibernation */ +swp_entry_t swap_alloc_hibernation_slot(int type) { struct swap_info_struct *si =3D swap_type_to_info(type); unsigned long offset; @@ -2029,6 +2074,27 @@ swp_entry_t get_swap_page_of_type(int type) return entry; } =20 +/* Free a slot allocated by swap_alloc_hibernation_slot */ +void swap_free_hibernation_slot(swp_entry_t entry) +{ + struct swap_info_struct *si; + struct swap_cluster_info *ci; + pgoff_t offset =3D swp_offset(entry); + + si =3D get_swap_device(entry); + if (WARN_ON(!si)) + return; + + ci =3D swap_cluster_lock(si, offset); + swap_entry_put_locked(si, ci, entry, 1); + WARN_ON(swap_entry_swapped(si, offset)); + swap_cluster_unlock(ci); + + /* In theory readahead might add it to the swap cache by accident */ + __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); + put_swap_device(si); +} + /* * Find the swap type that corresponds to given device (if any). * @@ -2190,7 +2256,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_= t *pmd, /* * Some architectures may have to restore extra metadata to the page * when reading from swap. 
This metadata may be indexed by swap entry - * so this must be called before swap_free(). + * so this must be called before folio_put_swap(). */ arch_swap_restore(folio_swap(entry, folio), folio); =20 @@ -2231,7 +2297,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_= t *pmd, new_pte =3D pte_mkuffd_wp(new_pte); setpte: set_pte_at(vma->vm_mm, addr, pte, new_pte); - swap_free(entry); + folio_put_swap(folio, page); out: if (pte) pte_unmap_unlock(pte, ptl); @@ -3741,28 +3807,22 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) return err; } =20 -/** - * swap_duplicate_nr() - Increase reference count of nr contiguous swap en= tries - * by 1. - * +/* + * swap_dup_entry_direct() - Increase reference count of a swap entry by o= ne. * @entry: first swap entry from which we want to increase the refcount. - * @nr: Number of entries in range. * * Returns 0 for success, or -ENOMEM if a swap_count_continuation is requi= red * but could not be atomically allocated. Returns 0, just as if it succee= ded, * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), wh= ich * might occur if a page table entry has got corrupted. * - * Note that we are currently not handling the case where nr > 1 and we ne= ed to - * add swap count continuation. This is OK, because no such user exists - = shmem - * is the only user that can pass nr > 1, and it never re-duplicates any s= wap - * entry it owns. + * Context: Caller must ensure there is no race condition on the reference + * owner. e.g., locking the PTL of a PTE containing the entry being increa= sed. 
*/ -int swap_duplicate_nr(swp_entry_t entry, int nr) +int swap_dup_entry_direct(swp_entry_t entry) { int err =3D 0; - - while (!err && __swap_duplicate(entry, 1, nr) =3D=3D -ENOMEM) + while (!err && __swap_duplicate(entry, 1, 1) =3D=3D -ENOMEM) err =3D add_swap_count_continuation(entry, GFP_ATOMIC); return err; } --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f179.google.com (mail-pf1-f179.google.com [209.85.210.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8DA622D94A9 for ; Mon, 24 Nov 2025 19:16:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011815; cv=none; b=VRKF1XYen6otQdyHJznhjsl+/ZCvcDArdnRsNX1BmlbPehIEwEc1ugra2g5E2ePGsxjauCMajfzvmn8lZ6cuRPGvI8NVSOBA0esJ2aFLZs5hWul8ez7CFJpm2Nr/MJ47IFVDmsXhWFDNFYWUbBanWb254Ahi+LHFC5xMO0PikN8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011815; c=relaxed/simple; bh=zNuICNwfmeIL82ZwwGgUJSCGnrZFTpE2skv72HAUrrA=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=RYImPiZx2BYXibtPwp7v57YHJZ2ffbnJY4m4/xVdeEEyP8wT8mcK0VmYXtvV2/K/Aokq4ERWbM0Q4t1A0s9MFvxWHM3qrYvOUjqmIWWycNmwvF92aQf5rtRIUCJezuHXwB+3YpWLZym3wYFO7OZxnBeP64X7rNt+Fs8zH+HxFSI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=QK/wdWmn; arc=none smtp.client-ip=209.85.210.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com 
header.i=@gmail.com header.b="QK/wdWmn" Received: by mail-pf1-f179.google.com with SMTP id d2e1a72fcca58-7ba49f92362so2731969b3a.1 for ; Mon, 24 Nov 2025 11:16:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011813; x=1764616613; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=dW2hSKXTwR8T2CwBP/MRRLrpmfjOQ9zqQiuqar3vsV4=; b=QK/wdWmn4GXr1vRmqIoU1OSvqL0C+q97RI71H+aK7ILsMu+WBvDsaE5x4gUD+vSaxp AzUabTjYktM+2RLmA0uRSQyvBwEtf+ThA7u/nU9eXobvLDuJ8XusgJJmgQJIRFNg5kFB 0xFmzfRm0zpt3AiM8Ndxrsc+nNKiNeXTu8wXQUm5SEAYJrrd6Xz5rU9ewmRLrUyFzi2l EaIDmvZ+zTXObufnsDwYpGAs0TybLhwz5VD52YR2CtKKB3+JhD65OFhq6KEwRJSUjVO9 sNxcGYkTK8AWuUy4+jmJT5778PRJe/OAUz+XHYx7zWBmfKhGiBtRNkKKDFElQ+88mSz2 skFw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011813; x=1764616613; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=dW2hSKXTwR8T2CwBP/MRRLrpmfjOQ9zqQiuqar3vsV4=; b=fif/w8aEKGaJ/0A1scQLSxWNFV8g06HGpWorrGPtJ2peeVhXeNRtPQAvFuWhMghE0r gqkNTwj0JVT0jSJzPxVxt0HSFXlKL0VLgjFyo7fWFs99mPsz5RwI0j0205BkcQZ/VN34 Xk2xCVA5uFx16apSvAk9dzQV9wrJXmtyoxZNLujuI0cJxggTJQg/mg9x03YSpZWbYanO jRDfNVCasQjGWJlpz9gkBgB6owrE7obUj+4YC36+eHUDRTyteW9AkEFDF5H5NG+3tjUh eBR+lBuX0c0iBbKytXPiQXN7LndbmdOvlo87fUKpy8ZZDaVNOLFm46sbA7OBADK2d87+ FE+w== X-Forwarded-Encrypted: i=1; AJvYcCV9zE0bQurcUipO5lbKWvfvDrxUOXtG6G+Bh3o06TkzCppOexSi8q+DUwUZnT0rv1ofX9taktUrV/jyl7w=@vger.kernel.org X-Gm-Message-State: AOJu0YwtSHWlpe0nC49YzHRkl/dGQDJogDfLqdddYtFYv7ESBoYVvSyt QiLl7WkSaDu+LyaIMJqtAP+FcgC3N5qCydNZ/2/MjE3OO3T8s8Y2aCGU X-Gm-Gg: ASbGnct9iw3olB2ErVfQK3bRvjNff63nm9Ptk7vK+Lk2stNBeYS/CB7wWxR3PyfH+OL /8UimrtzT60q/wBjTh2KrMx8AP39H2NSqpG5TmwtRCYiUL3UcBry9fTOPhTyOUN1agUWiGRs4a8 
Ho1pALKTIsEKrBz8LKMbLLuC3+YUCQHY9u0Q2SwYws/jHmcRqACe5NHyaam8FKTQj5vmFHJUvzO hzqsLwQg+MaeED0tAMvpO5JsGeY4WJNi1ayhyeHttP8k8LxbsOjo2q+xQ0f39aGVNACaDauU9JH XMZQUvD8VcDih/Dc+D72AvhvIgMdfl2tfQtdmHR4LiwgkhpuxLm+A0xy7QC6X0ybHYscaUmFEEf SDC16j1ZlIe43CAPfdN/xdyTIZqqR8gfbLR02gu8jAGXMQbKDf09wJBo4yCi7s7IBZ4DddtHdIi g0qJt24pFUjs7nkBjbYkXmIuGWAsC3S/6N1BS34uPZvTIit/5tStkQ3ndXZDc= X-Google-Smtp-Source: AGHT+IEJWjl/z5j8RKVC9K8o8FTnjnCf1r/hFIhxiR9Mkw5UkeNOePRPXbwYfdwjKuCnb/LdcCXnPA== X-Received: by 2002:a05:6a20:a125:b0:35e:835:7ec9 with SMTP id adf61e73a8af0-3613e506c63mr20155913637.17.1764011812685; Mon, 24 Nov 2025 11:16:52 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:52 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:58 +0800 Subject: [PATCH v3 15/19] mm, swap: add folio to swap cache directly on allocation Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-15-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=18567; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=isfv5Ia1RYLaAnBGIOaD0Ewhzh9QTFk5l53JXG2l+S8=; 
b=2Dbui++dBDCvWX+pGy0ME3ZWHBHPq8rezk3u4V4eX4x/XiUlCH/eDWkpliHmO2KltawfWtXLR 58O6D2CwyvXBY3DqG+LD1w1k00URJ+8u/R1vvPe0XK/yGwIhtHDm1yk X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion. This pinning usage here can be dropped by adding the folio to swap cache directly on allocation. All swap allocations are folio-based now (except for hibernation), so the swap allocator can always take the folio as the parameter. And now both swap cache (swap table) and swap map are protected by the cluster lock, scanning the map and inserting the folio can be done in the same critical section. This eliminates the time window that a slot is pinned by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock multiple times. This is both a cleanup and an optimization. Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 5 -- mm/swap.h | 10 +--- mm/swap_state.c | 58 +++++++++++-------- mm/swapfile.c | 157 +++++++++++++++++++++--------------------------= ---- 4 files changed, 101 insertions(+), 129 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ac3caa4c6999..4b4b81fbc6a3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void) } =20 extern void si_swapinfo(struct sysinfo *); -void put_swap_folio(struct folio *folio, swp_entry_t entry); extern int add_swap_count_continuation(swp_entry_t, gfp_t); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); @@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t= ent, int nr) { } =20 -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp) -{ -} - static inline int __swap_count(swp_entry_t entry) { return 0; diff --git a/mm/swap.h b/mm/swap.h index 
9ed12936b889..ec1ef7d0c35b 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct= *si, */ struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, - void **shadow, bool alloc); void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, bool *alloced); /* Below helpers require the caller to lock and pass in the swap cluster. = */ +void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry); void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow); void __swap_cache_replace_folio(struct swap_cluster_info *ci, @@ -459,12 +459,6 @@ static inline void *swap_cache_get_shadow(swp_entry_t = entry) return NULL; } =20 -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t en= try, - void **shadow, bool alloc) -{ - return -ENOENT; -} - static inline void swap_cache_del_folio(struct folio *folio) { } diff --git a/mm/swap_state.c b/mm/swap_state.c index c29b7e386a7c..eb7710120d5f 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -122,35 +122,56 @@ void *swap_cache_get_shadow(swp_entry_t entry) return NULL; } =20 +void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry) +{ + unsigned long new_tb; + unsigned int ci_start, ci_off, ci_end; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); + + new_tb =3D folio_to_swp_tb(folio); + ci_start =3D swp_cluster_offset(entry); + ci_off =3D ci_start; + ci_end =3D ci_start + nr_pages; + do { + 
VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off))); + __swap_table_set(ci, ci_off, new_tb); + } while (++ci_off < ci_end); + + folio_ref_add(folio, nr_pages); + folio_set_swapcache(folio); + folio->swap =3D entry; + + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); +} + /** * swap_cache_add_folio - Add a folio into the swap cache. * @folio: The folio to be added. * @entry: The swap entry corresponding to the folio. * @gfp: gfp_mask for XArray node allocation. * @shadowp: If a shadow is found, return the shadow. - * @alloc: If it's the allocator that is trying to insert a folio. Allocat= or - * sets SWAP_HAS_CACHE to pin slots before insert so skip map upda= te. * * Context: Caller must ensure @entry is valid and protect the swap device * with reference count or locks. */ -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, - void **shadowp, bool alloc) +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadowp) { int err; void *shadow =3D NULL; + unsigned long old_tb; struct swap_info_struct *si; - unsigned long old_tb, new_tb; struct swap_cluster_info *ci; unsigned int ci_start, ci_off, ci_end, offset; unsigned long nr_pages =3D folio_nr_pages(folio); =20 - VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); - VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); - VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); - si =3D __swap_entry_to_info(entry); - new_tb =3D folio_to_swp_tb(folio); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; ci_off =3D ci_start; @@ -166,7 +187,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry= _t entry, err =3D -EEXIST; goto failed; } - if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset))= )) { + if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) { err =3D -ENOENT; goto failed; } @@ -182,20 +203,11 @@ int swap_cache_add_folio(struct 
folio *folio, swp_ent= ry_t entry, * Still need to pin the slots with SWAP_HAS_CACHE since * swap allocator depends on that. */ - if (!alloc) - __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); - __swap_table_set(ci, ci_off, new_tb); + __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); offset++; } while (++ci_off < ci_end); - - folio_ref_add(folio, nr_pages); - folio_set_swapcache(folio); - folio->swap =3D entry; + __swap_cache_add_folio(ci, folio, entry); swap_cluster_unlock(ci); - - node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); - lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); - if (shadowp) *shadowp =3D shadow; return 0; @@ -464,7 +476,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, __folio_set_locked(folio); __folio_set_swapbacked(folio); for (;;) { - ret =3D swap_cache_add_folio(folio, entry, &shadow, false); + ret =3D swap_cache_add_folio(folio, entry, &shadow); if (!ret) break; =20 diff --git a/mm/swapfile.c b/mm/swapfile.c index 7890039d2f65..91368294170f 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -884,28 +884,53 @@ static void swap_cluster_assert_table_empty(struct sw= ap_cluster_info *ci, } } =20 -static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_c= luster_info *ci, - unsigned int start, unsigned char usage, - unsigned int order) +static bool cluster_alloc_range(struct swap_info_struct *si, + struct swap_cluster_info *ci, + struct folio *folio, + unsigned int offset) { - unsigned int nr_pages =3D 1 << order; + unsigned long nr_pages; + unsigned int order; =20 lockdep_assert_held(&ci->lock); =20 if (!(si->flags & SWP_WRITEOK)) return false; =20 + /* + * All mm swap allocations start with a folio (folio_alloc_swap), + * which is also the only allocation path for large order allocations. + * Such swap slots start with count =3D=3D 0 and will be increased + * upon folio unmap. + * + * Else, it's an exclusive order 0 allocation for hibernation.
+ * The slot starts with count =3D=3D 1 and never increases. + */ + if (likely(folio)) { + order =3D folio_order(folio); + nr_pages =3D 1 << order; + /* + * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries. + * This is the legacy allocation behavior, will drop it very soon. + */ + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); + __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset)); + } else { + order =3D 0; + nr_pages =3D 1; + WARN_ON_ONCE(si->swap_map[offset]); + si->swap_map[offset] =3D 1; + swap_cluster_assert_table_empty(ci, offset, 1); + } + /* * The first allocation in a cluster makes the * cluster exclusive to this order */ if (cluster_is_empty(ci)) ci->order =3D order; - - memset(si->swap_map + start, usage, nr_pages); - swap_cluster_assert_table_empty(ci, start, nr_pages); - swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; + swap_range_alloc(si, nr_pages); =20 return true; } @@ -913,13 +938,12 @@ static bool cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster /* Try use a new cluster for current CPU and allocate from it. */ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned int order, - unsigned char usage) + struct folio *folio, unsigned long offset) { unsigned int next =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; unsigned long start =3D ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); + unsigned int order =3D likely(folio) ? 
folio_order(folio) : 0; unsigned int nr_pages =3D 1 << order; bool need_reclaim, ret, usable; =20 @@ -943,7 +967,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, if (!ret) continue; } - if (!cluster_alloc_range(si, ci, offset, usage, order)) + if (!cluster_alloc_range(si, ci, folio, offset)) break; found =3D offset; offset +=3D nr_pages; @@ -965,8 +989,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, =20 static unsigned int alloc_swap_scan_list(struct swap_info_struct *si, struct list_head *list, - unsigned int order, - unsigned char usage, + struct folio *folio, bool scan_all) { unsigned int found =3D SWAP_ENTRY_INVALID; @@ -978,7 +1001,7 @@ static unsigned int alloc_swap_scan_list(struct swap_i= nfo_struct *si, if (!ci) break; offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, order, usage); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset); if (found) break; } while (scan_all); @@ -1039,10 +1062,11 @@ static void swap_reclaim_work(struct work_struct *w= ork) * Try to allocate swap entries with specified order and try set a new * cluster for current CPU too. */ -static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,= int order, - unsigned char usage) +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, + struct folio *folio) { struct swap_cluster_info *ci; + unsigned int order =3D likely(folio) ? 
folio_order(folio) : 0; unsigned int offset =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; =20 /* @@ -1064,8 +1088,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, - order, usage); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset); } else { swap_cluster_unlock(ci); } @@ -1079,22 +1102,19 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, int o * to spread out the writes. */ if (si->flags & SWP_PAGE_DISCARD) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, order, usage, - false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); if (found) goto done; } =20 if (order < PMD_ORDER) { - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], - order, usage, true); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, = true); if (found) goto done; } =20 if (!(si->flags & SWP_PAGE_DISCARD)) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, order, usage, - false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); if (found) goto done; } @@ -1110,8 +1130,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o * failure is not critical. Scanning one cluster still * keeps the list rotated and reclaimed (for HAS_CACHE). */ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], order, - usage, false); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], folio, fal= se); if (found) goto done; } @@ -1125,13 +1144,11 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, int o * Clusters here have at least one usable slots and can't fail order 0 * allocation, but reclaim may drop si->lock and race with another user. 
*/ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], - 0, usage, true); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true); if (found) goto done; =20 - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], - 0, usage, true); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true= ); if (found) goto done; } @@ -1322,12 +1339,12 @@ static bool get_swap_device_info(struct swap_info_s= truct *si) * Fast path try to get swap entries with specified order from current * CPU's swap entry pool (a cluster). */ -static bool swap_alloc_fast(swp_entry_t *entry, - int order) +static bool swap_alloc_fast(struct folio *folio) { + unsigned int order =3D folio_order(folio); struct swap_cluster_info *ci; struct swap_info_struct *si; - unsigned int offset, found =3D SWAP_ENTRY_INVALID; + unsigned int offset; =20 /* * Once allocated, swap_info_struct will never be completely freed, @@ -1342,22 +1359,18 @@ static bool swap_alloc_fast(swp_entry_t *entry, if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE); - if (found) - *entry =3D swp_entry(si->type, found); + alloc_swap_scan_cluster(si, ci, folio, offset); } else { swap_cluster_unlock(ci); } =20 put_swap_device(si); - return !!found; + return folio_test_swapcache(folio); } =20 /* Rotate the device and switch to a new cluster */ -static void swap_alloc_slow(swp_entry_t *entry, - int order) +static void swap_alloc_slow(struct folio *folio) { - unsigned long offset; struct swap_info_struct *si, *next; =20 spin_lock(&swap_avail_lock); @@ -1367,13 +1380,11 @@ static void swap_alloc_slow(swp_entry_t *entry, plist_requeue(&si->avail_list, &swap_avail_head); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - offset =3D cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE); + cluster_alloc_swap_entry(si, folio); put_swap_device(si); - if (offset) { - *entry 
=3D swp_entry(si->type, offset); + if (folio_test_swapcache(folio)) return; - } - if (order) + if (folio_test_large(folio)) return; } =20 @@ -1434,7 +1445,6 @@ int folio_alloc_swap(struct folio *folio) { unsigned int order =3D folio_order(folio); unsigned int size =3D 1 << order; - swp_entry_t entry =3D {}; =20 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio); @@ -1459,39 +1469,23 @@ int folio_alloc_swap(struct folio *folio) =20 again: local_lock(&percpu_swap_cluster.lock); - if (!swap_alloc_fast(&entry, order)) - swap_alloc_slow(&entry, order); + if (!swap_alloc_fast(folio)) + swap_alloc_slow(folio); local_unlock(&percpu_swap_cluster.lock); =20 - if (unlikely(!order && !entry.val)) { + if (!order && unlikely(!folio_test_swapcache(folio))) { if (swap_sync_discard()) goto again; } =20 /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */ - if (mem_cgroup_try_charge_swap(folio, entry)) - goto out_free; + if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap))) + swap_cache_del_folio(folio); =20 - if (!entry.val) + if (unlikely(!folio_test_swapcache(folio))) return -ENOMEM; =20 - /* - * Allocator has pinned the slots with SWAP_HAS_CACHE - * so it should never fail - */ - WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); - - /* - * Allocator should always allocate aligned entries so folio based - * operations never crossed more than one cluster. - */ - VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio); - return 0; - -out_free: - put_swap_folio(folio, entry); - return -ENOMEM; } =20 /** @@ -1790,29 +1784,6 @@ static void swap_entries_free(struct swap_info_struc= t *si, partial_free_cluster(si, ci); } =20 -/* - * Called after dropping swapcache to decrease refcnt to swap entries. 
- */ -void put_swap_folio(struct folio *folio, swp_entry_t entry) -{ - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long offset =3D swp_offset(entry); - int size =3D 1 << swap_entry_order(folio_order(folio)); - - si =3D _swap_info_get(entry); - if (!si) - return; - - ci =3D swap_cluster_lock(si, offset); - if (swap_only_has_cache(si, offset, size)) - swap_entries_free(si, ci, entry, size); - else - for (int i =3D 0; i < size; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); - swap_cluster_unlock(ci); -} - int __swap_count(swp_entry_t entry) { struct swap_info_struct *si =3D __swap_entry_to_info(entry); @@ -2063,7 +2034,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type) * with swap table allocation. */ local_lock(&percpu_swap_cluster.lock); - offset =3D cluster_alloc_swap_entry(si, 0, 1); + offset =3D cluster_alloc_swap_entry(si, NULL); local_unlock(&percpu_swap_cluster.lock); if (offset) entry =3D swp_entry(si->type, offset); --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f170.google.com (mail-pf1-f170.google.com [209.85.210.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9109B318136 for ; Mon, 24 Nov 2025 19:16:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011820; cv=none; b=MBCPqHfvCpuqEevbUoaHDt0WtD6bcVh/0TJJXVQWlUF5xbKJ+rP3vHqV308tswTI9qClY6GVv7EvBxtKweLe+WhpB69SJCo6kOx+FjU7u+b34fCfshD2agcPXcGl3pvxXFz7uPtlymApHwKkRdI8x+zURUP/oCdAUMdTqCZs7d0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011820; c=relaxed/simple; bh=LYBvtJCCNBOihUgW/uBeH8Dg9ldLNOl46SnbYfl+h64=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; 
b=pEIAOGasDb36tM6Zjo+9BcHyqEfR35xuHxSjpezZd/kjXoOwxoqkOj4InPB7Cp+7KkU8UOzVhBiSgYjKDR6PYTOaA0FHQVfgiGB/T/B7Pm1KmscXzJhmvDlypIvauNongR08QNcwQPw4mCdHZZoVXOWwS6I4EsKbYqoggBn7ZAc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=NW3gES2r; arc=none smtp.client-ip=209.85.210.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="NW3gES2r" Received: by mail-pf1-f170.google.com with SMTP id d2e1a72fcca58-7b9a98b751eso3659991b3a.1 for ; Mon, 24 Nov 2025 11:16:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011818; x=1764616618; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=GcrO4+VrgA5lfPDgsn5lPfD0YSBWSARCYG+aVCir9Rs=; b=NW3gES2rMzPSKgPqpkdaLl/4oeO+1RUWJaRGchfTuBC1OZ5f/li1UG1lmD+0dW4bwL 7IeTUqZ2sBvNVV28J6h7Z1P5HQELL6tCVdBgIl6ZJiK9g8IuBHyTaWw2MNOvlawn6f0Y 6RnBNFgwyhft07cJaZvoUUnAD/vUh4cXGAybs+QdYHB4gFpZYcMVHMfKWkJl8Qd0lNmR cz3j83+hAAYcn+5tBdsbRStPlLKoIhUH5ByZ0CHhS3ikwLL6iVn7UEzRKLj87ReDYfU3 Qa9GPGUKBAxbuwZ+TKx3bj8+1cdciT5Wx5Bfz0TsEsGItVw1Of3fA53ueJIDGjT2pXZk b7cg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011818; x=1764616618; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=GcrO4+VrgA5lfPDgsn5lPfD0YSBWSARCYG+aVCir9Rs=; b=XLmaxLw2/XAS938B/g+djkR00Vpa9bxxEbLYOqrAuz4wkjc7AQYVVFrUrrCT7A+3An 
Jh2dLkkwVoeSubrePxr4H/kMQ8fKhMAyaltFEdpoSWCTtDYjj3vzNsB1a665Z0vHA7mY Yqu/ReLUoK6nse6uvFVmTWOL421CNmiXSOfQ6rRGUDhKyovEGfHav1+IPrdvBDwN72np AYgUUXwnuCOvY8boctx3UMOsDYsQX95SlQoi0LMq/lhwC0zGF9R1L/8zC0s8rCF8w4Mt 3qoO2Bb8x3dgao9YhFqD96eKUsH76T2QTpzyRdPpQTasuxgBlC9w+lstOW9KOOQvF/oD TKeA== X-Forwarded-Encrypted: i=1; AJvYcCWFzXoIpnOUeNBQSam1dWLuxcMBuH5NYugrMLkcXyGbvc+vLe2XRrhTvz32WLOmukrG1X+DpqfjE4zvP4E=@vger.kernel.org X-Gm-Message-State: AOJu0YyloqHISyalCsTYfNGiPIrODTkE7okqJsjeRkvY7AvHgh0VFmq2 +H/hBEH2dIErp7k1mMnHbZUNn+WyFMdcbc9ss7LoZe/QUKCfE2pCmq4m X-Gm-Gg: ASbGnctvlr6XBn+6QeR/1/PkUOTv/haNqQP93SwjvRFngR7OY11lAHJyIfu3DSq3926 AbWhnhVf5jFmmZJNkJCi7SLPR8NQAZGapshR4JC2RU4cSqCfnhg3N4pf1Iy/JCX4C9n1/qF14e3 kS+1TvClDGuEbfFo/0FFVGeZWrCaM7LRVJx29eVsTRMJn94sFupgQFbQDISAuAgjOSwNjPoWeOb rrugp3421/qF/7COJDitDDpRUnInvJtSOt0u/ws7+XOFsY7fAya5m+RCsjD/hwxt3TXcKuZ0utW ejkPPdG6YNk73yMMHcWFrNK2DuIrGqeZb+j82OmURT9AwCFnqljzKCfPGTDi540axaGp/YGy89n Nxpy/Q+cgDQyJFtSxXsQLDl2WG+AN/x3ltZmNdsTumtWeRFXjwQtgrpnar1yFAMBFhjWyT1+7Ni k4TNBBdrLywN6H6U6zLMq0QHERx+mt66wKAwNU/UJ53AvQ8hOb X-Google-Smtp-Source: AGHT+IHJUVf2bjQh7PYsc3slGlVxxnG0MimHJ3XPxaJaJmjt78qZ65YN1P9zZdlu49oz0qIj8JsQ9g== X-Received: by 2002:a05:6a20:918d:b0:343:5d53:c0ab with SMTP id adf61e73a8af0-3614ecc985emr13628204637.20.1764011817625; Mon, 24 Nov 2025 11:16:57 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:16:57 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:13:59 +0800 Subject: [PATCH v3 16/19] mm, swap: check swap table directly for checking cache Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-16-33f54f707a5c@tencent.com> References: 
<20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=7912; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=l2u8+7ztYOIx0yRaQjFhJKppLX2cjxl7BELgafgfB5U=; b=D4cavwQuPEBTnSw4UhrTn5nzBgDX/18iZdPVPrIpEkirBt9h4hmGJR4XigoJICqcJ3g67PoIX nkzAJPUJZoYBfWnIoSqnK9IRqBY73FZ5F/4zQ4l4K7DtL8IeGhIEgA4 X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Instead of looking at the swap map, check swap table directly to tell if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swap.h | 11 ++++++++--- mm/swap_state.c | 16 ++++++++++++++++ mm/swapfile.c | 55 +++++++++++++++++++++++++++++-----------------------= --- mm/userfaultfd.c | 10 +++------- 4 files changed, 56 insertions(+), 36 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index ec1ef7d0c35b..3692e143eeba 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *= si, * swap entries in the page table, similar to locking swap cache folio. * - See the comment of get_swap_device() for more complex usage. 
*/ +bool swap_cache_has_folio(swp_entry_t entry); struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); void swap_cache_del_folio(struct folio *folio); @@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry,= int max_nr, =20 static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) { - struct swap_info_struct *si =3D __swap_entry_to_info(entry); - pgoff_t offset =3D swp_offset(entry); int i; =20 /* @@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry= , int max_nr) * be in conflict with the folio in swap cache. */ for (i =3D 0; i < max_nr; i++) { - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE)) + if (swap_cache_has_folio(entry)) return i; + entry.val++; } =20 return i; @@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio, return 0; } =20 +static inline bool swap_cache_has_folio(swp_entry_t entry) +{ + return false; +} + static inline struct folio *swap_cache_get_folio(swp_entry_t entry) { return NULL; diff --git a/mm/swap_state.c b/mm/swap_state.c index eb7710120d5f..94b6d368e3e8 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry) return NULL; } =20 +/** + * swap_cache_has_folio - Check if a swap slot has cache. + * @entry: swap entry indicating the slot. + * + * Context: Caller must ensure @entry is valid and protect the swap + * device with reference count or locks. + */ +bool swap_cache_has_folio(swp_entry_t entry) +{ + unsigned long swp_tb; + + swp_tb =3D swap_table_get(__swap_entry_to_cluster(entry), + swp_cluster_offset(entry)); + return swp_tb_is_folio(swp_tb); +} + /** * swap_cache_get_shadow - Looks up a shadow in the swap cache. * @entry: swap entry used for the lookup. 
diff --git a/mm/swapfile.c b/mm/swapfile.c index 91368294170f..7e28d60d90e1 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -792,23 +792,18 @@ static bool cluster_reclaim_range(struct swap_info_st= ruct *si, unsigned int nr_pages =3D 1 << order; unsigned long offset =3D start, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - int nr_reclaim; + unsigned long swp_tb; =20 spin_unlock(&ci->lock); do { - switch (READ_ONCE(map[offset])) { - case 0: + if (swap_count(READ_ONCE(map[offset]))) break; - case SWAP_HAS_CACHE: - nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); - if (nr_reclaim < 0) - goto out; - break; - default: - goto out; + swp_tb =3D swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swp_tb_is_folio(swp_tb)) { + if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0) + break; } } while (++offset < end); -out: spin_lock(&ci->lock); =20 /* @@ -829,37 +824,41 @@ static bool cluster_reclaim_range(struct swap_info_st= ruct *si, * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. 
*/ - for (offset =3D start; offset < end; offset++) - if (READ_ONCE(map[offset])) + for (offset =3D start; offset < end; offset++) { + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb)) return false; + } =20 return true; } =20 static bool cluster_scan_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages, + unsigned long offset, unsigned int nr_pages, bool *need_reclaim) { - unsigned long offset, end =3D start + nr_pages; + unsigned long end =3D offset + nr_pages; unsigned char *map =3D si->swap_map; + unsigned long swp_tb; =20 if (cluster_is_empty(ci)) return true; =20 - for (offset =3D start; offset < end; offset++) { - switch (READ_ONCE(map[offset])) { - case 0: - continue; - case SWAP_HAS_CACHE: + do { + if (swap_count(map[offset])) + return false; + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swp_tb_is_folio(swp_tb)) { + WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE)); if (!vm_swap_full()) return false; *need_reclaim =3D true; - continue; - default: - return false; + } else { + /* An entry with no count and no cache must be null */ + VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); } - } + } while (++offset < end); =20 return true; } @@ -1026,7 +1025,8 @@ static void swap_reclaim_full_clusters(struct swap_in= fo_struct *si, bool force) to_scan--; =20 while (offset < end) { - if (READ_ONCE(map[offset]) =3D=3D SWAP_HAS_CACHE) { + if (!swap_count(READ_ONCE(map[offset])) && + swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); @@ -1968,6 +1968,7 @@ void swap_put_entries_direct(swp_entry_t entry, int n= r) struct swap_info_struct *si; bool any_only_cache =3D false; unsigned long offset; + unsigned long swp_tb; =20 si =3D get_swap_device(entry); if (WARN_ON_ONCE(!si)) @@ -1992,7 +1993,9 @@ void swap_put_entries_direct(swp_entry_t
entry, int n= r) */ for (offset =3D start_offset; offset < end_offset; offset +=3D nr) { nr =3D 1; - if (READ_ONCE(si->swap_map[offset]) =3D=3D SWAP_HAS_CACHE) { + swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, offset), + offset % SWAPFILE_CLUSTER); + if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_= tb)) { /* * Folios are always naturally aligned in swap so * advance forward to the next boundary. Zero means no diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e6dfd5f28acd..3f28aa319988 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1190,17 +1190,13 @@ static int move_swap_pte(struct mm_struct *mm, stru= ct vm_area_struct *dst_vma, * Check if the swap entry is cached after acquiring the src_pte * lock. Otherwise, we might miss a newly loaded swap cache folio. * - * Check swap_map directly to minimize overhead, READ_ONCE is sufficient. * We are trying to catch newly added swap cache, the only possible case= is * when a folio is swapped in and out again staying in swap cache, using= the * same entry before the PTE check above. The PTL is acquired and releas= ed - * twice, each time after updating the swap_map's flag. So holding - * the PTL here ensures we see the updated value. False positive is poss= ible, - * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the - * cache, or during the tiny synchronization window between swap cache a= nd - * swap_map, but it will be gone very quickly, worst result is retry jit= ters. + * twice, each time after updating the swap table. So holding + * the PTL here ensures we see the updated value. 
*/ - if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) { + if (swap_cache_has_folio(entry)) { double_pt_unlock(dst_ptl, src_ptl); return -EAGAIN; } --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f169.google.com (mail-pf1-f169.google.com [209.85.210.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9873331A7F5 for ; Mon, 24 Nov 2025 19:17:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.169 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011825; cv=none; b=NB5MfrcSiCDoXLJ/petXfxwGJ75P7qSmI43AKzG6YsZAEZuhnPBQ8t54bdy5TbO2yNxuoXjn6Q6/+Da8HiEt3pqvRvUWfyKNH7WaEEINu7f0bxNr5iG1YPPrPcWNMYf9qjIF3jhgLj17GOd6H7gGT252oQA480rMGvoDrX4O63Q= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011825; c=relaxed/simple; bh=cZz7U3kIi7Fr9frfKGJxHv0xp1z5XeYwfIqPzfvKeo4=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=cNQtGdzExQCkaRw0iMNyalbhD4z10s/wbMRWAAr0TBFZpaAliykZnHezL83BU+0+CYV/LIIyBwXrajwfdY7J7X2Rgz9cev1cEygy9JQc1cur9KwRcagCgIuyCXUgFAd199wH0KfGmpgVwMdJ93O58s2iGUJhRIg8pj4DZZhUq0M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=JWfEQ5Q9; arc=none smtp.client-ip=209.85.210.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="JWfEQ5Q9" Received: by mail-pf1-f169.google.com with SMTP id d2e1a72fcca58-7ba49f92362so2732171b3a.1 for ; Mon, 24 Nov 2025 11:17:03 
-0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764011823; x=1764616623; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=qKFQ/7z2VBxapImzqSseNC4lTKb3ZTDXKAUmT+VuH5Q=; b=JWfEQ5Q9P8vIj9mujsqZclcHWTH131HBUtHioAcCY6v1UaE8WdZHDEGN7iJa4FqzUr wRUP37sDyBDN5mLl5bDsIhTKYiUxppR3HCvS8kljDO9tZO8wny1xVu98+ayYPhjFCF/L uJbfNFRfZJ/CFh5RD1Bj+bYd9/WB/64a0l5tyAMTJgl3cE3DousSqhq2yvG74nTIjj41 3IeRFFHDfhmyVfgKRiYOZTdfHVDcQLEUFMle2FQQKnz/Gbw5BUtlqfs0JbxBTiFhRxPu cXYTkDdGMGuxOGN27XIkyQutTVjZNORGAcD7N5Ycsy9lKyeyCMzBRsTaI/esd5uWUAcl v7KA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011823; x=1764616623; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=qKFQ/7z2VBxapImzqSseNC4lTKb3ZTDXKAUmT+VuH5Q=; b=fGSAZim+Le7cwFe5AwCHyHqJpFJXl8tHiqZySzJsBdeFzXEdiOKSHqhVEAFKQFcMUw 1vkYze8yKud1E9lCBtKZgutMIuF2rH1qCgSOC6xvpE0W+g38cowyMFCEovSKy9wo4Txg 49dvL3LzId7SesEhREx0yi/ViKf2aZ+1f0q3IhKOSCcsJDJcA3zUrE2+UVkhmRUiuBPW 9+arT3oFF51g4FgKPQ9sS736YJztUEj3Zj32vTW5U6X6pC/9Ix8qbKBYD2f9mRcANPOF gMQehrakEoBC0UtEfPF/GYgfmp/cFD5zrt+AOHrNZXFJis62vM3dnW5wy/ZI1ozgBeI8 4HEA== X-Forwarded-Encrypted: i=1; AJvYcCWYxjN8Du8ofmbvZ6XSitiZguu7qXfjjeDgy/hQ7afhI4aeFCkcFrl6aYc7F8pnAMXG0wCdPYFIExXNd5Q=@vger.kernel.org X-Gm-Message-State: AOJu0Yw3TRCj3uXzyWg3KBdcZAVJL1VRPHkJOZXOakmhX2nRDActN3Yz nFpuOhDjwcT3wVSzBKXAiwKRvXmAn1fTTco9ujG5hxtp5f/lOehPKsJf X-Gm-Gg: ASbGncubkl4idMUoYdcmJL8AZ+LZnfjbGjWxR39EENSz6++ZHVr4AOunNxbUpYgiZzZ nXPPKYb86FmVhr0R1F0Pzous9FSxlYwfyaoLmbMKz/T7M5ZdypJXgtin0hT5z5EPvBe6aiKA/+T 34WBVL9RzXrTV/9WnSI1AjqovcGhpZexeGOJpOTy1cpDzgjrYYTSNYBXS16z3X00u4aOjZaMhqT FIFiXeELhkq1/X6Q9a8n4s3mzLFGYPSjrTAUyY85W6eXasqnbWuaBjEY/dnyBmfnqCJNVYEqFt2 
nUTJAk37L3MyMH/tb3PMIhOZWfsbKbGy/NRbOCyV0eEmsf2TC3+qZ5csKfNq1MGtdzpb/IRRy0I 8NDgxTodwGYXM15t0L3suYCHCo22Pe+FVnvVdT4IxZqY9f0KQJ9KbdZIBA6+70RYEB/yyEWWRKf NVRZ+ctLURpK83M63N8dovnnerLhQSMD6f/k/B1Q8Uv2Ip3117 X-Google-Smtp-Source: AGHT+IF/PT8KbR1tAqo7AK8HYvyLqub/CkizxT9rRbD7bM96UwbGh1ZzrVk0tYYBIMOKFHyN6Nu1Lw== X-Received: by 2002:a05:6a20:4321:b0:32a:745f:bee3 with SMTP id adf61e73a8af0-3613e56b2f5mr17248958637.21.1764011822559; Mon, 24 Nov 2025 11:17:02 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.16.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:17:02 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:14:00 +0800 Subject: [PATCH v3 17/19] mm, swap: clean up and improve swap entries freeing Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-17-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=13631; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=BF6jwfku37yd6zUniJTYo0hL0HpTBM3Sf7/qOkB7gY8=; b=ARV/qDNrgBe1feo6i09/IbKQGMHm9Hs8ASYADj5iWg83GPTlsVzGx5KL6Mn9lMsffaJkiFLqz 43pX1Ke+r1tAbw3pF8PomjR1Ci+gSHrS/VUojzQReVVsiYLop2EqYt6 X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: 
Kairui Song

There are a few problems with the current freeing of swap entries.

When freeing a set of swap entries directly (swap_put_entries_direct,
typically from zapping the page table), it scans the whole swap region
multiple times. First, it scans the whole region to check whether it
can be batch freed and whether any folio is cached. It then does a
batch free only if every entry in the region has a swap count of 1.
If any entry is cached, even just one, it has to walk the whole region
again to clean up the cache.

If any entry is not in a consistent state with the other entries, it
falls back to order 0 freeing. For example, if only one of them is
cached, the batch free falls back.

The current batch freeing workflow also relies on the swap map's
SWAP_HAS_CACHE bit for both the contiguity check and the batch freeing,
which isn't compatible with the swap table design.

Tidy this up by introducing a new cluster scoped helper for all swap
entry freeing. It batch frees all contiguous entries, and simply starts
a new batch if an inconsistent entry is found. This may improve the
batch size when clusters are fragmented. It should also be more robust
thanks to additional sanity checks, and it makes it clear that a slot
pinned by the swap cache will be freed upon cache reclaim.

The cache reclaim scan is now also limited to each cluster. If a
cluster has any clean swap cache left after putting the swap count,
reclaim only that cluster instead of the whole region.

And since a folio's entries are always in the same cluster, putting
swap entries from a folio can also use the new helper directly.

This should be both an optimization and a cleanup, and the new helper
is adapted to the swap table.
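
The single-pass batching described above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions, not the
kernel implementation: struct toy_cluster, toy_free_run() and
toy_put_range() are hypothetical names invented for this example, there
is no locking, and the swap table lookup is replaced by a plain boolean
array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 16

/* Toy model of one cluster; none of these names are kernel APIs. */
struct toy_cluster {
	unsigned char count[NR_SLOTS];	/* swap count per slot */
	bool cached[NR_SLOTS];		/* slot has a folio in the swap cache */
	unsigned int free_calls;	/* number of batch frees issued */
	unsigned int freed_slots;	/* total slots released by those frees */
};

/* Release a run of slots in one call, as a batch free would. */
static void toy_free_run(struct toy_cluster *c, size_t start, size_t nr)
{
	for (size_t i = start; i < start + nr; i++)
		c->count[i] = 0;
	c->free_calls++;
	c->freed_slots += nr;
}

/*
 * Drop one reference from each slot in [start, start + nr).
 * Returns true if some cached slot reached count 0, i.e. a cache
 * reclaim pass over this range would be worthwhile.
 */
static bool toy_put_range(struct toy_cluster *c, size_t start, size_t nr)
{
	size_t end = start + nr, batch_start = 0;
	bool in_batch = false, need_reclaim = false;

	assert(nr > 0 && end <= NR_SLOTS);
	for (size_t off = start; off < end; off++) {
		assert(c->count[off] >= 1);
		if (c->count[off] == 1 && !c->cached[off]) {
			/* Last reference, not cached: extend the pending batch. */
			if (!in_batch) {
				batch_start = off;
				in_batch = true;
			}
			continue;
		}
		/* Shared or cached slot: just drop the count ... */
		if (--c->count[off] == 0)
			need_reclaim = true;	/* only cached slots reach 0 here */
		/* ... and flush the batch that this slot interrupted. */
		if (in_batch) {
			toy_free_run(c, batch_start, off - batch_start);
			in_batch = false;
		}
	}
	if (in_batch)
		toy_free_run(c, batch_start, end - batch_start);
	return need_reclaim;
}
```

In this model an inconsistent slot only splits the range into smaller
batches; it does not force every slot back to one-at-a-time freeing,
which is the behavior change the message above describes.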
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 238 +++++++++++++++++++++++-------------------------------= ---- 1 file changed, 96 insertions(+), 142 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 7e28d60d90e1..dd1a138ae265 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -55,12 +55,14 @@ static bool swap_count_continued(struct swap_info_struc= t *, pgoff_t, static void free_swap_count_continuations(struct swap_info_struct *); static void swap_entries_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages); + unsigned long start, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); -static bool swap_entries_put_map(struct swap_info_struct *si, - swp_entry_t entry, int nr); +static void swap_put_entry_locked(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -197,25 +199,6 @@ static bool swap_only_has_cache(struct swap_info_struc= t *si, return true; } =20 -static bool swap_is_last_map(struct swap_info_struct *si, - unsigned long offset, int nr_pages, bool *has_cache) -{ - unsigned char *map =3D si->swap_map + offset; - unsigned char *map_end =3D map + nr_pages; - unsigned char count =3D *map; - - if (swap_count(count) !=3D 1) - return false; - - while (++map < map_end) { - if (*map !=3D count) - return false; - } - - *has_cache =3D !!(count & SWAP_HAS_CACHE); - return true; -} - /* * returns number of pages in the folio that backs the swap entry. If posi= tive, * the folio was reclaimed. If negative, the folio was not reclaimed. 
If 0= , no @@ -1431,6 +1414,76 @@ static bool swap_sync_discard(void) return false; } =20 +/** + * swap_put_entries_cluster - Decrease the swap count of a set of slots. + * @si: The swap device. + * @start: start offset of slots. + * @nr: number of slots. + * @reclaim_cache: if true, also reclaim the swap cache. + * + * This helper decreases the swap count of a set of slots and tries to + * batch free them. Also reclaims the swap cache if @reclaim_cache is true. + * Context: The caller must ensure that all slots belong to the same + * cluster and their swap count doesn't go underflow. + */ +static void swap_put_entries_cluster(struct swap_info_struct *si, + unsigned long start, int nr, + bool reclaim_cache) +{ + unsigned long offset =3D start, end =3D start + nr; + unsigned long batch_start =3D SWAP_ENTRY_INVALID; + struct swap_cluster_info *ci; + bool need_reclaim =3D false; + unsigned int nr_reclaimed; + unsigned long swp_tb; + unsigned int count; + + ci =3D swap_cluster_lock(si, offset); + do { + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + count =3D si->swap_map[offset]; + VM_WARN_ON(swap_count(count) < 1 || count =3D=3D SWAP_MAP_BAD); + if (swap_count(count) =3D=3D 1) { + /* count =3D=3D 1 and non-cached slots will be batch freed. */ + if (!swp_tb_is_folio(swp_tb)) { + if (!batch_start) + batch_start =3D offset; + continue; + } + /* count will be 0 after put, slot can be reclaimed */ + VM_WARN_ON(!(count & SWAP_HAS_CACHE)); + need_reclaim =3D true; + } + /* + * A count !=3D 1 or cached slot can't be freed. Put its swap + * count and then free the interrupted pending batch. Cached + * slots will be freed when folio is removed from swap cache + * (__swap_cache_del_folio). 
+ */ + swap_put_entry_locked(si, ci, offset, 1); + if (batch_start) { + swap_entries_free(si, ci, batch_start, offset - batch_start); + batch_start =3D SWAP_ENTRY_INVALID; + } + } while (++offset < end); + + if (batch_start) + swap_entries_free(si, ci, batch_start, offset - batch_start); + swap_cluster_unlock(ci); + + if (!need_reclaim || !reclaim_cache) + return; + + offset =3D start; + do { + nr_reclaimed =3D __try_to_reclaim_swap(si, offset, + TTRS_UNMAPPED | TTRS_FULL); + offset++; + if (nr_reclaimed) + offset =3D round_up(offset, abs(nr_reclaimed)); + } while (offset < end); +} + /** * folio_alloc_swap - allocate swap space for a folio * @folio: folio we want to move to swap @@ -1532,6 +1585,7 @@ void folio_put_swap(struct folio *folio, struct page = *subpage) { swp_entry_t entry =3D folio->swap; unsigned long nr_pages =3D folio_nr_pages(folio); + struct swap_info_struct *si =3D __swap_entry_to_info(entry); =20 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); @@ -1541,7 +1595,7 @@ void folio_put_swap(struct folio *folio, struct page = *subpage) nr_pages =3D 1; } =20 - swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages); + swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false); } =20 static struct swap_info_struct *_swap_info_get(swp_entry_t entry) @@ -1578,12 +1632,11 @@ static struct swap_info_struct *_swap_info_get(swp_= entry_t entry) return NULL; } =20 -static unsigned char swap_entry_put_locked(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, - unsigned char usage) +static void swap_put_entry_locked(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage) { - unsigned long offset =3D swp_offset(entry); unsigned char count; unsigned char has_cache; =20 @@ -1609,9 +1662,7 @@ static unsigned char swap_entry_put_locked(struct swa= p_info_struct *si, if (usage) WRITE_ONCE(si->swap_map[offset], 
usage); else - swap_entries_free(si, ci, entry, 1); - - return usage; + swap_entries_free(si, ci, offset, 1); } =20 /* @@ -1679,70 +1730,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t= entry) return NULL; } =20 -static bool swap_entries_put_map(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - unsigned long offset =3D swp_offset(entry); - struct swap_cluster_info *ci; - bool has_cache =3D false; - unsigned char count; - int i; - - if (nr <=3D 1) - goto fallback; - count =3D swap_count(data_race(si->swap_map[offset])); - if (count !=3D 1) - goto fallback; - - ci =3D swap_cluster_lock(si, offset); - if (!swap_is_last_map(si, offset, nr, &has_cache)) { - goto locked_fallback; - } - if (!has_cache) - swap_entries_free(si, ci, entry, nr); - else - for (i =3D 0; i < nr; i++) - WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); - swap_cluster_unlock(ci); - - return has_cache; - -fallback: - ci =3D swap_cluster_lock(si, offset); -locked_fallback: - for (i =3D 0; i < nr; i++, entry.val++) { - count =3D swap_entry_put_locked(si, ci, entry, 1); - if (count =3D=3D SWAP_HAS_CACHE) - has_cache =3D true; - } - swap_cluster_unlock(ci); - return has_cache; -} - -/* - * Only functions with "_nr" suffix are able to free entries spanning - * cross multi clusters, so ensure the range is within a single cluster - * when freeing entries with functions without "_nr" suffix. - */ -static bool swap_entries_put_map_nr(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - int cluster_nr, cluster_rest; - unsigned long offset =3D swp_offset(entry); - bool has_cache =3D false; - - cluster_rest =3D SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER; - while (nr) { - cluster_nr =3D min(nr, cluster_rest); - has_cache |=3D swap_entries_put_map(si, entry, cluster_nr); - cluster_rest =3D SWAPFILE_CLUSTER; - nr -=3D cluster_nr; - entry.val +=3D cluster_nr; - } - - return has_cache; -} - /* * Check if it's the last ref of swap entry in the freeing path. 
*/ @@ -1757,9 +1744,9 @@ static inline bool __maybe_unused swap_is_last_ref(un= signed char count) */ static void swap_entries_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages) + unsigned long offset, unsigned int nr_pages) { - unsigned long offset =3D swp_offset(entry); + swp_entry_t entry =3D swp_entry(si->type, offset); unsigned char *map =3D si->swap_map + offset; unsigned char *map_end =3D map + nr_pages; =20 @@ -1965,10 +1952,8 @@ void swap_put_entries_direct(swp_entry_t entry, int = nr) { const unsigned long start_offset =3D swp_offset(entry); const unsigned long end_offset =3D start_offset + nr; + unsigned long offset, cluster_end; struct swap_info_struct *si; - bool any_only_cache =3D false; - unsigned long offset; - unsigned long swp_tb; =20 si =3D get_swap_device(entry); if (WARN_ON_ONCE(!si)) @@ -1976,44 +1961,13 @@ void swap_put_entries_direct(swp_entry_t entry, int= nr) if (WARN_ON_ONCE(end_offset > si->max)) goto out; =20 - /* - * First free all entries in the range. - */ - any_only_cache =3D swap_entries_put_map_nr(si, entry, nr); - - /* - * Short-circuit the below loop if none of the entries had their - * reference drop to zero. - */ - if (!any_only_cache) - goto out; - - /* - * Now go back over the range trying to reclaim the swap cache. - */ - for (offset =3D start_offset; offset < end_offset; offset +=3D nr) { - nr =3D 1; - swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, offset), - offset % SWAPFILE_CLUSTER); - if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_= tb)) { - /* - * Folios are always naturally aligned in swap so - * advance forward to the next boundary. Zero means no - * folio was found for the swap entry, so advance by 1 - * in this case. Negative value means folio was found - * but could not be reclaimed. Here we can still advance - * to the next boundary. 
- */ - nr =3D __try_to_reclaim_swap(si, offset, - TTRS_UNMAPPED | TTRS_FULL); - if (nr =3D=3D 0) - nr =3D 1; - else if (nr < 0) - nr =3D -nr; - nr =3D ALIGN(offset + 1, nr) - offset; - } - } - + /* Put entries and reclaim cache in each cluster */ + offset =3D start_offset; + do { + cluster_end =3D min(round_up(offset + 1, SWAPFILE_CLUSTER), end_offset); + swap_put_entries_cluster(si, offset, cluster_end - offset, true); + offset =3D cluster_end; + } while (offset < end_offset); out: put_swap_device(si); } @@ -2060,7 +2014,7 @@ void swap_free_hibernation_slot(swp_entry_t entry) return; =20 ci =3D swap_cluster_lock(si, offset); - swap_entry_put_locked(si, ci, entry, 1); + swap_put_entry_locked(si, ci, offset, 1); WARN_ON(swap_entry_swapped(si, offset)); swap_cluster_unlock(ci); =20 @@ -3815,10 +3769,10 @@ void __swapcache_clear_cached(struct swap_info_stru= ct *si, swp_entry_t entry, unsigned int nr) { if (swap_only_has_cache(si, swp_offset(entry), nr)) { - swap_entries_free(si, ci, entry, nr); + swap_entries_free(si, ci, swp_offset(entry), nr); } else { for (int i =3D 0; i < nr; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE); } } =20 @@ -3939,7 +3893,7 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) * into, carry if so, or else fail until a new continuation page is alloca= ted; * when the original swap_map count is decremented from 0 with continuatio= n, * borrow from the continuation and report whether it still holds more. - * Called while __swap_duplicate() or caller of swap_entry_put_locked() + * Called while __swap_duplicate() or caller of swap_put_entry_locked() * holds cluster lock. 
*/ static bool swap_count_continued(struct swap_info_struct *si, --=20 2.52.0 From nobody Tue Dec 2 00:25:23 2025 Received: from mail-pf1-f169.google.com (mail-pf1-f169.google.com [209.85.210.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C4E252D838C for ; Mon, 24 Nov 2025 19:17:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.169 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011830; cv=none; b=uaHGMZ4SXeJs2SMwZIX975JZmZafyU5plR/WPYvbDxt1lGvOCevdO5Zj1ylXxeQXzUZEv2OGCNYQH06m8FFz413Y74jZlo2lBVzZD8ayPFiLl8r+gxAN5zvCPgYxbSBqwqj9fwb1GLhfVAsBNP9j9exR1Fa6Nyk3z8r+hxOthzI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764011830; c=relaxed/simple; bh=5Jn37etSXmMYxWw5MQDfSlgyQhGrsbuyU3jgZIJCNUU=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=H45M+e6QghH/oKCzwa1MbHo9mZRE/PAHYT7k70rL3gm25oq1c+M2qRJV3sRkY9k9tYTFCawu0e94PN9cSPX9CBRE+1Atr8/+Ht17qMTatoy+ZePvL8BgkxWMRy2DIEvOt+QkUoJdRlaq5o5qR/GOzVUKd5hohPI3guw2s203ssI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=fOThyY2F; arc=none smtp.client-ip=209.85.210.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="fOThyY2F" Received: by mail-pf1-f169.google.com with SMTP id d2e1a72fcca58-7a9cdf62d31so5727845b3a.3 for ; Mon, 24 Nov 2025 11:17:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; 
t=1764011828; x=1764616628; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=oavhXabVAvvtZsCcC1cvU1AnZafkH+SsQyZTTlJRTjA=; b=fOThyY2FFGlzSANt8mddbmxMb0WRi6BXuM6vz3yTdeI/X//qdP5CxUwmH1xnXp9bcb z1Zu+mKl7Vm7fbRXeBdnMPL7IcrzrLNUrYyvV/zsFIMmvJU6vsPbCpCQ53zLUQ+KjNtM 2zaF1chTl7R/lU+UlocS+OQIQerr0c7B0w33MYArKS7ajdrbciiRVdKbHtHhl+hB7bMk Z+vnL+CNqrxLK7U45hTjtwvX5+fPH9v/97FKWM4OlIhdiaTBn7BaA3N1SZsbSnYMT80g PTDpPTzPkQ8Cwzu2ZisB0DAm3sA/wGtjN3SwtjjChqjBV4GbH17bkUocIvaeSkvRFiYj 3Ymw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764011828; x=1764616628; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=oavhXabVAvvtZsCcC1cvU1AnZafkH+SsQyZTTlJRTjA=; b=sCpU3WOBx6EgC0BGjJgiqtBe/TAjlNP38dvmheMkQfMDx8PVcc3n8a0xDH3vDkiUVQ SZoWXNc526SjjFwLAHOAdmhpIJpAJIIkTBLR66WTE1Dk1YdL+ALAkJyffPKUFx3QZeEm 5yAJYtdsyYfKcc0MGs6cWunlu28MFxCaec7+tJ33slkeGDKoI1wdNchYFoVLz3SMXkh3 wK1GPMRsk954I4yBCZ58FLLs+9426BKDYQuDnvGcM3gnEK2B80zzVt5u07VuIXT1PbOO cUbYnA591sdaPuQAesjH4LvQSgY1DISMXIMtA21FIRR0dGjzINvhc/fRSuChEwvsm4rz GzHw== X-Forwarded-Encrypted: i=1; AJvYcCWK9zCbeqkKDJoD2vDm86w3eTN7RCEyJrjvBw+Do+EiNLfSyOAcYwkXxqdqBTOg7EvjudO44DCJ5pwA+Xw=@vger.kernel.org X-Gm-Message-State: AOJu0YyFe1aGeJKU6qK+OLB2aNi13n0M+fklifxUze06HcKMyzKJQpEO LgqVrppXPzboIa2N82U3KBwrbOul+MpJqaRPyu4D2tHJaJCYMeFhCUVo X-Gm-Gg: ASbGnctnig+bVxfFrf6Z9GTIZsBrQiJvqktoV/tELNaGl72tIRCi7MlaGjz/xOgMJjR AipmyxoxPmjT+YEIPZ2w77vrXU/uG+8gInKsY5o5lykgKDPRK3D5qwzuk/hgklaNmdfj1OUMZkE HU4EntfRUqce32VxxdascFr8qM93NxbPUEAIsW/Db16ohO4R6lNAutqI67CWy0TulXGCujAHCg4 JdXzKNIWCzEyjhHFI9Pl+yqMpC9WUowqhYzt6suBfdeGOGnTl3ZcDVqrkZf2BpvVKSk7lRvARu0 tdKpdbCd4DklhlHb2eA+6bTDfNFa/VyVp5JA47/YCmrLNvZ2kUnf4Zg7OapwAzGeL+rdZOzRScV 
iB9f/iEtd42eulAc0CecPK93LmggoSWl5X0+6iIJeb1kcdEz2NBvLGr4pZlvRra7OqDaMMR9rsA TIZ7sSVqm8mbmthfjN49R+wu08YVy3FL2ySdpPff5jqwwt21Zp X-Google-Smtp-Source: AGHT+IEEa0/4DBIyOYVlSoSGhkD1FGqCrGz2j/iQmnzJiouWp299MREFaVh1uEO6xrXh4DYlBHXw+w== X-Received: by 2002:a05:6a20:a12a:b0:342:e2ef:332d with SMTP id adf61e73a8af0-3614edf0345mr13982274637.40.1764011827592; Mon, 24 Nov 2025 11:17:07 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd75def75ffsm14327479a12.3.2025.11.24.11.17.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Nov 2025 11:17:07 -0800 (PST) From: Kairui Song Date: Tue, 25 Nov 2025 03:14:01 +0800 Subject: [PATCH v3 18/19] mm, swap: drop the SWAP_HAS_CACHE flag Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-18-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730; l=20744; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=o84OpWXB1jw3q+8P1SWDt5xur29PoKfAkoQZg5MRGnU=; b=Rc5hMxyabCrYaKu5dbJIvtTYpYR/7TSXXdNmp7udJJaWHZTuFqmCRe7D+eQTu4iXwBUUTiFfu +m7XZWUfhvOBjZdudWffxqNL8gGOGTTj9vm/6gLVY15BKgNijMQGYyI X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Now, the swap cache is managed by the swap table. 
All swap cache users now check the swap table directly to determine the
swap cache state. SWAP_HAS_CACHE is now just a temporary pin held before
the first increase of a slot's swap count from 0 to 1 (swap_dup_entries),
or before the final free of slots pinned by a folio in the swap cache
(put_swap_folio). Drop these two usages.

For the first dup, SWAP_HAS_CACHE pinning was hard to kill because it
used to have multiple meanings, more than just "a slot is cached". We
have simplified that and defined that the first dup is always done
with the folio locked in the swap cache (folio_dup_swap), so it can just
check the swap cache (swap table) directly.

As for freeing, just let the swap cache free all swap entries of a
folio that have a swap count of zero directly upon folio removal. We
have also just cleaned up freeing to cover the swap cache usage in the
swap table: a slot with swap cache will not be freed until its cache is
gone. Now that removing a folio and freeing its slots are done in the
same critical section, this should improve performance and it gets
rid of the SWAP_HAS_CACHE pin.

After these two changes, SWAP_HAS_CACHE no longer has any users. Remove
all related logic and helpers. swap_map is now only used for tracking
the count, so all swap_map users can just read it directly and skip
the swap_count helper, which was previously used to filter out the
SWAP_HAS_CACHE bit.

The idea of dropping SWAP_HAS_CACHE and using the swap table directly
came from Chris's idea of merging all the metadata usage of all
swaps into one place.
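
The slot lifetime rule that results from this change can be sketched as
a toy userspace model. It is an illustration under stated assumptions
only: struct toy_slot, toy_put() and toy_cache_del() are hypothetical
names, the "cached" flag stands in for a swap table entry that holds a
folio, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy slot state once SWAP_HAS_CACHE is gone; not kernel definitions. */
struct toy_slot {
	unsigned char count;	/* plain swap count, no flag bits mixed in */
	bool cached;		/* stand-in for "swap table entry is a folio" */
	bool in_use;		/* slot is still allocated */
};

/* A slot may be released only when nothing references it. */
static bool toy_slot_freeable(const struct toy_slot *s)
{
	return s->count == 0 && !s->cached;
}

/* Drop one swap count; the slot survives while the cache still holds it. */
static void toy_put(struct toy_slot *s)
{
	assert(s->in_use && s->count >= 1);
	s->count--;
	if (toy_slot_freeable(s))
		s->in_use = false;
}

/*
 * Remove a folio covering slots [start, start + nr) from the cache and,
 * in the same step, release every slot whose count is already zero.
 * Returns the number of slots released.
 */
static size_t toy_cache_del(struct toy_slot *slots, size_t start, size_t nr)
{
	size_t freed = 0;

	for (size_t i = start; i < start + nr; i++) {
		assert(slots[i].cached);
		slots[i].cached = false;
		if (toy_slot_freeable(&slots[i])) {
			slots[i].in_use = false;
			freed++;
		}
	}
	return freed;
}
```

The point of the model is that the count and the cache state are two
independent facts, so no extra bit in the count is needed to keep a
cached slot alive.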
Suggested-by: Chris Li Signed-off-by: Kairui Song --- include/linux/swap.h | 1 - mm/swap.h | 13 ++-- mm/swap_state.c | 28 +++++---- mm/swapfile.c | 168 +++++++++++++++++------------------------------= ---- 4 files changed, 77 insertions(+), 133 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 4b4b81fbc6a3..dcb1760e36c3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -224,7 +224,6 @@ enum { #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX =20 /* Bit flag in swap_map */ -#define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ #define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count = */ =20 /* Special value in first swap_map */ diff --git a/mm/swap.h b/mm/swap.h index 3692e143eeba..b2d83e661132 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -205,6 +205,11 @@ int folio_alloc_swap(struct folio *folio); int folio_dup_swap(struct folio *folio, struct page *subpage); void folio_put_swap(struct folio *folio, struct page *subpage); =20 +/* For internal use */ +extern void swap_entries_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, unsigned int nr_pages); + /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; @@ -256,14 +261,6 @@ static inline bool folio_matches_swap_entry(const stru= ct folio *folio, return folio_entry.val =3D=3D round_down(entry.val, nr_pages); } =20 -/* Temporary internal helpers */ -void __swapcache_set_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry); -void __swapcache_clear_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr); - /* * All swap cache helpers below require the caller to ensure the swap entr= ies * used are valid and stablize the device by any of the following ways: diff --git a/mm/swap_state.c b/mm/swap_state.c index 94b6d368e3e8..2251ab8e569e 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -211,17 +211,6 @@ static 
int swap_cache_add_folio(struct folio *folio, s= wp_entry_t entry, shadow =3D swp_tb_to_shadow(old_tb); offset++; } while (++ci_off < ci_end); - - ci_off =3D ci_start; - offset =3D swp_offset(entry); - do { - /* - * Still need to pin the slots with SWAP_HAS_CACHE since - * swap allocator depends on that. - */ - __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); - offset++; - } while (++ci_off < ci_end); __swap_cache_add_folio(ci, folio, entry); swap_cluster_unlock(ci); if (shadowp) @@ -252,6 +241,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *c= i, struct folio *folio, struct swap_info_struct *si; unsigned long old_tb, new_tb; unsigned int ci_start, ci_off, ci_end; + bool folio_swapped =3D false, need_free =3D false; unsigned long nr_pages =3D folio_nr_pages(folio); =20 VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) !=3D ci); @@ -269,13 +259,27 @@ void __swap_cache_del_folio(struct swap_cluster_info = *ci, struct folio *folio, old_tb =3D __swap_table_xchg(ci, ci_off, new_tb); WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) !=3D folio); + if (__swap_count(swp_entry(si->type, + swp_offset(entry) + ci_off - ci_start))) + folio_swapped =3D true; + else + need_free =3D true; } while (++ci_off < ci_end); =20 folio->swap.val =3D 0; folio_clear_swapcache(folio); node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages); lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); - __swapcache_clear_cached(si, ci, entry, nr_pages); + + if (!folio_swapped) { + swap_entries_free(si, ci, swp_offset(entry), nr_pages); + } else if (need_free) { + do { + if (!__swap_count(entry)) + swap_entries_free(si, ci, swp_offset(entry), 1); + entry.val++; + } while (--nr_pages); + } } =20 /** diff --git a/mm/swapfile.c b/mm/swapfile.c index dd1a138ae265..eea904dc08d9 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -48,21 +48,18 @@ #include #include "swap_table.h" #include "internal.h" +#include "swap_table.h" #include "swap.h" =20 static bool 
swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); -static void swap_entries_free(struct swap_info_struct *si, - struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); static void swap_put_entry_locked(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned char usage); + unsigned long offset); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -149,11 +146,6 @@ static struct swap_info_struct *swap_entry_to_info(swp= _entry_t entry) return swap_type_to_info(swp_type(entry)); } =20 -static inline unsigned char swap_count(unsigned char ent) -{ - return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ -} - /* * Use the second highest bit of inuse_pages counter as the indicator * if one swap device is on the available plist, so the atomic can @@ -185,15 +177,20 @@ static long swap_usage_in_pages(struct swap_info_stru= ct *si) #define TTRS_FULL 0x4 =20 static bool swap_only_has_cache(struct swap_info_struct *si, - unsigned long offset, int nr_pages) + struct swap_cluster_info *ci, + unsigned long offset, int nr_pages) { + unsigned int ci_off =3D offset % SWAPFILE_CLUSTER; unsigned char *map =3D si->swap_map + offset; unsigned char *map_end =3D map + nr_pages; + unsigned long swp_tb; =20 do { - VM_BUG_ON(!(*map & SWAP_HAS_CACHE)); - if (*map !=3D SWAP_HAS_CACHE) + swp_tb =3D __swap_table_get(ci, ci_off); + VM_WARN_ON_ONCE(!swp_tb_is_folio(swp_tb)); + if (*map) return false; + ++ci_off; } while (++map < map_end); =20 return true; @@ -248,12 +245,12 @@ static int __try_to_reclaim_swap(struct swap_info_str= uct *si, goto out_unlock; 
=20 /* - * It's safe to delete the folio from swap cache only if the folio's - * swap_map is HAS_CACHE only, which means the slots have no page table + * It's safe to delete the folio from swap cache only if the folio + * is in swap cache with swap count =3D=3D 0. The slots have no page table * reference or pending writeback, and can't be allocated to others. */ ci =3D swap_cluster_lock(si, offset); - need_reclaim =3D swap_only_has_cache(si, offset, nr_pages); + need_reclaim =3D swap_only_has_cache(si, ci, offset, nr_pages); swap_cluster_unlock(ci); if (!need_reclaim) goto out_unlock; @@ -779,7 +776,7 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, =20 spin_unlock(&ci->lock); do { - if (swap_count(READ_ONCE(map[offset]))) + if (READ_ONCE(map[offset])) break; swp_tb =3D swap_table_get(ci, offset % SWAPFILE_CLUSTER); if (swp_tb_is_folio(swp_tb)) { @@ -809,7 +806,7 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, */ for (offset =3D start; offset < end; offset++) { swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); - if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb)) + if (map[offset] || !swp_tb_is_null(swp_tb)) return false; } =20 @@ -829,11 +826,10 @@ static bool cluster_scan_range(struct swap_info_struc= t *si, return true; =20 do { - if (swap_count(map[offset])) + if (map[offset]) return false; swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); if (swp_tb_is_folio(swp_tb)) { - WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE)); if (!vm_swap_full()) return false; *need_reclaim =3D true; @@ -891,11 +887,6 @@ static bool cluster_alloc_range(struct swap_info_struc= t *si, if (likely(folio)) { order =3D folio_order(folio); nr_pages =3D 1 << order; - /* - * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries. - * This is the legacy allocation behavior, will drop it very soon. 
- */ - memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset)); } else { order =3D 0; @@ -1008,8 +999,8 @@ static void swap_reclaim_full_clusters(struct swap_inf= o_struct *si, bool force) to_scan--; =20 while (offset < end) { - if (!swap_count(READ_ONCE(map[offset])) && - swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { + if (!READ_ONCE(map[offset]) && + swp_tb_is_folio(swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); @@ -1111,7 +1102,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, * Scan only one fragment cluster is good enough. Order 0 * allocation will surely success, and large allocation * failure is not critical. Scanning one cluster still - * keeps the list rotated and reclaimed (for HAS_CACHE). + * keeps the list rotated and reclaimed (for clean swap cache). */ found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], folio, fal= se); if (found) @@ -1442,8 +1433,8 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, do { swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); count =3D si->swap_map[offset]; - VM_WARN_ON(swap_count(count) < 1 || count =3D=3D SWAP_MAP_BAD); - if (swap_count(count) =3D=3D 1) { + VM_WARN_ON(count < 1 || count =3D=3D SWAP_MAP_BAD); + if (count =3D=3D 1) { /* count =3D=3D 1 and non-cached slots will be batch freed. */ if (!swp_tb_is_folio(swp_tb)) { if (!batch_start) @@ -1451,7 +1442,6 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, continue; } /* count will be 0 after put, slot can be reclaimed */ - VM_WARN_ON(!(count & SWAP_HAS_CACHE)); need_reclaim =3D true; } /* @@ -1460,7 +1450,7 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, * slots will be freed when folio is removed from swap cache * (__swap_cache_del_folio). 
*/ - swap_put_entry_locked(si, ci, offset, 1); + swap_put_entry_locked(si, ci, offset); if (batch_start) { swap_entries_free(si, ci, batch_start, offset - batch_start); batch_start =3D SWAP_ENTRY_INVALID; @@ -1613,7 +1603,8 @@ static struct swap_info_struct *_swap_info_get(swp_en= try_t entry) offset =3D swp_offset(entry); if (offset >=3D si->max) goto bad_offset; - if (data_race(!si->swap_map[swp_offset(entry)])) + if (data_race(!si->swap_map[swp_offset(entry)]) && + !swap_cache_has_folio(entry)) goto bad_free; return si; =20 @@ -1634,21 +1625,12 @@ static struct swap_info_struct *_swap_info_get(swp_= entry_t entry) =20 static void swap_put_entry_locked(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned char usage) + unsigned long offset) { unsigned char count; - unsigned char has_cache; =20 count =3D si->swap_map[offset]; - - has_cache =3D count & SWAP_HAS_CACHE; - count &=3D ~SWAP_HAS_CACHE; - - if (usage =3D=3D SWAP_HAS_CACHE) { - VM_BUG_ON(!has_cache); - has_cache =3D 0; - } else if ((count & ~COUNT_CONTINUED) <=3D SWAP_MAP_MAX) { + if ((count & ~COUNT_CONTINUED) <=3D SWAP_MAP_MAX) { if (count =3D=3D COUNT_CONTINUED) { if (swap_count_continued(si, offset, count)) count =3D SWAP_MAP_MAX | COUNT_CONTINUED; @@ -1658,10 +1640,8 @@ static void swap_put_entry_locked(struct swap_info_s= truct *si, count--; } =20 - usage =3D count | has_cache; - if (usage) - WRITE_ONCE(si->swap_map[offset], usage); - else + WRITE_ONCE(si->swap_map[offset], count); + if (!count && !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLU= STER))) swap_entries_free(si, ci, offset, 1); } =20 @@ -1730,21 +1710,13 @@ struct swap_info_struct *get_swap_device(swp_entry_= t entry) return NULL; } =20 -/* - * Check if it's the last ref of swap entry in the freeing path. 
- */ -static inline bool __maybe_unused swap_is_last_ref(unsigned char count) -{ - return (count =3D=3D SWAP_HAS_CACHE) || (count =3D=3D 1); -} - /* * Drop the last ref of swap entries, caller have to ensure all entries * belong to the same cgroup and cluster. */ -static void swap_entries_free(struct swap_info_struct *si, - struct swap_cluster_info *ci, - unsigned long offset, unsigned int nr_pages) +void swap_entries_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, unsigned int nr_pages) { swp_entry_t entry =3D swp_entry(si->type, offset); unsigned char *map =3D si->swap_map + offset; @@ -1757,7 +1729,7 @@ static void swap_entries_free(struct swap_info_struct= *si, =20 ci->count -=3D nr_pages; do { - VM_BUG_ON(!swap_is_last_ref(*map)); + VM_WARN_ON(*map > 1); *map =3D 0; } while (++map < map_end); =20 @@ -1776,7 +1748,7 @@ int __swap_count(swp_entry_t entry) struct swap_info_struct *si =3D __swap_entry_to_info(entry); pgoff_t offset =3D swp_offset(entry); =20 - return swap_count(si->swap_map[offset]); + return si->swap_map[offset]; } =20 /** @@ -1790,7 +1762,7 @@ bool swap_entry_swapped(struct swap_info_struct *si, = unsigned long offset) int count; =20 ci =3D swap_cluster_lock(si, offset); - count =3D swap_count(si->swap_map[offset]); + count =3D si->swap_map[offset]; swap_cluster_unlock(ci); =20 return count && count !=3D SWAP_MAP_BAD; @@ -1817,7 +1789,7 @@ int swp_swapcount(swp_entry_t entry) =20 ci =3D swap_cluster_lock(si, offset); =20 - count =3D swap_count(si->swap_map[offset]); + count =3D si->swap_map[offset]; if (!(count & COUNT_CONTINUED)) goto out; =20 @@ -1855,12 +1827,12 @@ static bool swap_page_trans_huge_swapped(struct swa= p_info_struct *si, =20 ci =3D swap_cluster_lock(si, offset); if (nr_pages =3D=3D 1) { - if (swap_count(map[roffset])) + if (map[roffset]) ret =3D true; goto unlock_out; } for (i =3D 0; i < nr_pages; i++) { - if (swap_count(map[offset + i])) { + if (map[offset + i]) { ret =3D true; break; 
} @@ -2014,7 +1986,7 @@ void swap_free_hibernation_slot(swp_entry_t entry) return; =20 ci =3D swap_cluster_lock(si, offset); - swap_put_entry_locked(si, ci, offset, 1); + swap_put_entry_locked(si, ci, offset); WARN_ON(swap_entry_swapped(si, offset)); swap_cluster_unlock(ci); =20 @@ -2420,6 +2392,7 @@ static unsigned int find_next_to_unuse(struct swap_in= fo_struct *si, unsigned int prev) { unsigned int i; + unsigned long swp_tb; unsigned char count; =20 /* @@ -2430,7 +2403,11 @@ static unsigned int find_next_to_unuse(struct swap_i= nfo_struct *si, */ for (i =3D prev + 1; i < si->max; i++) { count =3D READ_ONCE(si->swap_map[i]); - if (count && swap_count(count) !=3D SWAP_MAP_BAD) + swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, i), + i % SWAPFILE_CLUSTER); + if (count =3D=3D SWAP_MAP_BAD) + continue; + if (count || swp_tb_is_folio(swp_tb)) break; if ((i % LATENCY_LIMIT) =3D=3D 0) cond_resched(); @@ -3655,8 +3632,7 @@ void si_swapinfo(struct sysinfo *val) * Returns error code in following case. * - success -> 0 * - swp_entry is invalid -> EINVAL - * - swap-cache reference is requested but there is already one. -> EEXIST - * - swap-cache reference is requested but the entry is not used. -> ENOENT + * - swap-mapped reference is requested but the entry is not used. -> ENOE= NT * - swap-mapped reference requested but needs continued swap count. -> EN= OMEM */ static int swap_dup_entries(struct swap_info_struct *si, @@ -3665,39 +3641,28 @@ static int swap_dup_entries(struct swap_info_struct= *si, unsigned char usage, int nr) { int i; - unsigned char count, has_cache; + unsigned char count; =20 for (i =3D 0; i < nr; i++) { count =3D si->swap_map[offset + i]; - /* * Allocator never allocates bad slots, and readahead is guarded * by swap_entry_swapped. 
*/ - if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) + if (WARN_ON(count =3D=3D SWAP_MAP_BAD)) return -ENOENT; - - has_cache =3D count & SWAP_HAS_CACHE; - count &=3D ~SWAP_HAS_CACHE; - - if (!count && !has_cache) { + /* + * Swap count duplication must be guarded by either locked swap cache + * folio (from folio_dup_swap) or external lock (from swap_dup_entry_dir= ect). + */ + if (WARN_ON(!count && + !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER)))) return -ENOENT; - } else if (usage =3D=3D SWAP_HAS_CACHE) { - if (has_cache) - return -EEXIST; - } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { - return -EINVAL; - } } =20 for (i =3D 0; i < nr; i++) { count =3D si->swap_map[offset + i]; - has_cache =3D count & SWAP_HAS_CACHE; - count &=3D ~SWAP_HAS_CACHE; - - if (usage =3D=3D SWAP_HAS_CACHE) - has_cache =3D SWAP_HAS_CACHE; - else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) count +=3D usage; else if (swap_count_continued(si, offset + i, count)) count =3D COUNT_CONTINUED; @@ -3709,7 +3674,7 @@ static int swap_dup_entries(struct swap_info_struct *= si, return -ENOMEM; } =20 - WRITE_ONCE(si->swap_map[offset + i], count | has_cache); + WRITE_ONCE(si->swap_map[offset + i], count); } =20 return 0; @@ -3755,27 +3720,6 @@ int swap_dup_entry_direct(swp_entry_t entry) return err; } =20 -/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */ -void __swapcache_set_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry) -{ - WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1)); -} - -/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock = */ -void __swapcache_clear_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr) -{ - if (swap_only_has_cache(si, swp_offset(entry), nr)) { - swap_entries_free(si, ci, swp_offset(entry), nr); - } else { - for (int i =3D 0; i < 
 nr; i++, entry.val++) - swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE); - } -} - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entr= y's @@ -3821,7 +3765,7 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) =20 ci =3D swap_cluster_lock(si, offset); =20 - count =3D swap_count(si->swap_map[offset]); + count =3D si->swap_map[offset]; =20 if ((count & ~COUNT_CONTINUED) !=3D SWAP_MAP_MAX) { /* --=20 2.52.0
From nobody Tue Dec 2 00:25:23 2025 From: Kairui Song Date: Tue, 25 Nov 2025 03:14:02 +0800 Subject: [PATCH v3 19/19] mm, swap: remove no longer needed _swap_info_get Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251125-swap-table-p2-v3-19-33f54f707a5c@tencent.com> References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764011730;
l=3397; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=C+wGETtNWjFBAIszWBWUQ0AB6Ts3cedoLdPlGNc1qko=; b=xAXi8hRLR0CKEwSihjhMaupTrBszmAPy2MMlSOA+zsn3GBmo/KsQJzG0X12L1V0ywNHH1JQIZ wj0EI+ZhtflA+uAjgX35RDPXjkdyoeyNrh1egeU6J4HBvuZNzMCHHRh X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song There are now only two users of _swap_info_get after consolidating these callers, folio_free_swap and swp_swapcount. folio_free_swap already holds the folio lock, and the folio must be in the swap cache, _swap_info_get is redundant. For swp_swapcount, it should use get_swap_device instead. get_swap_device increases the device ref count, which is actually a bit safer. The only current use is smap walking, and the performance change here is tiny. And after these changes, _swap_info_get is no longer used, so we can safely remove it. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 47 ++++++----------------------------------------- 1 file changed, 6 insertions(+), 41 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index eea904dc08d9..feb57e040ef1 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -83,9 +83,7 @@ bool swap_migration_ad_supported; #endif /* CONFIG_MIGRATION */ =20 static const char Bad_file[] =3D "Bad swap file entry "; -static const char Unused_file[] =3D "Unused swap file entry "; static const char Bad_offset[] =3D "Bad swap offset entry "; -static const char Unused_offset[] =3D "Unused swap offset entry "; =20 /* * all active swap_info_structs @@ -1588,41 +1586,6 @@ void folio_put_swap(struct folio *folio, struct page= *subpage) swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false); } =20 -static struct swap_info_struct *_swap_info_get(swp_entry_t entry) -{ - struct swap_info_struct *si; - unsigned long offset; - - if (!entry.val) - goto out; - si =3D swap_entry_to_info(entry); - if (!si) - goto bad_nofile; - if (data_race(!(si->flags & 
SWP_USED))) - goto bad_device; - offset =3D swp_offset(entry); - if (offset >=3D si->max) - goto bad_offset; - if (data_race(!si->swap_map[swp_offset(entry)]) && - !swap_cache_has_folio(entry)) - goto bad_free; - return si; - -bad_free: - pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val); - goto out; -bad_offset: - pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); - goto out; -bad_device: - pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val); - goto out; -bad_nofile: - pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val); -out: - return NULL; -} - static void swap_put_entry_locked(struct swap_info_struct *si, struct swap_cluster_info *ci, unsigned long offset) @@ -1781,7 +1744,7 @@ int swp_swapcount(swp_entry_t entry) pgoff_t offset; unsigned char *map; =20 - si =3D _swap_info_get(entry); + si =3D get_swap_device(entry); if (!si) return 0; =20 @@ -1811,6 +1774,7 @@ int swp_swapcount(swp_entry_t entry) } while (tmp_count & COUNT_CONTINUED); out: swap_cluster_unlock(ci); + put_swap_device(si); return count; } =20 @@ -1845,11 +1809,12 @@ static bool swap_page_trans_huge_swapped(struct swa= p_info_struct *si, static bool folio_swapped(struct folio *folio) { swp_entry_t entry =3D folio->swap; - struct swap_info_struct *si =3D _swap_info_get(entry); + struct swap_info_struct *si; =20 - if (!si) - return false; + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); =20 + si =3D __swap_entry_to_info(entry); if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio))) return swap_entry_swapped(si, swp_offset(entry)); =20 --=20 2.52.0