From nobody Tue Dec 16 05:36:42 2025
From: Kairui Song
Date: Fri, 05 Dec 2025 03:29:09 +0800
Subject: [PATCH v4 01/19] mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio
Message-Id: <20251205-swap-table-p2-v4-1-cb7e28a26a40@tencent.com>
References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

__read_swap_cache_async() is widely used to allocate a folio and ensure it is in the swap cache, or to return the folio if one is already there. It is not async, and it does not do any read. Rename it to better reflect its usage, and prepare it to be reworked as part of the new swap cache APIs. Also add some comments for the function.

Worth noting that the skip_if_exists argument is a long-standing workaround that will be dropped soon.
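For illustration, here is a caller-side sketch of the renamed helper, mirroring the read_swap_cache_async() path in this patch. The wrapper function and its name are hypothetical, shown only to demonstrate the calling convention, not part of the patch:

/*
 * Hypothetical wrapper, not in-tree code: demonstrates how a swap-in
 * path calls the renamed helper. swap_cache_alloc_folio() returns the
 * folio already in the swap cache, or a newly allocated one bound to
 * @entry; the caller then starts the actual read.
 */
static struct folio *example_swapin_one(swp_entry_t entry, gfp_t gfp,
					struct mempolicy *mpol, pgoff_t ilx)
{
	bool page_allocated;
	struct folio *folio;

	folio = swap_cache_alloc_folio(entry, gfp, mpol, ilx,
				       &page_allocated, false);
	if (folio && page_allocated)
		swap_read_folio(folio, NULL);
	return folio;
}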
Reviewed-by: Yosry Ahmed Acked-by: Chris Li Reviewed-by: Barry Song Reviewed-by: Nhat Pham Signed-off-by: Kairui Song Reviewed-by: Baoquan He Suggested-by: Chris Li --- mm/swap.h | 6 +++--- mm/swap_state.c | 46 +++++++++++++++++++++++++++++++++------------- mm/swapfile.c | 2 +- mm/zswap.c | 4 ++-- 4 files changed, 39 insertions(+), 19 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index d034c13d8dd2..0fff92e42cfe 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -249,6 +249,9 @@ struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **s= hadow); void swap_cache_del_folio(struct folio *folio); +struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, + struct mempolicy *mpol, pgoff_t ilx, + bool *alloced, bool skip_if_exists); /* Below helpers require the caller to lock and pass in the swap cluster. = */ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow); @@ -261,9 +264,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_e= ntry_t entry, int nr); struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, struct vm_area_struct *vma, unsigned long addr, struct swap_iocb **plug); -struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags, - struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated, - bool skip_if_exists); struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, struct mempolicy *mpol, pgoff_t ilx); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, diff --git a/mm/swap_state.c b/mm/swap_state.c index 5f97c6ae70a2..08252eaef32f 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -402,9 +402,29 @@ void swap_update_readahead(struct folio *folio, struct= vm_area_struct *vma, } } =20 -struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, - struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated, - bool skip_if_exists) +/** + * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap ca= che. + * @entry: the swapped out swap entry to be binded to the folio. + * @gfp_mask: memory allocation flags + * @mpol: NUMA memory allocation policy to be applied + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE + * @new_page_allocated: sets true if allocation happened, false otherwise + * @skip_if_exists: if the slot is a partially cached state, return NULL. + * This is a workaround that would be removed shortly. + * + * Allocate a folio in the swap cache for one swap slot, typically before + * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by + * @entry must have a non-zero swap count (swapped out). + * Currently only supports order 0. + * + * Context: Caller must protect the swap device with reference count or lo= cks. + * Return: Returns the existing folio if @entry is cached already. Returns + * NULL if failed due to -ENOMEM or @entry have a swap count < 1. 
+ */ +struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask, + struct mempolicy *mpol, pgoff_t ilx, + bool *new_page_allocated, + bool skip_if_exists) { struct swap_info_struct *si =3D __swap_entry_to_info(entry); struct folio *folio; @@ -452,12 +472,12 @@ struct folio *__read_swap_cache_async(swp_entry_t ent= ry, gfp_t gfp_mask, goto put_and_return; =20 /* - * Protect against a recursive call to __read_swap_cache_async() + * Protect against a recursive call to swap_cache_alloc_folio() * on the same entry waiting forever here because SWAP_HAS_CACHE * is set but the folio is not the swap cache yet. This can * happen today if mem_cgroup_swapin_charge_folio() below * triggers reclaim through zswap, which may call - * __read_swap_cache_async() in the writeback path. + * swap_cache_alloc_folio() in the writeback path. */ if (skip_if_exists) goto put_and_return; @@ -466,7 +486,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry= , gfp_t gfp_mask, * We might race against __swap_cache_del_folio(), and * stumble across a swap_map entry whose SWAP_HAS_CACHE * has not yet been cleared. Or race against another - * __read_swap_cache_async(), which has set SWAP_HAS_CACHE + * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE * in swap_map, but not yet added its folio to swap cache. */ schedule_timeout_uninterruptible(1); @@ -525,7 +545,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, = gfp_t gfp_mask, return NULL; =20 mpol =3D get_vma_policy(vma, addr, 0, &ilx); - folio =3D __read_swap_cache_async(entry, gfp_mask, mpol, ilx, + folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, &page_allocated, false); mpol_cond_put(mpol); =20 @@ -643,9 +663,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, blk_start_plug(&plug); for (offset =3D start_offset; offset <=3D end_offset ; offset++) { /* Ok, do the async read-ahead now */ - folio =3D __read_swap_cache_async( - swp_entry(swp_type(entry), offset), - gfp_mask, mpol, ilx, &page_allocated, false); + folio =3D swap_cache_alloc_folio( + swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx, + &page_allocated, false); if (!folio) continue; if (page_allocated) { @@ -662,7 +682,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, lru_add_drain(); /* Push any new pages onto the LRU now */ skip: /* The page was likely read above, so no need for plugging here */ - folio =3D __read_swap_cache_async(entry, gfp_mask, mpol, ilx, + folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, &page_allocated, false); if (unlikely(page_allocated)) swap_read_folio(folio, NULL); @@ -767,7 +787,7 @@ static struct folio *swap_vma_readahead(swp_entry_t tar= g_entry, gfp_t gfp_mask, if (!si) continue; } - folio =3D __read_swap_cache_async(entry, gfp_mask, mpol, ilx, + folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, &page_allocated, false); if (si) put_swap_device(si); @@ -789,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t tar= g_entry, gfp_t gfp_mask, lru_add_drain(); skip: /* The folio was likely read above, so no need for plugging here */ - folio =3D __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx, + folio =3D swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx, &page_allocated, false); if (unlikely(page_allocated)) swap_read_folio(folio, NULL); diff --git a/mm/swapfile.c b/mm/swapfile.c index 46d2008e4b99..e5284067a442 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1574,7 +1574,7 @@ static unsigned char 
swap_entry_put_locked(struct swap_info_struct *si,
 *	CPU1				CPU2
 *	do_swap_page()
 *	  ...				swapoff+swapon
- *	  __read_swap_cache_async()
+ *	  swap_cache_alloc_folio()
 *	    swapcache_prepare()
 *	      __swap_duplicate()
 *	        // check swap_map
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958..a7a2443912f4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1014,8 +1014,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 		return -EEXIST;
 
 	mpol = get_task_policy(current);
-	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
-			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+	folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
+			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
 	put_swap_device(si);
 	if (!folio)
 		return -ENOMEM;
-- 
2.52.0

From nobody Tue Dec 16 05:36:42 2025
From: Kairui Song
Date: Fri, 05 Dec 2025 03:29:10 +0800
Subject: [PATCH v4 02/19] mm, swap: split swap cache preparation loop into a standalone helper
Message-Id: <20251205-swap-table-p2-v4-2-cb7e28a26a40@tencent.com>
References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

To prepare for the removal of swap cache bypass swapin, introduce a new helper that accepts an allocated and charged fresh folio, prepares the folio and the swap map, and then adds the folio to the swap cache. This doesn't change how the swap cache works yet; we still depend on SWAP_HAS_CACHE in the swap map for synchronization. But all the synchronization hacks are now contained in this single helper. No feature change.
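The resulting control flow of swap_cache_alloc_folio() can be summarized with the following simplified sketch (condensed from the diff below; the unused-slot check and some error handling are omitted, and the function body here is illustrative rather than a verbatim copy of the patch):

static struct folio *swap_cache_alloc_folio_sketch(swp_entry_t entry, gfp_t gfp,
						   struct mempolicy *mpol, pgoff_t ilx,
						   bool *new_page_allocated)
{
	struct folio *folio, *result;

	*new_page_allocated = false;

	/* Fast path: someone else already put a folio in the swap cache. */
	folio = swap_cache_get_folio(entry);
	if (folio)
		return folio;

	folio = folio_alloc_mpol(gfp, 0, mpol, ilx, numa_node_id());
	if (!folio)
		return NULL;

	/* All SWAP_HAS_CACHE pinning and race handling now live in the helper. */
	result = __swap_cache_prepare_and_add(entry, folio, gfp, false, false);
	if (result == folio)
		*new_page_allocated = true;
	else
		folio_put(folio);	/* lost the race or the slot went away */
	return result;
}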
Acked-by: Chris Li Reviewed-by: Barry Song Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swap_state.c | 197 +++++++++++++++++++++++++++++++---------------------= ---- 1 file changed, 109 insertions(+), 88 deletions(-) diff --git a/mm/swap_state.c b/mm/swap_state.c index 08252eaef32f..a8511ce43242 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -402,6 +402,97 @@ void swap_update_readahead(struct folio *folio, struct= vm_area_struct *vma, } } =20 +/** + * __swap_cache_prepare_and_add - Prepare the folio and add it to swap cac= he. + * @entry: swap entry to be bound to the folio. + * @folio: folio to be added. + * @gfp: memory allocation flags for charge, can be 0 if @charged if true. + * @charged: if the folio is already charged. + * @skip_if_exists: if the slot is in a cached state, return NULL. + * This is an old workaround that will be removed shortly. + * + * Update the swap_map and add folio as swap cache, typically before swapi= n. + * All swap slots covered by the folio must have a non-zero swap count. + * + * Context: Caller must protect the swap device with reference count or lo= cks. + * Return: Returns the folio being added on success. Returns the existing = folio + * if @entry is already cached. Returns NULL if raced with swapin or swapo= ff. + */ +static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry, + struct folio *folio, + gfp_t gfp, bool charged, + bool skip_if_exists) +{ + struct folio *swapcache; + void *shadow; + int ret; + + /* + * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio + * into the swap cache. Loop with a schedule delay if raced with + * another process setting SWAP_HAS_CACHE. This hackish loop will + * be fixed very soon. + */ + for (;;) { + ret =3D swapcache_prepare(entry, folio_nr_pages(folio)); + if (!ret) + break; + + /* + * The skip_if_exists is for protecting against a recursive + * call to this helper on the same entry waiting forever + * here because SWAP_HAS_CACHE is set but the folio is not + * in the swap cache yet. This can happen today if + * mem_cgroup_swapin_charge_folio() below triggers reclaim + * through zswap, which may call this helper again in the + * writeback path. + * + * Large order allocation also needs special handling on + * race: if a smaller folio exists in cache, swapin needs + * to fallback to order 0, and doing a swap cache lookup + * might return a folio that is irrelevant to the faulting + * entry because @entry is aligned down. Just return NULL. + */ + if (ret !=3D -EEXIST || skip_if_exists || folio_test_large(folio)) + return NULL; + + /* + * Check the swap cache again, we can only arrive + * here because swapcache_prepare returns -EEXIST. + */ + swapcache =3D swap_cache_get_folio(entry); + if (swapcache) + return swapcache; + + /* + * We might race against __swap_cache_del_folio(), and + * stumble across a swap_map entry whose SWAP_HAS_CACHE + * has not yet been cleared. Or race against another + * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE + * in swap_map, but not yet added its folio to swap cache. 
+ */ + schedule_timeout_uninterruptible(1); + } + + __folio_set_locked(folio); + __folio_set_swapbacked(folio); + + if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) { + put_swap_folio(folio, entry); + folio_unlock(folio); + return NULL; + } + + swap_cache_add_folio(folio, entry, &shadow); + memcg1_swapin(entry, folio_nr_pages(folio)); + if (shadow) + workingset_refault(folio, shadow); + + /* Caller will initiate read into locked folio */ + folio_add_lru(folio); + return folio; +} + /** * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap ca= che. * @entry: the swapped out swap entry to be binded to the folio. @@ -428,99 +519,29 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entr= y, gfp_t gfp_mask, { struct swap_info_struct *si =3D __swap_entry_to_info(entry); struct folio *folio; - struct folio *new_folio =3D NULL; struct folio *result =3D NULL; - void *shadow =3D NULL; =20 *new_page_allocated =3D false; - for (;;) { - int err; - - /* - * Check the swap cache first, if a cached folio is found, - * return it unlocked. The caller will lock and check it. - */ - folio =3D swap_cache_get_folio(entry); - if (folio) - goto got_folio; - - /* - * Just skip read ahead for unused swap slot. - */ - if (!swap_entry_swapped(si, entry)) - goto put_and_return; - - /* - * Get a new folio to read into from swap. Allocate it now if - * new_folio not exist, before marking swap_map SWAP_HAS_CACHE, - * when -EEXIST will cause any racers to loop around until we - * add it to cache. - */ - if (!new_folio) { - new_folio =3D folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); - if (!new_folio) - goto put_and_return; - } - - /* - * Swap entry may have been freed since our caller observed it. - */ - err =3D swapcache_prepare(entry, 1); - if (!err) - break; - else if (err !=3D -EEXIST) - goto put_and_return; - - /* - * Protect against a recursive call to swap_cache_alloc_folio() - * on the same entry waiting forever here because SWAP_HAS_CACHE - * is set but the folio is not the swap cache yet. This can - * happen today if mem_cgroup_swapin_charge_folio() below - * triggers reclaim through zswap, which may call - * swap_cache_alloc_folio() in the writeback path. - */ - if (skip_if_exists) - goto put_and_return; + /* Check the swap cache again for readahead path. */ + folio =3D swap_cache_get_folio(entry); + if (folio) + return folio; =20 - /* - * We might race against __swap_cache_del_folio(), and - * stumble across a swap_map entry whose SWAP_HAS_CACHE - * has not yet been cleared. Or race against another - * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE - * in swap_map, but not yet added its folio to swap cache. - */ - schedule_timeout_uninterruptible(1); - } - - /* - * The swap entry is ours to swap in. Prepare the new folio. - */ - __folio_set_locked(new_folio); - __folio_set_swapbacked(new_folio); - - if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry)) - goto fail_unlock; - - swap_cache_add_folio(new_folio, entry, &shadow); - memcg1_swapin(entry, 1); + /* Skip allocation for unused swap slot for readahead path. 
*/
+	if (!swap_entry_swapped(si, entry))
+		return NULL;
 
-	if (shadow)
-		workingset_refault(new_folio, shadow);
-
-	/* Caller will initiate read into locked new_folio */
-	folio_add_lru(new_folio);
-	*new_page_allocated = true;
-	folio = new_folio;
-got_folio:
-	result = folio;
-	goto put_and_return;
-
-fail_unlock:
-	put_swap_folio(new_folio, entry);
-	folio_unlock(new_folio);
-put_and_return:
-	if (!(*new_page_allocated) && new_folio)
-		folio_put(new_folio);
+	/* Allocate a new folio to be added into the swap cache. */
+	folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+	if (!folio)
+		return NULL;
+	/* Try add the new folio, returns existing folio or NULL on failure. */
+	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
+					      false, skip_if_exists);
+	if (result == folio)
+		*new_page_allocated = true;
+	else
+		folio_put(folio);
 	return result;
 }
 
-- 
2.52.0

From nobody Tue Dec 16 05:36:42 2025
From: Kairui Song
Date: Fri, 05 Dec 2025 03:29:11 +0800
Subject: [PATCH v4 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
Message-Id: <20251205-swap-table-p2-v4-3-cb7e28a26a40@tencent.com>
References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

Now that the overhead of the swap cache is trivial, bypassing the swap cache is no longer a valid optimization, so unify the swapin path to always go through the swap cache. This changes the swap-in behavior in two observable ways.
Readahead is now always disabled for SWP_SYNCHRONOUS_IO devices, which is a huge win for some workloads: We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) =3D=3D 1` as the indicator to bypass both the swap cache and readahead, the swap count check made bypassing ineffective in many cases, and it's not a good indicator. The limitation existed because the current swap design made it hard to decouple readahead bypassing and swap cache bypassing. We do want to always bypass readahead for SWP_SYNCHRONOUS_IO devices, but bypassing swap cache at the same time will cause repeated IO and memory overhead. Now that swap cache bypassing is gone, this swap count check can be dropped. The second thing here is that this enabled large swapin for all swap entries on SWP_SYNCHRONOUS_IO devices. Previously, the large swap in is also coupled with swap cache bypassing, and so the swap count checking also makes large swapin less effective. Now this is also improved. We will always have large swapin supported for all SWP_SYNCHRONOUS_IO cases. And to catch potential issues with large swapin, especially with page exclusiveness and swap cache, more debug sanity checks and comments are added. But overall, the code is simpler. And new helper and routines will be used by other components in later commits too. And now it's possible to rely on the swap cache layer for resolving synchronization issues, which will also be done by a later commit. Worth mentioning that for a large folio workload, this may cause more serious thrashing. This isn't a problem with this commit, but a generic large folio issue. For a 4K workload, this commit increases the performance. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 137 +++++++++++++++++++++-------------------------------= ---- mm/swap.h | 6 +++ mm/swap_state.c | 27 +++++++++++ 3 files changed, 85 insertions(+), 85 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 6675e87eb7dd..41b690eb8c00 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4608,7 +4608,16 @@ static struct folio *alloc_swap_folio(struct vm_faul= t *vmf) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ =20 -static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq); +/* Sanity check that a folio is fully exclusive */ +static void check_swap_exclusive(struct folio *folio, swp_entry_t entry, + unsigned int nr_pages) +{ + /* Called under PT locked and folio locked, the swap count is stable */ + do { + VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) !=3D 1, folio); + entry.val++; + } while (--nr_pages); +} =20 /* * We enter with non-exclusive mmap_lock (to exclude vma changes, @@ -4621,17 +4630,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq); vm_fault_t do_swap_page(struct vm_fault *vmf) { struct vm_area_struct *vma =3D vmf->vma; - struct folio *swapcache, *folio =3D NULL; - DECLARE_WAITQUEUE(wait, current); + struct folio *swapcache =3D NULL, *folio; struct page *page; struct swap_info_struct *si =3D NULL; rmap_t rmap_flags =3D RMAP_NONE; - bool need_clear_cache =3D false; bool exclusive =3D false; softleaf_t entry; pte_t pte; vm_fault_t ret =3D 0; - void *shadow =3D NULL; int nr_pages; unsigned long page_idx; unsigned long address; @@ -4702,57 +4708,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio =3D swap_cache_get_folio(entry); if (folio) swap_update_readahead(folio, vma, vmf->address); - swapcache =3D folio; - if (!folio) { - if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && - __swap_count(entry) =3D=3D 1) { - /* skip swapcache */ + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { folio =3D 
alloc_swap_folio(vmf); if (folio) { - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - - nr_pages =3D folio_nr_pages(folio); - if (folio_test_large(folio)) - entry.val =3D ALIGN_DOWN(entry.val, nr_pages); /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread - * may finish swapin first, free the entry, and - * swapout reusing the same entry. It's - * undetectable as pte_same() returns true due - * to entry reuse. + * folio is charged, so swapin can only fail due + * to raced swapin and return NULL. */ - if (swapcache_prepare(entry, nr_pages)) { - /* - * Relax a bit to prevent rapid - * repeated page faults. - */ - add_wait_queue(&swapcache_wq, &wait); - schedule_timeout_uninterruptible(1); - remove_wait_queue(&swapcache_wq, &wait); - goto out_page; - } - need_clear_cache =3D true; - - memcg1_swapin(entry, nr_pages); - - shadow =3D swap_cache_get_shadow(entry); - if (shadow) - workingset_refault(folio, shadow); - - folio_add_lru(folio); - - /* To provide entry to swap_read_folio() */ - folio->swap =3D entry; - swap_read_folio(folio, NULL); - folio->private =3D NULL; + swapcache =3D swapin_folio(entry, folio); + if (swapcache !=3D folio) + folio_put(folio); + folio =3D swapcache; } } else { - folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, - vmf); - swapcache =3D folio; + folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); } =20 if (!folio) { @@ -4774,6 +4744,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); } =20 + swapcache =3D folio; ret |=3D folio_lock_or_retry(folio, vmf); if (ret & VM_FAULT_RETRY) goto out_release; @@ -4843,24 +4814,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } =20 - /* allocated large folios for SWP_SYNCHRONOUS_IO */ - if (folio_test_large(folio) && !folio_test_swapcache(folio)) { - unsigned long nr =3D folio_nr_pages(folio); - unsigned long folio_start =3D ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); - unsigned long idx =3D (vmf->address - folio_start) / PAGE_SIZE; - pte_t *folio_ptep =3D vmf->pte - idx; - pte_t folio_pte =3D ptep_get(folio_ptep); - - if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || - swap_pte_batch(folio_ptep, nr, folio_pte) !=3D nr) - goto out_nomap; - - page_idx =3D idx; - address =3D folio_start; - ptep =3D folio_ptep; - goto check_folio; - } - nr_pages =3D 1; page_idx =3D 0; address =3D vmf->address; @@ -4904,12 +4857,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio)); BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page)); =20 + /* + * If a large folio already belongs to anon mapping, then we + * can just go on and map it partially. + * If not, with the large swapin check above failing, the page table + * have changed, so sub pages might got charged to the wrong cgroup, + * or even should be shmem. So we have to free it and fallback. + * Nothing should have touched it, both anon and shmem checks if a + * large folio is fully appliable before use. + * + * This will be removed once we unify folio allocation in the swap cache + * layer, where allocation of a folio stabilizes the swap entries. 
+ */ + if (!folio_test_anon(folio) && folio_test_large(folio) && + nr_pages !=3D folio_nr_pages(folio)) { + if (!WARN_ON_ONCE(folio_test_dirty(folio))) + swap_cache_del_folio(folio); + goto out_nomap; + } + /* * Check under PT lock (to protect against concurrent fork() sharing * the swap entry concurrently) for certainly exclusive pages. */ if (!folio_test_ksm(folio)) { + /* + * The can_swapin_thp check above ensures all PTE have + * same exclusiveness. Checking just one PTE is fine. + */ exclusive =3D pte_swp_exclusive(vmf->orig_pte); + if (exclusive) + check_swap_exclusive(folio, entry, nr_pages); if (folio !=3D swapcache) { /* * We have a fresh page that is not exposed to the @@ -4987,18 +4965,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) vmf->orig_pte =3D pte_advance_pfn(pte, page_idx); =20 /* ksm created a completely new copy */ - if (unlikely(folio !=3D swapcache && swapcache)) { + if (unlikely(folio !=3D swapcache)) { folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios which are either - * fully exclusive or fully shared, or new allocated large - * folios which are fully exclusive. If we ever get large - * folios within swapcache here, we have to be careful. + * We currently only expect !anon folios that are fully + * mappable. See the comment after can_swapin_thp above. */ - VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); - VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) !=3D nr_pages, folio); + VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, @@ -5038,12 +5014,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out: - /* Clear the swap cache pin for direct swapin after PTL unlock */ - if (need_clear_cache) { - swapcache_clear(si, entry, nr_pages); - if (waitqueue_active(&swapcache_wq)) - wake_up(&swapcache_wq); - } if (si) put_swap_device(si); return ret; @@ -5051,6 +5021,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out_page: + if (folio_test_swapcache(folio)) + folio_free_swap(folio); folio_unlock(folio); out_release: folio_put(folio); @@ -5058,11 +5030,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_unlock(swapcache); folio_put(swapcache); } - if (need_clear_cache) { - swapcache_clear(si, entry, nr_pages); - if (waitqueue_active(&swapcache_wq)) - wake_up(&swapcache_wq); - } if (si) put_swap_device(si); return ret; diff --git a/mm/swap.h b/mm/swap.h index 0fff92e42cfe..214e7d041030 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -268,6 +268,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t flag, struct mempolicy *mpol, pgoff_t ilx); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf); +struct folio *swapin_folio(swp_entry_t entry, struct folio *folio); void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr); =20 @@ -386,6 +387,11 @@ static inline struct folio *swapin_readahead(swp_entry= _t swp, gfp_t gfp_mask, return NULL; } =20 +static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *= folio) +{ + return NULL; +} + static inline void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr) { 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index a8511ce43242..8c429dc33ca9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -545,6 +545,33 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
 	return result;
 }
 
+/**
+ * swapin_folio - swap-in one or multiple entries skipping readahead.
+ * @entry: starting swap entry to swap in
+ * @folio: a new allocated and charged folio
+ *
+ * Reads @entry into @folio, @folio will be added to the swap cache.
+ * If @folio is a large folio, the @entry will be rounded down to align
+ * with the folio size.
+ *
+ * Return: returns pointer to @folio on success. If folio is a large folio
+ * and this raced with another swapin, NULL will be returned to allow fallback
+ * to order 0. Else, if another folio was already added to the swap cache,
+ * return that swap cache folio instead.
+ */
+struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
+{
+	struct folio *swapcache;
+	pgoff_t offset = swp_offset(entry);
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
+	swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true, false);
+	if (swapcache == folio)
+		swap_read_folio(folio, NULL);
+	return swapcache;
+}
+
 /*
  * Locate a page of swap in physical memory, reserving swap cache space
  * and reading the disk if it is not already cached.
-- 
2.52.0

From nobody Tue Dec 16 05:36:42 2025
From: Kairui Song
Date: Fri, 05 Dec 2025 03:29:12 +0800
Subject: [PATCH v4 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
Message-Id: <20251205-swap-table-p2-v4-4-cb7e28a26a40@tencent.com>
References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song
From: Kairui Song

Now SWP_SYNCHRONOUS_IO devices also use the swap cache. One side effect is that a folio may stay in the swap cache for a longer time due to lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios are swapped out very frequently right after swapin, hence improving performance. But the long pinning of swap slots also increases the fragmentation rate of the swap device significantly, and since all in-tree SWP_SYNCHRONOUS_IO devices are currently RAM disks, it also keeps the backing memory pinned, increasing memory pressure.

So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices after swapin finishes. The swap cache has served its role as a synchronization layer to prevent any parallel swap-in from wasting CPU or memory allocations, and the redundant IO is not a major concern for SWP_SYNCHRONOUS_IO devices.

Worth noting: without this patch, the series so far can provide a ~30% performance gain for certain workloads like MySQL or kernel compilation, but causes significant regressions or OOMs under extreme global pressure. With this patch, we still have a nice performance gain for most workloads, without introducing any observable regressions. This is a hint that further optimization can be done on top of the new unified swapin with swap cache, but for now, keep the behaviour consistent with before.

Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/memory.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 41b690eb8c00..9fb2032772f2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4354,12 +4354,26 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
-static inline bool should_try_to_free_swap(struct folio *folio,
+/*
+ * Check if we should call folio_free_swap to free the swap cache.
+ * folio_free_swap only frees the swap cache to release the slot if swap
+ * count is zero, so we don't need to check the swap count here.
+ */
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+					   struct folio *folio,
 					   struct vm_area_struct *vma,
 					   unsigned int fault_flags)
 {
 	if (!folio_test_swapcache(folio))
 		return false;
+	/*
+	 * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
+	 * cache can help save some IO or memory overhead, but these devices
+	 * are fast, and meanwhile, swap cache pinning the slot deferring the
+	 * release of metadata or fragmentation is a more critical issue.
+	 */
+	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+		return true;
 	if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
 	    folio_test_mlocked(folio))
 		return true;
@@ -4931,7 +4945,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * yet.
 	 */
 	swap_free_nr(entry, nr_pages);
-	if (should_try_to_free_swap(folio, vma, vmf->flags))
+	if (should_try_to_free_swap(si, folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
 	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
-- 
2.52.0

From nobody Tue Dec 16 05:36:42 2025
AJvYcCWVIdKLnl6ZXuptVYDhdzPwgTYac3vNOKcipdNCBIgRM5xOdI2G/Yp0LYL5Te97jStNIt1p3/D6YNssWpE=@vger.kernel.org X-Gm-Message-State: AOJu0YxGwHwdZQr7/9J9Ha/TCCD7xljWqrs67VsjyL2FcocqN3cetkpz dUABYxrOLdfV3U++4httvJo8eTije90jBCNBem1ywWIgCM28cemwJTsK X-Gm-Gg: ASbGncuhXA9Absuw4vxcjYxJUhPEXHissc0BFZeIzR3opJnKbB5/OMv+Q+rN+Mlcw11 yqla43oZSowzDBTnFyANlagQ/LoDK1jt2jR2S5dHMyZJPmlbXw/zJ1i3isLxMgUy5CFTWm2+0sO xiTXY/7cMKC9rHpzmZzmDA48rLYHhIvTfqrRI/cOfVn59QaDVH+CnHdtXZU11wDBY39JyTqgJrs qitF60lhnXYP0v0MMwKVNYntZYXvqLO8iqSOvR5F6MVHk6S44P2zhdbzf7FczmkxWbIXxucDJqj 6SyBWyjhO7YVlqlSPbujvhocDYgD9IIfTY1a4hwOJrD/y0z44+zaLk+A7241wHC0U5++1ALm7xy jdB3g7ftajuXIH5Uz3yfM6eOePp0Bwn4rfEGGtC3pxBsBkvXZ0dZcqSmmTDNC+KSBoHCZT8uWo7 FJ+BukIiodRCRn2Oprg0vwUTBmp2BJMia1SeU+rCe5v3Po+CxR X-Google-Smtp-Source: AGHT+IFnL9Mghpe839LUCxEVjILNs9G2bsHK3lBcPCEC+WiYnmxsRBn7wCZoMetOLy4kczsXt1oQHA== X-Received: by 2002:a17:90b:558c:b0:340:ff89:8b62 with SMTP id 98e67ed59e1d1-34947efa534mr3503288a91.21.1764876602174; Thu, 04 Dec 2025 11:30:02 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.29.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:01 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:13 +0800 Subject: [PATCH v4 05/19] mm, swap: simplify the code and reduce indention Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-5-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=4358; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=7VD9T4L1mbbNr1WJIIQ0GVUukfOHNZ1e5OxrzJG+X+g=; b=NCunfunLugw6YZ2wKhxtkhq7zcfMUiTcTJ1rfwXb5r7PDZpp2ieRFjcRZRJQ/jqRG7Wa5WzzJ WH6TTur95UGDtpuC+5TT7oGvNlBctI8wQp+WkMBLzjLPBgYHKzFyn6O X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Now swap cache is always used, multiple swap cache checks are no longer useful, remove them and reduce the code indention. No behavior change. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 89 +++++++++++++++++++++++++++++----------------------------= ---- 1 file changed, 43 insertions(+), 46 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 9fb2032772f2..3f707275d540 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4764,55 +4764,52 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_release; =20 page =3D folio_file_page(folio, swp_offset(entry)); - if (swapcache) { - /* - * Make sure folio_free_swap() or swapoff did not release the - * swapcache from under us. The page pin, and pte_same test - * below, are not enough to exclude that. Even if it is still - * swapcache, we need to check that the page's swap has not - * changed. 
- */ - if (unlikely(!folio_matches_swap_entry(folio, entry))) - goto out_page; - - if (unlikely(PageHWPoison(page))) { - /* - * hwpoisoned dirty swapcache pages are kept for killing - * owner processes (which may be unknown at hwpoison time) - */ - ret =3D VM_FAULT_HWPOISON; - goto out_page; - } - - /* - * KSM sometimes has to copy on read faults, for example, if - * folio->index of non-ksm folios would be nonlinear inside the - * anon VMA -- the ksm flag is lost on actual swapout. - */ - folio =3D ksm_might_need_to_copy(folio, vma, vmf->address); - if (unlikely(!folio)) { - ret =3D VM_FAULT_OOM; - folio =3D swapcache; - goto out_page; - } else if (unlikely(folio =3D=3D ERR_PTR(-EHWPOISON))) { - ret =3D VM_FAULT_HWPOISON; - folio =3D swapcache; - goto out_page; - } - if (folio !=3D swapcache) - page =3D folio_page(folio, 0); + /* + * Make sure folio_free_swap() or swapoff did not release the + * swapcache from under us. The page pin, and pte_same test + * below, are not enough to exclude that. Even if it is still + * swapcache, we need to check that the page's swap has not + * changed. + */ + if (unlikely(!folio_matches_swap_entry(folio, entry))) + goto out_page; =20 + if (unlikely(PageHWPoison(page))) { /* - * If we want to map a page that's in the swapcache writable, we - * have to detect via the refcount if we're really the exclusive - * owner. Try removing the extra reference from the local LRU - * caches if required. + * hwpoisoned dirty swapcache pages are kept for killing + * owner processes (which may be unknown at hwpoison time) */ - if ((vmf->flags & FAULT_FLAG_WRITE) && folio =3D=3D swapcache && - !folio_test_ksm(folio) && !folio_test_lru(folio)) - lru_add_drain(); + ret =3D VM_FAULT_HWPOISON; + goto out_page; } =20 + /* + * KSM sometimes has to copy on read faults, for example, if + * folio->index of non-ksm folios would be nonlinear inside the + * anon VMA -- the ksm flag is lost on actual swapout. + */ + folio =3D ksm_might_need_to_copy(folio, vma, vmf->address); + if (unlikely(!folio)) { + ret =3D VM_FAULT_OOM; + folio =3D swapcache; + goto out_page; + } else if (unlikely(folio =3D=3D ERR_PTR(-EHWPOISON))) { + ret =3D VM_FAULT_HWPOISON; + folio =3D swapcache; + goto out_page; + } else if (folio !=3D swapcache) + page =3D folio_page(folio, 0); + + /* + * If we want to map a page that's in the swapcache writable, we + * have to detect via the refcount if we're really the exclusive + * owner. Try removing the extra reference from the local LRU + * caches if required. 
+ */ + if ((vmf->flags & FAULT_FLAG_WRITE) && + !folio_test_ksm(folio) && !folio_test_lru(folio)) + lru_add_drain(); + folio_throttle_swaprate(folio, GFP_KERNEL); =20 /* @@ -5002,7 +4999,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) pte, pte, nr_pages); =20 folio_unlock(folio); - if (folio !=3D swapcache && swapcache) { + if (unlikely(folio !=3D swapcache)) { /* * Hold the lock to avoid the swap entry to be reused * until we take the PT lock for the pte_same() check @@ -5040,7 +5037,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_unlock(folio); out_release: folio_put(folio); - if (folio !=3D swapcache && swapcache) { + if (folio !=3D swapcache) { folio_unlock(swapcache); folio_put(swapcache); } --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pf1-f171.google.com (mail-pf1-f171.google.com [209.85.210.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8611934C83D for ; Thu, 4 Dec 2025 19:30:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876609; cv=none; b=YOsQ/ofeh4FdeimwJx41pOKHWTzR99p3/odQ/V1iYX0W1iCfRIJ5edS03n0fmbNeDhIxC5hEclSB6wCg6oK7Jwzm+B46tkvoVSlVrG9wamTPaNdEPQrwU+ZcmQAfkZll+mZpe5xKsL08/Ul3aIl8pNzARowZveNeOZz5T4yFQu0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876609; c=relaxed/simple; bh=oai4UBxh3GEyx6xhRb3I2QqT99Mr+f1uES5spQpBFNY=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=hUoTgdtLznnrQI3Or/vnmU7zbKFLlT3PeFXg7fcnHIVdjfrw9Bp5SdbNhFDlfVdhn4BTphXsmh3KCfV/TZqX8KE9iwsJLi2KNA4OnSsEwfNr/MUsRlyQWyf8t6uY7cUmAC+Z9Ajp45RKHvK0TIii2AgcYHNXbFNWv6JLtgIaxO4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=UL/ffykW; arc=none smtp.client-ip=209.85.210.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="UL/ffykW" Received: by mail-pf1-f171.google.com with SMTP id d2e1a72fcca58-7baf61be569so1605733b3a.3 for ; Thu, 04 Dec 2025 11:30:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876607; x=1765481407; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=m3/znPoavjBgqqjpUZfnq7wwOL79gjtKqNHBOh+8lvo=; b=UL/ffykW4RvwksZSgWY+edf2+pIftiT5SAN7tlAb4+bS/W/Xdl2eyDeSUxG3z/oLQN XCiYEt+87YXlVef7RboiBLceF37Ku5KjEkd5oLiXSQF/cnleeV9n5AHynQFrAebSeUJM 0jtTnG7+EV/cuoHCgIv75d7fnmAlTU3N9gdryH7KcDCq+ESQI65fRIRuxfMd5seN+g17 u9TxX9r2yz26ksaf2WKveknHNuo4qpOwN+cY/e3Lodq9yKPbVPvG4Qf/dbQiz69hjZx7 cBW2oPtRYowGGRo7vI9MaBg8u+lQLhNiY+wyAiBauf3z/hlLWw+iC1j8DfeeXxd2vpVt BqgQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764876607; x=1765481407; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; 
bh=m3/znPoavjBgqqjpUZfnq7wwOL79gjtKqNHBOh+8lvo=; b=HWdXrx5ZJInoWrF/LMlwbiRjJhSv5ezb6jqTNuHpUBunnlhSo71ltM/lU4wJPyO9Kt BbN7Jc+pyRv21iDADYC+Q5DfNLyNiJQGnoiGkvCoEnIsJcCJf3gu2lXDjqEY2GiMBKPV tnfYWKqS7PjL6APZeclVY2Q3G0yaMxHf5vAyoPaWCWC+tyl9rLGsNBjNdNoJkMN1VrvZ xxz7OWvTorzE94ze2rvsArlInMwv7PtY73e0uowfewpn92kUGWyAGTTJi/2UXmq9JFas ZyDqRsJx1sEjOeSoxk+gunypqZfNcmD7mBLbqRDHqSixGHD9U2i0WQBjHauC4f0k5WyY cT9A== X-Forwarded-Encrypted: i=1; AJvYcCW+EXFiL7KRIm0eHNeCyEdqmb5XvKb6p76Bv4zk0JNE8oX+fdXFJuHTENKGNliHAH831kKXPj6jbk+G9jU=@vger.kernel.org X-Gm-Message-State: AOJu0YyinUq+8QsbW5drdUuNhqTotwRh1qam8/eYfCHpSWOXfwEvGr14 LEAvVvVc8WX3KdQn9MveIDQCx/61rI9v70w2o2cMNK8jO0tKaAkSonNe X-Gm-Gg: ASbGncumWK8XMtcp5qbTbrwd91ItEqXFllFPSkMCVrQGl9klyzdhAPn5pg7oLxEcxjg 7cnYg2qCYxJkT2oT2fADXOKM5QkgxzZJWe8z90icBfP22GzyMAHDqvGAeWEL1Yb1v/6AYlLHTdA +1q5kz8DMO6bDiSGlYPpcoMDfHbk0LCBxvp52Gv4R36+aG/SxvFe7eKPIk56wPht98Ui458GS6O 98xPioPeL38Yqa48HzMfaeAvDiqCnnRBUbZzmZHEP42bUlRl5msFHNlg58tckbCfuzdRT/a3wCy id3ZLAVBUkWYp6QaX77hUwOHDaB8CYI5VvIUpAyF4HZdZtQY/pb4jdMwYPbehB5fI923rupxqXw sp3nWSGL1C9SIrFzqq63DJoPsVSfdADVITCUapNHsaQqa9nU95w5RkE5AR0ubBhpOyNOrfij9mz 99NZU7SnM+hYcIR7/M0wscfaJLO2+Xec9tAM9ShSax0PJqvLoU X-Google-Smtp-Source: AGHT+IHlCihpoml1bURIaE9C5o/lmQI/Qptg1W+OPUZo32rvNPAw+oZIUkYT4BZTK0zq/6gCdjTzQw== X-Received: by 2002:a05:6a21:3281:b0:361:3bdb:26df with SMTP id adf61e73a8af0-363f5cd95a3mr9195987637.5.1764876606643; Thu, 04 Dec 2025 11:30:06 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:06 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:14 +0800 Subject: [PATCH v4 06/19] mm, swap: free the swap cache after folio is mapped Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-6-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=2699; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=7ANx1B1sM45AeElSDuuzzNPN1Tv7606509lOkJkHgRE=; b=90zPFXqUoIJKzzeQgR2WIpWPnUlFjo51E3igJ0+jxGSHtxP9gmtjHUDBAh8+8WttPBQRL9Ia8 uE6YjgmVFcuDL4N9jy7qQnDwYpQZ/xV3RYh08HQsk+hxX9t7Xe8KXPW X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song To reduce repeated faults due to parallel swapins of the same PTE, remove the folio from the swap cache after it is mapped. So new faults from the swap PTE will be much more likely to see the folio in the swap cache and wait on it. This does not eliminate all swapin races: an ongoing swapin fault may still see an empty swap cache. That's harmless, as the PTE is changed before the swap cache is cleared, so it will just return and not trigger any repeated faults. 
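To make the ordering argument concrete, here is a small userspace sketch. It is not kernel code: the two booleans stand in for "PTE mapped" and "folio in swap cache", and it assumes the case where the swap cache really is dropped (should_try_to_free_swap() succeeding). It simply enumerates the intermediate states a racing fault can observe under both orderings; only the old order has a window where neither flag is set, which is what forces the racer into a full redundant swapin.

#include <stdio.h>
#include <stdbool.h>

struct state { bool pte_mapped; bool in_swap_cache; };

static void report(const char *order, const struct state *steps, int n)
{
	for (int i = 0; i < n; i++) {
		bool redundant = !steps[i].pte_mapped && !steps[i].in_swap_cache;
		printf("%s step %d: pte=%d cache=%d -> racer %s\n",
		       order, i, steps[i].pte_mapped, steps[i].in_swap_cache,
		       redundant ? "reads from swap again" : "waits or returns");
	}
}

int main(void)
{
	/* Old order: free the swap cache first, then map the PTE. */
	const struct state old_order[] = {
		{ false, true  },	/* folio only in swap cache        */
		{ false, false },	/* cache freed, PTE not mapped yet */
		{ true,  false },	/* PTE mapped                      */
	};
	/* New order (this patch): map the PTE first, then free the cache. */
	const struct state new_order[] = {
		{ false, true  },
		{ true,  true  },	/* PTE mapped, folio still cached */
		{ true,  false },
	};

	report("old", old_order, 3);
	report("new", new_order, 3);
	return 0;
}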
Signed-off-by: Kairui Song Reviewed-by: Baoquan He Suggested-by: Chris Li --- mm/memory.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 3f707275d540..ce9f56f77ae5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struc= t vm_fault *vmf) static inline bool should_try_to_free_swap(struct swap_info_struct *si, struct folio *folio, struct vm_area_struct *vma, + unsigned int extra_refs, unsigned int fault_flags) { if (!folio_test_swapcache(folio)) @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swa= p_info_struct *si, * reference only in case it's likely that we'll be the exclusive user. */ return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) && - folio_ref_count(folio) =3D=3D (1 + folio_nr_pages(folio)); + folio_ref_count(folio) =3D=3D (extra_refs + folio_nr_pages(folio)); } =20 static vm_fault_t pte_marker_clear(struct vm_fault *vmf) @@ -4936,15 +4937,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ arch_swap_restore(folio_swap(entry, folio), folio); =20 - /* - * Remove the swap entry and conditionally try to free up the swapcache. - * We're already holding a reference on the page but haven't mapped it - * yet. - */ - swap_free_nr(entry, nr_pages); - if (should_try_to_free_swap(si, folio, vma, vmf->flags)) - folio_free_swap(folio); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); pte =3D mk_pte(page, vma->vm_page_prot); @@ -4998,6 +4990,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) arch_do_swap_page_nr(vma->vm_mm, vma, address, pte, pte, nr_pages); =20 + /* + * Remove the swap entry and conditionally try to free up the swapcache. + * Do it after mapping, so raced page faults will likely see the folio + * in swap cache and wait on the folio lock. 
+ */ + swap_free_nr(entry, nr_pages); + if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags)) + folio_free_swap(folio); + folio_unlock(folio); if (unlikely(folio !=3D swapcache)) { /* --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pf1-f169.google.com (mail-pf1-f169.google.com [209.85.210.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E05D134C9AE for ; Thu, 4 Dec 2025 19:30:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.169 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876613; cv=none; b=SqFE7e0t0OMUv3oiRAK4HhVILE0dv78T212mUQxdn4Y+cvbuwqK6OcM9y4LzmLZIy8RsdW7d1IokMKAGEWCnkPB0Aiid+whca+4ISnB7ggqOkzI2kMUlIgHJp8CINvHOOBuSckAva5p2eVb2RRIjJaCmrpC1eHsNw8gpAuTJNHI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876613; c=relaxed/simple; bh=VpmU7dnhJ7AtOmuxOiwPT77Wg1F9Y9a9Zu8+/fWt1Pk=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=MY858qIfM/wi7Hh+VLW2XMJAyvnncuEtnp8o2DQZ4mxRQ55fxAetUlb7WzTS1SsYpbb0cYOdDOY5CcitpjuuqTOqRBWGwuw60gmrzJBF4mFveGEjAQxWNpMg41pf6oE/N5WXnd+GEZaXsF2lZDy9NzVUy9FyJH++ZTmKcOxBWVw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=Rf7ld7EU; arc=none smtp.client-ip=209.85.210.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Rf7ld7EU" Received: by mail-pf1-f169.google.com with SMTP id d2e1a72fcca58-7b8e49d8b35so1539353b3a.3 for ; Thu, 04 Dec 2025 11:30:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876611; x=1765481411; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=sAt+yjHK5qQ7D8m5tBR+wGQlT8hgZz0lqkymiFojj+o=; b=Rf7ld7EUlAmJliyRljjrjEUClXIs60r9SCxXB35b9Iz0HfGE/DwEelThUzt0VlKe35 /d8y9CG41s5PkTPq+DXAng535RtwrCUQwudAkT1t1/OZmyqwQKgLWOVmcnzezfe55LJv +5LV3sca9gdKAHEacePB+3s3bV+HglBo45eb7VpSzzQdOGXPZJnlNNJIMCquCmCDC0di T6rdRlPfahVVmIBxQRBg/Mcg+QJhHXMIcvtjc+QOaQQreQTgfq2M7ITHfmKfvgEvUdaa HfXWy47eVXEIK60xyeHj0nkPcE9OE6ogdh5IK0VBugu05wWzwBgwolG+5ExUASAjwng5 29pQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764876611; x=1765481411; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=sAt+yjHK5qQ7D8m5tBR+wGQlT8hgZz0lqkymiFojj+o=; b=eRb70nmjf/8P3khTVp9Db+uvnT6mLuB6rKhSVed2Fz6dexmiLhnuWMetMylQlA1cTZ s9P4K+09/L7v/x4VpDxAb4mT9YWCNz4mUkYeflR+E9fr/CYy16pfMUzZQxM0WnWuIYwQ 9OJ/7t4ieUgv118ubrnMXtSWJpwSF/q5r9HQzvBSyfD38sEap+/MN72dBvboUayReMZK iQYfAQ/EpwG32gEftivuxMETfRZ8EUAdE1csx4Gv5cDDPLt6anrMnuRH3PdS/Xg4g/1R NLZyOqLsof5Ti8aUt6OIcu2euNp/qetch083T7soIkwWvMSB0yk9GHTQrr+kq1gkL0o5 RKHg== X-Forwarded-Encrypted: i=1; AJvYcCVL+/wvyXRgOY6RGWW5BBQeRG8JLcoTBpXODYfhK+/Y56XLeT12B7I12R0QuNpMOYoGmiqpARQ5HkcjIt4=@vger.kernel.org 
X-Gm-Message-State: AOJu0YykBOdY/b0FEAD4TsfeLIn1hv/JYauMGVpL68lJDyweXjmSHRWz U3tDy/n6HBz/eTa4mYGzeNU3fM6aWVijX2WzD2AdkxLkLLKCgz7rCGHG X-Gm-Gg: ASbGnctxUh22GhqULxoGMrpQQrH7onjDiG7/2HK94uLNkrY3hCAAhlfU7OlrR6K1aqS 7XuVaNUDBPfy0FkzdH5Uk++SV1fTaTdxCLB78QC/KNp4bi9D73aUal/mxHF+lGDWmJ52cyAFgMn cbuqWvfXddOP+nb9qMdSFiEffSqu0PLLRZE96A+lq936+hOJKzvTnnSMI6Yd7LuoGqnuHLeptTr h+50xH4yzKc0VmGqY4MUj8BxIrhDjdrEIcYOpmPapZfFHkZl0HEGVUEJRPLNpEYmPhh1IaBTQpQ VSRutYerIwKITtqELftpz6bYua6ARDv8znqbDQBfq47blV1LgkM0AqtgiDDhjxVXwPKYIILXGSM xImMDGqruJYAeB77ECYsdqAy+7zgzbiUAgeynuxy/VDbQ+cdnOnwRCGJWsRSJmRhvjS8DBPUXJX 4I5vOdhLpPiSNQKx9JHb+Ub7POsLVGTvNO5qncNId1HewDT7Zd X-Google-Smtp-Source: AGHT+IFird7IDAtp3z6cM38SwIz7DzceXHZjQT0hHj1JkWSs1Yleu/kOjAP8i1LO8Znvzpyb60f+vw== X-Received: by 2002:a05:6a20:a11e:b0:2e5:655c:7f93 with SMTP id adf61e73a8af0-363f5e6f5a3mr9553698637.33.1764876611072; Thu, 04 Dec 2025 11:30:11 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:10 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:15 +0800 Subject: [PATCH v4 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-7-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=8383; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=IYfI3KkZoVNQ86T3ocsCHfwQQaumhNpA42xiQvDymTk=; b=49OGMjEFd0WljHDwp6ddjbasyOv2SjOP3yKoOjFMTosQmtuhrE3wDGlIZCtGx0P+ex8ECbZ6y QNzkyzflYt3AFCWDKegarXhBec7Sp5u2M/jiUiI65OjkYrO9gfD3eN2 X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Now the overhead of the swap cache is trivial to none, bypassing the swap cache is no longer a good optimization. We have removed the cache bypass swapin for anon memory, now do the same for shmem. Many helpers and functions can be dropped now. The performance may slightly drop because of the co-existence and double update of swap_map and swap table, and this problem will be improved very soon in later commits by dropping the swap_map update partially: Swapin of 24 GB file with tmpfs with transparent_hugepage_tmpfs=3Dwithin_size and ZRAM, 3 test runs on my machine: Before: After this commit: After this series: 5.99s 6.29s 6.08s And later swap table phases drop the swap_map completely to avoid overhead and reduce memory usage. 
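For perspective, the timings quoted above work out to roughly a 5% slowdown from this commit alone and about 1.5% once the whole series is applied. A trivial snippet reproducing that arithmetic from the quoted numbers:

#include <stdio.h>

int main(void)
{
	const double before = 5.99, after_commit = 6.29, after_series = 6.08;

	printf("after this commit: %+.1f%%\n",
	       (after_commit - before) / before * 100.0);	/* about +5.0% */
	printf("after the series:  %+.1f%%\n",
	       (after_series - before) / before * 100.0);	/* about +1.5% */
	return 0;
}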
Reviewed-by: Baolin Wang Tested-by: Baolin Wang Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/shmem.c | 65 +++++++++++++++++--------------------------------------= ---- mm/swap.h | 4 ---- mm/swapfile.c | 35 +++++++++----------------------- 3 files changed, 27 insertions(+), 77 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index ad18172ff831..d08248fd67ff 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2001,10 +2001,9 @@ static struct folio *shmem_swap_alloc_folio(struct i= node *inode, swp_entry_t entry, int order, gfp_t gfp) { struct shmem_inode_info *info =3D SHMEM_I(inode); + struct folio *new, *swapcache; int nr_pages =3D 1 << order; - struct folio *new; gfp_t alloc_gfp; - void *shadow; =20 /* * We have arrived here because our zones are constrained, so don't @@ -2044,34 +2043,19 @@ static struct folio *shmem_swap_alloc_folio(struct = inode *inode, goto fallback; } =20 - /* - * Prevent parallel swapin from proceeding with the swap cache flag. - * - * Of course there is another possible concurrent scenario as well, - * that is to say, the swap cache flag of a large folio has already - * been set by swapcache_prepare(), while another thread may have - * already split the large swap entry stored in the shmem mapping. - * In this case, shmem_add_to_page_cache() will help identify the - * concurrent swapin and return -EEXIST. - */ - if (swapcache_prepare(entry, nr_pages)) { + swapcache =3D swapin_folio(entry, new); + if (swapcache !=3D new) { folio_put(new); - new =3D ERR_PTR(-EEXIST); - /* Try smaller folio to avoid cache conflict */ - goto fallback; + if (!swapcache) { + /* + * The new folio is charged already, swapin can + * only fail due to another raced swapin. + */ + new =3D ERR_PTR(-EEXIST); + goto fallback; + } } - - __folio_set_locked(new); - __folio_set_swapbacked(new); - new->swap =3D entry; - - memcg1_swapin(entry, nr_pages); - shadow =3D swap_cache_get_shadow(entry); - if (shadow) - workingset_refault(new, shadow); - folio_add_lru(new); - swap_read_folio(new, NULL); - return new; + return swapcache; fallback: /* Order 0 swapin failed, nothing to fallback to, abort */ if (!order) @@ -2161,8 +2145,7 @@ static int shmem_replace_folio(struct folio **foliop,= gfp_t gfp, } =20 static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t inde= x, - struct folio *folio, swp_entry_t swap, - bool skip_swapcache) + struct folio *folio, swp_entry_t swap) { struct address_space *mapping =3D inode->i_mapping; swp_entry_t swapin_error; @@ -2178,8 +2161,7 @@ static void shmem_set_folio_swapin_error(struct inode= *inode, pgoff_t index, =20 nr_pages =3D folio_nr_pages(folio); folio_wait_writeback(folio); - if (!skip_swapcache) - swap_cache_del_folio(folio); + swap_cache_del_folio(folio); /* * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks) @@ -2279,7 +2261,6 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, softleaf_t index_entry; struct swap_info_struct *si; struct folio *folio =3D NULL; - bool skip_swapcache =3D false; int error, nr_pages, order; pgoff_t offset; =20 @@ -2322,7 +2303,6 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, folio =3D NULL; goto failed; } - skip_swapcache =3D true; } else { /* Cached swapin only supports order 0 folio */ folio =3D shmem_swapin_cluster(swap, gfp, info, index); @@ -2378,9 +2358,8 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, * and swap cache folios are never partially freed. 
*/ folio_lock(folio); - if ((!skip_swapcache && !folio_test_swapcache(folio)) || - shmem_confirm_swap(mapping, index, swap) < 0 || - folio->swap.val !=3D swap.val) { + if (!folio_matches_swap_entry(folio, swap) || + shmem_confirm_swap(mapping, index, swap) < 0) { error =3D -EEXIST; goto unlock; } @@ -2412,12 +2391,7 @@ static int shmem_swapin_folio(struct inode *inode, p= goff_t index, if (sgp =3D=3D SGP_WRITE) folio_mark_accessed(folio); =20 - if (skip_swapcache) { - folio->swap.val =3D 0; - swapcache_clear(si, swap, nr_pages); - } else { - swap_cache_del_folio(folio); - } + swap_cache_del_folio(folio); folio_mark_dirty(folio); swap_free_nr(swap, nr_pages); put_swap_device(si); @@ -2428,14 +2402,11 @@ static int shmem_swapin_folio(struct inode *inode, = pgoff_t index, if (shmem_confirm_swap(mapping, index, swap) < 0) error =3D -EEXIST; if (error =3D=3D -EIO) - shmem_set_folio_swapin_error(inode, index, folio, swap, - skip_swapcache); + shmem_set_folio_swapin_error(inode, index, folio, swap); unlock: if (folio) folio_unlock(folio); failed_nolock: - if (skip_swapcache) - swapcache_clear(si, folio->swap, folio_nr_pages(folio)); if (folio) folio_put(folio); put_swap_device(si); diff --git a/mm/swap.h b/mm/swap.h index 214e7d041030..e0f05babe13a 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -403,10 +403,6 @@ static inline int swap_writeout(struct folio *folio, return 0; } =20 -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_= t entry, int nr) -{ -} - static inline struct folio *swap_cache_get_folio(swp_entry_t entry) { return NULL; diff --git a/mm/swapfile.c b/mm/swapfile.c index e5284067a442..3762b8f3f9e9 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1614,22 +1614,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t= entry) return NULL; } =20 -static void swap_entries_put_cache(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - unsigned long offset =3D swp_offset(entry); - struct swap_cluster_info *ci; - - ci =3D swap_cluster_lock(si, offset); - if (swap_only_has_cache(si, offset, nr)) { - swap_entries_free(si, ci, entry, nr); - } else { - for (int i =3D 0; i < nr; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); - } - swap_cluster_unlock(ci); -} - static bool swap_entries_put_map(struct swap_info_struct *si, swp_entry_t entry, int nr) { @@ -1765,13 +1749,21 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) void put_swap_folio(struct folio *folio, swp_entry_t entry) { struct swap_info_struct *si; + struct swap_cluster_info *ci; + unsigned long offset =3D swp_offset(entry); int size =3D 1 << swap_entry_order(folio_order(folio)); =20 si =3D _swap_info_get(entry); if (!si) return; =20 - swap_entries_put_cache(si, entry, size); + ci =3D swap_cluster_lock(si, offset); + if (swap_only_has_cache(si, offset, size)) + swap_entries_free(si, ci, entry, size); + else + for (int i =3D 0; i < size; i++, entry.val++) + swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + swap_cluster_unlock(ci); } =20 int __swap_count(swp_entry_t entry) @@ -3784,15 +3776,6 @@ int swapcache_prepare(swp_entry_t entry, int nr) return __swap_duplicate(entry, SWAP_HAS_CACHE, nr); } =20 -/* - * Caller should ensure entries belong to the same folio so - * the entries won't span cross cluster boundary. 
- */ -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int n= r) -{ - swap_entries_put_cache(si, entry, nr); -} - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entr= y's --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pf1-f171.google.com (mail-pf1-f171.google.com [209.85.210.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 881B434D38E for ; Thu, 4 Dec 2025 19:30:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876619; cv=none; b=Q5kQfN0xuLRETVvrg3WfVqH1kre9QhmrErNCOHxwxTNCP08Bxsr45ibYzJ5n7IsA/RF04WZfYqC5Bi/WiCTuk7eoNrKejFF5bYz+bhU8T1cjaw2KmER49ABnTGn+Y2elTZZQC0+mUhkY0Y59aUX4t4JAD2Ccy2NZjqF+9nhEhBY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876619; c=relaxed/simple; bh=SY5bgYjGBbyPekog53bcEmm1Y9UCprRDCy3o9B04q9E=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=buuOKuc0FAEoRI1VqZZ8ejPKw9RWyl5X1obuwh+K9+DLIW8M+PsDJ7/4Ot3ksbBNW6EJw9z4hA+ALRe72/Maqoy6kvmb7DECEY9QU4WIbgQmhTGJYoeoO1B4QiQ2i3KEZ9bumE2JPOLFPAmgF3KWQqT7I6GBWDjHkRwUjng05bM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=IYgTCDYu; arc=none smtp.client-ip=209.85.210.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="IYgTCDYu" Received: by mail-pf1-f171.google.com with SMTP id d2e1a72fcca58-7baf61be569so1605937b3a.3 for ; Thu, 04 Dec 2025 11:30:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876616; x=1765481416; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=3tLTYBb1R8OZr/Q1lTi3HD1NNbsd1epvJzg5LKD0iOs=; b=IYgTCDYuZUe288uOOKl2tCMnm8S/FWcdhAGLN7E9aV4f16intU61+JL20QzcH6q7hY JG4dXg3NHDEX1XerK2UhqUSiRzksx8a8xzivUSWM4KXYnmINFgiiXcu40/ShIWpgXtzq tvZeouDDBOunGyQu4Z0ZClknByXtMNHb/VtiKaHFEuiii0YaJ034+XHygCRd2QLiq3Yq /UEFUKDzETjCBmUC6G0Wq/KAudUhbkb9O8iVpLzJfKZtyW7XIxYJegYLPZ6lwaeGR2U3 plQ3Z2Ncfcj6yng8FWZJdttxBQAxdjehr7w2b5d2kp00AXef+N0cZ0bCSwHVZlLNVJCh 9QzQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764876616; x=1765481416; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=3tLTYBb1R8OZr/Q1lTi3HD1NNbsd1epvJzg5LKD0iOs=; b=q0+Xh1OjfzysKINBocTuGXLoYuQXTXKW+ikArHiRXwKnApb/+ICcxPydZVOfbfnXgI hTkRwjSLnbIar5ElvWz6YyM91Fi2W2MH0eZai+/Jm213buQ1NpySkbeihXa5vvTjm4nx Karzrl4iyvWkcXRuySgil8BWwwORFC5+aoH4z0CiPVvdVq313Z2sb6dl8+dbzFBjQIjM Ncn55WrEBsAUWL2xePJoHM4D/V1XxpaK9W09It7X5jZSQbt+LrolqYLAWza9nQR5dHYW uzCIt3x7ranc22lvMD49pNjbJJMCsiS2NBVj5TbkLWqkOAQmhrpgr2ulSVYRxHM80Cxd t4gw== X-Forwarded-Encrypted: i=1; 
AJvYcCXH3mzDJXG/SYBgGkxQCCkaVRdgrnWNSiyiHQjKL5TOgK8gn9SK6N+NFMAI1W/Mzi/6WeJGb/iPOxWFMUY=@vger.kernel.org X-Gm-Message-State: AOJu0YyC1P14pTvrreaIdzxCNc0kz8TRs8PeTFo++W9/gX0/He4e54jP DKZLiht27vXB0Xhkvrpo3ke3Y1hCF4OBSh5xj1WF1HE5jyomKqL/g5xh X-Gm-Gg: ASbGncuJG8zsH9XakQuFYngIpdOHCoOHJPkJKheijH6AOWn3gi6KDnZxsa3LFFFze3H dZrCNCv7qT0wlrbTlh7pvZBihtF7bpHCuOyX0wpW5AZaY9/iQkWXNcuJF9nR+XRa23bohJRD/KB q+p40i7uaySFqG3yuTU7SD/1dJQfXe97abPFl9FDkGG5Edb7QrL6t0SgOykgkXKLZwx7nr5p6jz AOI/r1Fxw0ebQ8sXAtbpS5eqn8RkEBUTGik5M84fOG4cRXl5y3d2P4e6U5aguw/Y/bxPFYOTQlY qNpCxi6K94NQ4Mq4ZN+bdeZEbpI9ahTZjkjAtHUZWVK9Ch+lz10YdhiOl1LRPDz4G7KZGeXEsmX kWXHnGyBtzjvpAmnNxs7JX7WqX4rDutmuyp7YpoiXy27g+4WtoLDshcFLYosWyEVulOh6oiy0Kk py3ynB4oX15l6T3q30UDApuRzKjDgzFbLTNCVzTfjA0Xn7KtUlHY7Qhkf2MtY= X-Google-Smtp-Source: AGHT+IHvJYLo8F8+iVs5g+AC7Z1DpoGJWGbAAjlkPHHFpXjznCercQgieQwO0K2vl+49n6Q84tHoRQ== X-Received: by 2002:a05:6a20:748d:b0:35e:a390:ca13 with SMTP id adf61e73a8af0-363f5e953eemr9424236637.57.1764876615737; Thu, 04 Dec 2025 11:30:15 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:15 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:16 +0800 Subject: [PATCH v4 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-8-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=7240; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=TGM4pt6Vv98jEk/nHCqFhIlNFGrXBJCep7RNsaXlLtY=; b=2ZtxbooaUmEm69Q7j6Un4XL+5k8vWv15mF1yXP0Jzx5HPPvYKOBqYBDrzdiF8Ky+e19Y5W3Fs 8SIvKR6OGpMCZCbHPiV6hmppOtDCIKekU+9Jm3Vnk+caHaNmLi5r/sT X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Nhat Pham The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry belongs to shmem during swapoff. However, swapoff has since been rewritten in the commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having swap count =3D=3D SWAP_MAP_SHMEM value is basically the same as having swap count =3D=3D 1, and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only difference of note is that swap_shmem_alloc() does not check for -ENOMEM returned from __swap_duplicate(), but it is OK because shmem never re-duplicates any swap entry it owns. This will stil be safe if we use (batched) swap_duplicate() instead. This commit adds swap_duplicate_nr(), the batched variant of swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated swap_shmem_alloc() helper to simplify the state machine (both mentally and in terms of actual code). 
We will also have an extra state/special value that can be repurposed (for swap entries that never gets re-duplicated). Signed-off-by: Nhat Pham Reviewed-by: Baolin Wang Tested-by: Baolin Wang Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 15 +++++++-------- mm/shmem.c | 2 +- mm/swapfile.c | 42 +++++++++++++++++------------------------- 3 files changed, 25 insertions(+), 34 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 38ca3df68716..bf72b548a96d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -230,7 +230,6 @@ enum { /* Special value in first swap_map */ #define SWAP_MAP_MAX 0x3e /* Max count */ #define SWAP_MAP_BAD 0x3f /* Note page is bad */ -#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */ =20 /* Special value in each swap_map continuation */ #define SWAP_CONT_MAX 0x7f /* Max count */ @@ -458,8 +457,7 @@ bool folio_free_swap(struct folio *folio); void put_swap_folio(struct folio *folio, swp_entry_t entry); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern void swap_shmem_alloc(swp_entry_t, int); -extern int swap_duplicate(swp_entry_t); +extern int swap_duplicate_nr(swp_entry_t entry, int nr); extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); @@ -514,11 +512,7 @@ static inline int add_swap_count_continuation(swp_entr= y_t swp, gfp_t gfp_mask) return 0; } =20 -static inline void swap_shmem_alloc(swp_entry_t swp, int nr) -{ -} - -static inline int swap_duplicate(swp_entry_t swp) +static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages) { return 0; } @@ -569,6 +563,11 @@ static inline int add_swap_extent(struct swap_info_str= uct *sis, } #endif /* CONFIG_SWAP */ =20 +static inline int swap_duplicate(swp_entry_t entry) +{ + return swap_duplicate_nr(entry, 1); +} + static inline void free_swap_and_cache(swp_entry_t entry) { free_swap_and_cache_nr(entry, 1); diff --git a/mm/shmem.c b/mm/shmem.c index d08248fd67ff..eb9bd9241f99 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1654,7 +1654,7 @@ int shmem_writeout(struct folio *folio, struct swap_i= ocb **plug, spin_unlock(&shmem_swaplist_lock); } =20 - swap_shmem_alloc(folio->swap, nr_pages); + swap_duplicate_nr(folio->swap, nr_pages); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); =20 BUG_ON(folio_mapped(folio)); diff --git a/mm/swapfile.c b/mm/swapfile.c index 3762b8f3f9e9..e23287c06f1c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -201,7 +201,7 @@ static bool swap_is_last_map(struct swap_info_struct *s= i, unsigned char *map_end =3D map + nr_pages; unsigned char count =3D *map; =20 - if (swap_count(count) !=3D 1 && swap_count(count) !=3D SWAP_MAP_SHMEM) + if (swap_count(count) !=3D 1) return false; =20 while (++map < map_end) { @@ -1523,12 +1523,6 @@ static unsigned char swap_entry_put_locked(struct sw= ap_info_struct *si, if (usage =3D=3D SWAP_HAS_CACHE) { VM_BUG_ON(!has_cache); has_cache =3D 0; - } else if (count =3D=3D SWAP_MAP_SHMEM) { - /* - * Or we could insist on shmem.c using a special - * swap_shmem_free() and free_shmem_swap_and_cache()... 
- */ - count =3D 0; } else if ((count & ~COUNT_CONTINUED) <=3D SWAP_MAP_MAX) { if (count =3D=3D COUNT_CONTINUED) { if (swap_count_continued(si, offset, count)) @@ -1626,7 +1620,7 @@ static bool swap_entries_put_map(struct swap_info_str= uct *si, if (nr <=3D 1) goto fallback; count =3D swap_count(data_race(si->swap_map[offset])); - if (count !=3D 1 && count !=3D SWAP_MAP_SHMEM) + if (count !=3D 1) goto fallback; =20 ci =3D swap_cluster_lock(si, offset); @@ -1680,12 +1674,10 @@ static bool swap_entries_put_map_nr(struct swap_inf= o_struct *si, =20 /* * Check if it's the last ref of swap entry in the freeing path. - * Qualified value includes 1, SWAP_HAS_CACHE or SWAP_MAP_SHMEM. */ static inline bool __maybe_unused swap_is_last_ref(unsigned char count) { - return (count =3D=3D SWAP_HAS_CACHE) || (count =3D=3D 1) || - (count =3D=3D SWAP_MAP_SHMEM); + return (count =3D=3D SWAP_HAS_CACHE) || (count =3D=3D 1); } =20 /* @@ -3678,7 +3670,6 @@ static int __swap_duplicate(swp_entry_t entry, unsign= ed char usage, int nr) =20 offset =3D swp_offset(entry); VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - VM_WARN_ON(usage =3D=3D 1 && nr > 1); ci =3D swap_cluster_lock(si, offset); =20 err =3D 0; @@ -3738,27 +3729,28 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) return err; } =20 -/* - * Help swapoff by noting that swap entry belongs to shmem/tmpfs - * (in which case its reference count is never incremented). - */ -void swap_shmem_alloc(swp_entry_t entry, int nr) -{ - __swap_duplicate(entry, SWAP_MAP_SHMEM, nr); -} - -/* - * Increase reference count of swap entry by 1. +/** + * swap_duplicate_nr() - Increase reference count of nr contiguous swap en= tries + * by 1. + * + * @entry: first swap entry from which we want to increase the refcount. + * @nr: Number of entries in range. + * * Returns 0 for success, or -ENOMEM if a swap_count_continuation is requi= red * but could not be atomically allocated. Returns 0, just as if it succee= ded, * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), wh= ich * might occur if a page table entry has got corrupted. + * + * Note that we are currently not handling the case where nr > 1 and we ne= ed to + * add swap count continuation. This is OK, because no such user exists - = shmem + * is the only user that can pass nr > 1, and it never re-duplicates any s= wap + * entry it owns. 
*/ -int swap_duplicate(swp_entry_t entry) +int swap_duplicate_nr(swp_entry_t entry, int nr) { int err =3D 0; =20 - while (!err && __swap_duplicate(entry, 1, 1) =3D=3D -ENOMEM) + while (!err && __swap_duplicate(entry, 1, nr) =3D=3D -ENOMEM) err =3D add_swap_count_continuation(entry, GFP_ATOMIC); return err; } --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pf1-f177.google.com (mail-pf1-f177.google.com [209.85.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 77F3C34D4FC for ; Thu, 4 Dec 2025 19:30:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.177 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876624; cv=none; b=DlbOEgoS+6LWTD90b/KPtqmleGILccGdbffUfIN0tbSEd9oIZe1MdXDu2gejC+oUXYUBf1Itn4y6Shhtj0wAproD4IoY3OAX3zOfm+GetezzOZH8IBIyC9hbvLhos5gdE8GSKEOQCV/YhUGD/9KdTVEBPdMcaGZYO1oXXRay2Qo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876624; c=relaxed/simple; bh=F524m90C6Fx+yH0qVBU9xvhQ1OhmohTeUS/8yQTKZwU=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=jbNQlNfodvLTh+Uq29gUfXoSCi43s6MfK3lA1aJaEjLuBQDuKnVIwupMV5lmYM5fhuxBDxohvubvuEehX2LYaRD8uTLh32fzwUhZQxphQZAlFmVjj387/jK++5zJWpMj6Lp/JMJ5C5x7PaTJG4Kzp3QEWHjty212LOeYqlsq/bc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=EXL05Ixq; arc=none smtp.client-ip=209.85.210.177 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="EXL05Ixq" Received: by mail-pf1-f177.google.com with SMTP id d2e1a72fcca58-7b9387df58cso2046886b3a.3 for ; Thu, 04 Dec 2025 11:30:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876620; x=1765481420; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=MfXmgn+mF1Fs5ynI6NKuBVi8g4Y7O33miAkqyHznxPk=; b=EXL05Ixqgs+6bK7HVz4O4bSIQw0EKnnJDHwu4Om6l2VGmLXGR93vHKxyP8xURjty63 WavMEkurxfhjO+W2Jmfh+sS7kRz2Zfnw+oTyQKFg5CbFGD2+aH75gv6bgyDl5PZrM8xO yhQCi88VS+gZY2zWAqTsx+DdOyH4C8GntNJEta4fBkkjx9T/6n4KhAXwktK3IYcPRzYD 0DlJ8fF1LDiTI8P57rjX8aXGWEhvouw+KNyPSKo2qfeEw0NNuaiLBVbIZI5lLmgCSbx6 1YSoSITNai5PdVU+Vkq8bVElVEsqejWZgtefOMIMr3wAGLio8dXGdEuefa3XKsqqpzhg 3v8w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764876620; x=1765481420; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=MfXmgn+mF1Fs5ynI6NKuBVi8g4Y7O33miAkqyHznxPk=; b=FLj12sHieWJaR9YkPQCMFBcXL3FzpqcH2QETn/IxjO7xRcsK8xNTLVWfuVoPH08Kwg gF3hDRBblW6Sx/gCoazDxSIKHf8wAl2FCrWwMpBUComU6JkxGWQ8rF3zreSTf6ev4Fn9 OG3MHv70yKsz2jh8BtNerIgJBYRqdHTYFbMre0EHrfrxjJvkzv2t0y4G+jIo99Hdag76 RlG6/ZvK+EAo6F3s5H7E0BNyqbogJKtmbjxX1KiuE8dcnoyP6udNq79yZ6pCoGckX/L5 QGvPWStFqLqgA2lLFbwxyxmmANS77zIH0GaxKZpC8TnqjAQtNHLP1otzxCeb7gTyxiSQ APYA== X-Forwarded-Encrypted: i=1; 
AJvYcCXUfDCr7voDvjifCVQL6EumbJNFrfZNFBXiU4TSSIgB2iwQG0ZRtR92IDmYAAAqGyjzJ0PGXIpeSR+y8mg=@vger.kernel.org X-Gm-Message-State: AOJu0YyZQC/uerFz8IY5PIPU+ZPFeI73VhCsqXx3LC+r3K5CJtUkq0m2 r2VZeTzeboEhonZ4rxBQHlHKyP/F3fEzgRifUygLO9FjigTi/I4g1v2f X-Gm-Gg: ASbGncvdxcWomHbdctToWn+DUk/vMXh4nY67fKUoGbYt3Q77b+76GpR8vAjGMWHS4OZ nmqoGsCGX2ju3JiLErPU3EgIzfzUE3jwszVYc5K6/9Ih1oC2abMcy6r3+DfxM8H2qf8Z6qTA3+j zvAJ3bbBY5boHi1DmO9AUO/K8wFUk9qhNFzvGQKTArQxGJ0LcvEeW+7BA1AIDk04/IhhjbVIcJk hkYwMa/Gv5M1E199tSUNgjwRO94Iramyl250DW1NomoOgkV+YezG+DWs9s9x1XmQ/882Hxs8h3W MttoJ/Cd20RhbVhuI6ZtJw6yXh8xJEgnI8jOwIq/AnudmiZsEu6aicCLqtA4A8T1J2utWgNs+zf 2FxkFgGed7bijWFlOQJ6t4A5g3s1FNZGfs/4OKnohKLElWRO18iWoNRI13aQHxfUFbAj7kOs/i8 vYrDYwsCecG5IkC+lC56Xn5IO892LT6cNtgzm7ICDJirB/aECx X-Google-Smtp-Source: AGHT+IEUWrwHCnrhMjQORd2GgwZECUW61fOPvfGX6b9uWWfuqrPKPIwXKaylq+ZvKLERLAGax0X09A== X-Received: by 2002:a05:6a20:a11a:b0:34f:1623:2354 with SMTP id adf61e73a8af0-3640387e178mr5129120637.42.1764876620236; Thu, 04 Dec 2025 11:30:20 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:19 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:17 +0800 Subject: [PATCH v4 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-9-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=4721; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=gH+fgrUx68nGGL/szu90Y/dtzwtL/bZYzwxE+J7xnUA=; b=2q56WTzht4vUrFKpHdhgGpReADKRqdNRfeOvZzOu5S1rtV+JTT3jfSolWpfl6BtzzwdfEN8Cy 8K6i5p4Zm96CiKI35b3cQALEp3Gjy7GE/qs6EHQ8hGD803rPXvxNKQy X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song When checking if a swap entry is swapped out, we simply check if the bitwise result of the count value is larger than 0. But SWAP_MAP_BAD will also be considered as a swao count value larger than 0. SWAP_MAP_BAD being considered as a count value larger than 0 is useful for the swap allocator: they will be seen as a used slot, so the allocator will skip them. But for the swapped out check, this isn't correct. There is currently no observable issue. The swapped out check is only useful for readahead and folio swapped-out status check. For readahead, the swap cache layer will abort upon checking and updating the swap map. For the folio swapped out status check, the swap allocator will never allocate an entry of bad slots to folio, so that part is fine too. The worst that could happen now is redundant allocation/freeing of folios and waste CPU time. This also makes it easier to get rid of swap map checking and update during folio insertion in the swap cache layer. 
Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 6 ++++-- mm/swap_state.c | 4 ++-- mm/swapfile.c | 22 +++++++++++----------- 3 files changed, 17 insertions(+), 15 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index bf72b548a96d..936fa8f9e5f3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -466,7 +466,8 @@ int find_first_swap(dev_t *device); extern unsigned int count_swap_pages(int, int); extern sector_t swapdev_block(int, pgoff_t); extern int __swap_count(swp_entry_t entry); -extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t en= try); +extern bool swap_entry_swapped(struct swap_info_struct *si, + unsigned long offset); extern int swp_swapcount(swp_entry_t entry); struct backing_dev_info; extern struct swap_info_struct *get_swap_device(swp_entry_t entry); @@ -535,7 +536,8 @@ static inline int __swap_count(swp_entry_t entry) return 0; } =20 -static inline bool swap_entry_swapped(struct swap_info_struct *si, swp_ent= ry_t entry) +static inline bool swap_entry_swapped(struct swap_info_struct *si, + unsigned long offset) { return false; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 8c429dc33ca9..0c5aad537716 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -527,8 +527,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry,= gfp_t gfp_mask, if (folio) return folio; =20 - /* Skip allocation for unused swap slot for readahead path. */ - if (!swap_entry_swapped(si, entry)) + /* Skip allocation for unused and bad swap slot for readahead. */ + if (!swap_entry_swapped(si, swp_offset(entry))) return NULL; =20 /* Allocate a new folio to be added into the swap cache. */ diff --git a/mm/swapfile.c b/mm/swapfile.c index e23287c06f1c..5a766d4fcaa5 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1766,21 +1766,21 @@ int __swap_count(swp_entry_t entry) return swap_count(si->swap_map[offset]); } =20 -/* - * How many references to @entry are currently swapped out? - * This does not give an exact answer when swap count is continued, - * but does include the high COUNT_CONTINUED flag to allow for that. +/** + * swap_entry_swapped - Check if the swap entry at @offset is swapped. + * @si: the swap device. + * @offset: offset of the swap entry. */ -bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry) +bool swap_entry_swapped(struct swap_info_struct *si, unsigned long offset) { - pgoff_t offset =3D swp_offset(entry); struct swap_cluster_info *ci; int count; =20 ci =3D swap_cluster_lock(si, offset); count =3D swap_count(si->swap_map[offset]); swap_cluster_unlock(ci); - return !!count; + + return count && count !=3D SWAP_MAP_BAD; } =20 /* @@ -1866,7 +1866,7 @@ static bool folio_swapped(struct folio *folio) return false; =20 if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio))) - return swap_entry_swapped(si, entry); + return swap_entry_swapped(si, swp_offset(entry)); =20 return swap_page_trans_huge_swapped(si, entry, folio_order(folio)); } @@ -3677,10 +3677,10 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) count =3D si->swap_map[offset + i]; =20 /* - * swapin_readahead() doesn't check if a swap entry is valid, so the - * swap entry could be SWAP_MAP_BAD. Check here with lock held. + * Allocator never allocates bad slots, and readahead is guarded + * by swap_entry_swapped. 
*/ - if (unlikely(swap_count(count) =3D=3D SWAP_MAP_BAD)) { + if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) { err =3D -ENOENT; goto unlock_out; } --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pl1-f176.google.com (mail-pl1-f176.google.com [209.85.214.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 85F9D34D91E for ; Thu, 4 Dec 2025 19:30:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.176 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876627; cv=none; b=lwhXWT/jesb8iSbIYelCFiVy3YClUwjZYgk3K5ZQSeFXsSL+haK9tf3i8t9L592tG9L2qEV/OQDWlUDUNNhsIirnSLrz7N2ZY2fEgmPQFv5GuMB828y1hx6R5Rb7fJpIxAh174rXoCgigWaDxi1w9tbHi4+jf4zrtCmYsd6Qfus= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876627; c=relaxed/simple; bh=FxduN62BIxwiKsv0ce0msdiZXipiWdd3pAEz9c8aVYQ=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=ah1waVauY1hmHIQxLz2X9ykIsYdEeEVymrPmqNVfa5WvKcCM6FegskDIKX45CFJAidToPTwSIPtZ3+WxfDF5Gexj0TA0BJHpX8AmQj4tlJNHV8EMJKayzF/3YKZemAr7ykyu75H9zT0wRZm2dhpQ/6oQuS0ONvgZr2LFichhVsY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=jUSgI7ru; arc=none smtp.client-ip=209.85.214.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="jUSgI7ru" Received: by mail-pl1-f176.google.com with SMTP id d9443c01a7336-297dc3e299bso12692775ad.1 for ; Thu, 04 Dec 2025 11:30:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876625; x=1765481425; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=k6CQ8ZhfqyyFU9/2UA1NETtU6jN0ODo0V1rMBQeu5q0=; b=jUSgI7ru/5AcQddalYbZcwoBNPkzu+SRJLInyWZeChaC/6f7NUmEdm9fZLR5l0PH/G 0StxsCeebT6aH5YXUthiOSh5iipUeqrn3ydTGnPKE0NEtylrNFMdspk1iWTLv2pTtslo 1tgr4AMQd7C/IXrTX5al7qWt6fEPHYfMd4c7w3Qm0t4J+clrp9LNX4v4dysL//vPgSzA cD51ug3HpG95bn3ma5zZtBMksEHHOAvS1lvGgVlCkGeWGEwOD4+vKASjByZFhTVrqUud mwvmCv/LSufSuP/XrIUz/Cwy3f7dhau4TxRW4BpDeHmk7/fnCjX2eO9m+rsWZY8RxrYI fuUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764876625; x=1765481425; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=k6CQ8ZhfqyyFU9/2UA1NETtU6jN0ODo0V1rMBQeu5q0=; b=cIeorcyRvxSY2BDj2AACtmZfCw/qI5pN4Iui/NTkfMafkEqEwwGw+wjIIDDE1dt+aZ h/1ivPSnW5HdRRS184wMVW86Za9HY2lNAtB1Th9di+n90NCK0fEJK5AVSyAb/AyU07lC MEBLrI0xSV3uI6UetBgN5TB13pF91iA76y0/dwsMWadHFa5+9Tpsncfh5iNv0oiawBem 2oIF3IRcnV/aK0Rrww1Cee0tkqkQuIzS0Q2e5KSLP5t4udmxAu61yfWPJwgO1e1/pXr7 odSu5mjhlCVtcckVvBo0WryZe9f1W6h6RgoW0/zD1hHqIJyc0E3CYxZ7hrjKKWSTr05x Q5ig== X-Forwarded-Encrypted: i=1; AJvYcCWInNYS2E1CDVuoiuKMyK9VzlwCen2fFuceE127rDrnssk/4W67lvuqy3NcyijJw2Y7nLPjqdh00kVZn1Q=@vger.kernel.org X-Gm-Message-State: 
AOJu0YwfbxUA70nO7K/tH2jwUaXqNH/7gxwIsFjlDA5TaJZxi0bT7XT9 SugLjQBGJdhWJkEN9O1LB62HX/aaPL6r/9WB28QbINk52nyNvWMpApRN X-Gm-Gg: ASbGnctwegop1wJLROmyg29dIMDUlKvTLhg2PaGiSTGpwAJs4s4kkR1n2CArHSFl/lL MJnRis0PdbbuBcE50IMAf3Co1/Z5o+WpjxP+e5di7Naw0cgARMfWi/U5YWigk5KBIEmTAJH1JrT koLxwj0h5SPI9LnM4ONdufW8ezgDzU0XbLHK5/RqHC2Lz88wD2RBu7pXpZ/dLx78Qt5XuVVykSi b7ZreqRl/fGFzXeMR+53l30KkDFAYvdhdmybOKbs8+p0MT82hDyCgkl+WhPXgCMB5SqS5Fu7Czg 0FuUBcDH2hlBgrhDmpthiaZzgeP7c4pQGkEeOdDOVS/xH6YmplkqBZ78eyIk5I2aDX0o4uzMLja q+oM4hEDzg+26XQGbAvEfUpqE3vfxVmB1TXIeY8XmeWWnqjrDBR1Etg05g7WgUT1RY4hc9JYHaP lzhXwjmIgaa4//5/UNUlCuAYiLdDQOsh3XtFUWF0FRf3H+9p5+FXHAF4Nej28= X-Google-Smtp-Source: AGHT+IFHyywcPX1/2rRywR3A3G3qtQXd16j/VV0NMDNxSNumz/8b1jacGP1q4rkQk6BYnCOs2mhl6Q== X-Received: by 2002:a17:903:298b:b0:298:58ae:f91a with SMTP id d9443c01a7336-29da1eea914mr49921075ad.57.1764876624820; Thu, 04 Dec 2025 11:30:24 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:24 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:18 +0800 Subject: [PATCH v4 10/19] mm, swap: consolidate cluster reclaim and usability check Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-10-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=4270; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=EXBJk1CmynZyFl/MBOO52+UVDAeGzVYnf+qyN2ZRgnU=; b=Zd7rpSLnPPYVIUN/r4Lc3fHgl3rBvcSEdXFO75CbxeJaKwwNOpHKjGCvFPVxKi7nFw4/XqKRd 61GAcg1cDkjBmr3CJMoif3qk9vXF4k4zAkoLoB0Jn0B18JpyUIPmrXc X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Swap cluster cache reclaim requires releasing the lock, so the cluster may become unusable after the reclaim. To prepare for checking swap cache using the swap table directly, consolidate the swap cluster reclaim and the check logic. We will want to avoid touching the cluster's data completely with the swap table, to avoid RCU overhead here. And by moving the cluster usable check into the reclaim helper, it will also help avoid a redundant scan of the slots if the cluster is no longer usable, and we will want to avoid touching the cluster. Also, adjust it very slightly while at it: always scan the whole region during reclaim, don't skip slots covered by a reclaimed folio. Because the reclaim is lockless, it's possible that new cache lands at any time. And for allocation, we want all caches to be reclaimed to avoid fragmentation. Besides, if the scan offset is not aligned with the size of the reclaimed folio, we might skip some existing cache and fail the reclaim unexpectedly. There should be no observable behavior change. It might slightly improve the fragmentation issue or performance. 
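To make the alignment problem concrete, a small userspace sketch with a hypothetical slot layout: slots 0-3 of a region are covered by one 4-page folio's swap cache, slot 4 holds another folio's cache, and the scan starts at slot 1. Stepping by the reclaimed size from the unaligned offset, as the old loop did, never examines slot 4; stepping one slot at a time does:

#include <stdio.h>

int main(void)
{
	/* Hypothetical layout: slots 0-3 hold one 4-page folio's swap cache,
	 * slot 4 holds another cached folio, and the scan starts at slot 1. */
	const int folio_size = 4;

	printf("old stepping visits:");
	for (int off = 1; off < 8; ) {
		printf(" %d", off);
		if (off < folio_size)
			off += folio_size;	/* offset += nr_reclaim */
		else
			off++;
	}
	printf("  (slot 4 is never examined, its cache survives)\n");

	printf("new stepping visits:");
	for (int off = 1; off < 8; off++)
		printf(" %d", off);
	printf("  (every slot is rechecked)\n");
	return 0;
}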
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 45 +++++++++++++++++++++++++++++---------------- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 5a766d4fcaa5..2703dfafc632 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -777,33 +777,51 @@ static int swap_cluster_setup_bad_slot(struct swap_cl= uster_info *cluster_info, return 0; } =20 +/* + * Reclaim drops the ci lock, so the cluster may become unusable (freed or + * stolen by a lower order). @usable will be set to false if that happens. + */ static bool cluster_reclaim_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long start, unsigned long end) + unsigned long start, unsigned int order, + bool *usable) { + unsigned int nr_pages =3D 1 << order; + unsigned long offset =3D start, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - unsigned long offset =3D start; int nr_reclaim; =20 spin_unlock(&ci->lock); do { switch (READ_ONCE(map[offset])) { case 0: - offset++; break; case SWAP_HAS_CACHE: nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); - if (nr_reclaim > 0) - offset +=3D nr_reclaim; - else + if (nr_reclaim < 0) goto out; break; default: goto out; } - } while (offset < end); + } while (++offset < end); out: spin_lock(&ci->lock); + + /* + * We just dropped ci->lock so cluster could be used by another + * order or got freed, check if it's still usable or empty. + */ + if (!cluster_is_usable(ci, order)) { + *usable =3D false; + return false; + } + *usable =3D true; + + /* Fast path, no need to scan if the whole cluster is empty */ + if (cluster_is_empty(ci)) + return true; + /* * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. @@ -900,9 +918,10 @@ static unsigned int alloc_swap_scan_cluster(struct swa= p_info_struct *si, unsigned long start =3D ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages =3D 1 << order; - bool need_reclaim, ret; + bool need_reclaim, ret, usable; =20 lockdep_assert_held(&ci->lock); + VM_WARN_ON(!cluster_is_usable(ci, order)); =20 if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) goto out; @@ -912,14 +931,8 @@ static unsigned int alloc_swap_scan_cluster(struct swa= p_info_struct *si, if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim)) continue; if (need_reclaim) { - ret =3D cluster_reclaim_range(si, ci, offset, offset + nr_pages); - /* - * Reclaim drops ci->lock and cluster could be used - * by another order. Not checking flag as off-list - * cluster has no flag set, and change of list - * won't cause fragmentation. 
- */ - if (!cluster_is_usable(ci, order)) + ret =3D cluster_reclaim_range(si, ci, offset, order, &usable); + if (!usable) goto out; if (cluster_is_empty(ci)) offset =3D start; --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pl1-f175.google.com (mail-pl1-f175.google.com [209.85.214.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3F65934DB4A for ; Thu, 4 Dec 2025 19:30:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.175 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876631; cv=none; b=nyWrBCBLZlmMl+U/2QaXWNOAVvAR/gYjvGd+mqTq+pDxqFio4k6aCssKgVtQiNgqK9DaiXDJYCtuIItZwpyaW5TqUN3q+rS+6iU/7OWUlRuzWlmfLfkJiLUIenYsq2R0yZSG6Za+cKGAxpfHv2ldyzxTxyR+km5mKJs+2JTj8QU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876631; c=relaxed/simple; bh=VlDCm54ah9liVJ/Nk6cWG/1/qPfdYyE/ePJ761vI6Y4=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=ZGs3jXb19IWzz/pv48Q0knYa/qK5eG8N9R3ULShys8gtr6g3b3Jhj0nr9fst7eHXsxGRMZO4WOfdY4NsrnXYQq6jrszvzaJGkImhDS5ISXT6wKVTEU5wDFzdG2LeqAvavLFf/iZKGvmNCZSRCepDw0dQTh+4CRbBTmKRuwPIcq8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=X4J9vOsZ; arc=none smtp.client-ip=209.85.214.175 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="X4J9vOsZ" Received: by mail-pl1-f175.google.com with SMTP id d9443c01a7336-297e264528aso13355035ad.2 for ; Thu, 04 Dec 2025 11:30:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876629; x=1765481429; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=pdpDGFKobTDZcaA+oYO377RhHuNYAnjN7TF4QVNsFf8=; b=X4J9vOsZ3HPWD26zur2NE8renTMW+dz4mV4sWWcI6dB2x6g53mm5LKejUbWLlDvVXL gVr0TReJVswhOGtpzQXB2K2mc3i2gVVoTYGxKldX1zc7tKnHQm82PiUtbY0EzV7lyUv1 +RflHFIFeyO3xLczTHFE2S0P1Fs0wxaI5w/htP63dJ9SETuklAZabm41Z0kGOOfM5V0Q qcg7zrYj4SZLTLW4jF/NYyZ9BU8iIZgIdN/reAHFpnWqIDq4mx2oaLapEKXOTDznExmS IwWBiWaah1/XxAde8B2e/m/gtjp1nSVOXurQ1gxpuJfvSCMA3lo8+GKVsxKVq1/i7Gtw J47w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764876629; x=1765481429; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=pdpDGFKobTDZcaA+oYO377RhHuNYAnjN7TF4QVNsFf8=; b=A/ajMrHUEW7NMO/hAFLcra/iVJ5UlErsAoc66HInq1WCw0uXxMW37wZ2fyDt87sIrj apoRFlFGEOgf6hhtoBwO51WwIL31z6vaIeEGSzOVrP9cyaKsuE3w+pyFlwY1HFfBhqqI XlQO/BteR4jxYhOoYbMzbea+vBgc1KNHsNo/9Evo5bQqRm8gzqahRLI1sw9CqPUJ/K8S W6Eg1EmGlKrli4M5UBzVwF47RoUAdogfSh7vRoTzmrYdKn8UqoaICq5cBy6CPzY5LpV2 E59EWJwXWzTnWTFKR01ZWjS2WWH2YOW/tEvbBxz5xDIXhRIL8aYRX7oi6I755HfzNONw 2BPA== X-Forwarded-Encrypted: i=1; AJvYcCVWJ7NBXGcZ0g2IzvPmuqQjWCeqq2FljuBfkP4O/o7JefYljNr0UdwigLvaSyzdZwoIWfmiuKvrFVQcHb8=@vger.kernel.org X-Gm-Message-State: 
AOJu0Yzx3urucbhAK/fgaWxNNgCHj/XPf3Aiu5Yn9fLSzp5h/D4Gvcx0 EU+E8P19yH9jmi5O8iFb6Qac0iqmri7rL+cilxVSL04jhC7yDa1FhOuC X-Gm-Gg: ASbGnctUNMhkFvYOEOAeEkw2VSBVvZP2w7kHsajTk6k5NNTJeGXi9v+k6/gsx+QTN0U wH8QP9KbUhuEz0bKlQSEx0JrSq5WIHqWm4r/DSO0VMz3oGWZpnGIvkBTEXFJFGtMP1Uuf4U0XH6 G4d7eRwpbNPgTQDqtV8P6bJmWb+eHOa9Rhe61P3M+qUMBFPOMVHOgvNePiZBo+/9JNXqXUvfsuJ k+KOfTNpZ1NwVwvHwJMv8pMFpDlxrPoC17h4Gtj9Y9NHp5G1cK7VezjJLq7LTpHNlBxNgBhDTkn GvqRMcufK8XvA9l/shrW7RM+EMOybqyoF0hz2YuuhCi2MselgaaWfhxUEIrYEQ1NPxP/H18Ojhv Fl5Ix5sgOwX1KZZJdp/+H+a9Y7x2A4KKe+cRwXZQ637bey0JSaGK7WkVxe69OO2kz3qeckGHtNQ ivLPK/GCChnDWAHsPCkns0ABgOl7PsniHUV0VSgZYCvDpe123v X-Google-Smtp-Source: AGHT+IFficBiEMz/pYiFH4fMZ1x1Z847vNsC3DafnIyYiLDS2+cnBVh0UanWRv7rLW1Nv8XeJgOXlA== X-Received: by 2002:a17:902:ccd0:b0:297:e231:f410 with SMTP id d9443c01a7336-29d9fb65b25mr43914435ad.13.1764876629308; Thu, 04 Dec 2025 11:30:29 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:28 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:19 +0800 Subject: [PATCH v4 11/19] mm, swap: split locked entry duplicating into a standalone helper Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-11-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=3213; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=UCtbWcSkIiRY+e/EucxLikEc+mfHA8d9Ghbzu0X0yNQ=; b=uBiHRTVs5Yf1quAWq5aqh+UbkP/IGjU3Qt//+q+F8Aj0HHKN7bFns4gRjtpoF5uyfmUMxZk0Y 4C81u+2HDaACCprv7DUtWzXX9t+Ur6JqeLPVF4a0a2SxiNEaldfC5I8 X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song No feature change, split the common logic into a stand alone helper to be reused later. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 62 +++++++++++++++++++++++++++++--------------------------= ---- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 2703dfafc632..d9d943fc7b8d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3666,26 +3666,14 @@ void si_swapinfo(struct sysinfo *val) * - swap-cache reference is requested but the entry is not used. -> ENOENT * - swap-mapped reference requested but needs continued swap count. 
-> EN= OMEM */ -static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) +static int swap_dup_entries(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage, int nr) { - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long offset; - unsigned char count; - unsigned char has_cache; - int err, i; - - si =3D swap_entry_to_info(entry); - if (WARN_ON_ONCE(!si)) { - pr_err("%s%08lx\n", Bad_file, entry.val); - return -EINVAL; - } - - offset =3D swp_offset(entry); - VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - ci =3D swap_cluster_lock(si, offset); + int i; + unsigned char count, has_cache; =20 - err =3D 0; for (i =3D 0; i < nr; i++) { count =3D si->swap_map[offset + i]; =20 @@ -3693,25 +3681,20 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) * Allocator never allocates bad slots, and readahead is guarded * by swap_entry_swapped. */ - if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) { - err =3D -ENOENT; - goto unlock_out; - } + if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) + return -ENOENT; =20 has_cache =3D count & SWAP_HAS_CACHE; count &=3D ~SWAP_HAS_CACHE; =20 if (!count && !has_cache) { - err =3D -ENOENT; + return -ENOENT; } else if (usage =3D=3D SWAP_HAS_CACHE) { if (has_cache) - err =3D -EEXIST; + return -EEXIST; } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { - err =3D -EINVAL; + return -EINVAL; } - - if (err) - goto unlock_out; } =20 for (i =3D 0; i < nr; i++) { @@ -3730,14 +3713,31 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) * Don't need to rollback changes, because if * usage =3D=3D 1, there must be nr =3D=3D 1. */ - err =3D -ENOMEM; - goto unlock_out; + return -ENOMEM; } =20 WRITE_ONCE(si->swap_map[offset + i], count | has_cache); } =20 -unlock_out: + return 0; +} + +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) +{ + int err; + struct swap_info_struct *si; + struct swap_cluster_info *ci; + unsigned long offset =3D swp_offset(entry); + + si =3D swap_entry_to_info(entry); + if (WARN_ON_ONCE(!si)) { + pr_err("%s%08lx\n", Bad_file, entry.val); + return -EINVAL; + } + + VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); + ci =3D swap_cluster_lock(si, offset); + err =3D swap_dup_entries(si, ci, offset, usage, nr); swap_cluster_unlock(ci); return err; } --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pl1-f176.google.com (mail-pl1-f176.google.com [209.85.214.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AB2D234DCD8 for ; Thu, 4 Dec 2025 19:30:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.176 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876637; cv=none; b=FJjFDVg2jxinaL9KRBDwxcHq9gzx1MUaa5cwRy8GSzXflnQMl5k3QOqAGZvgrI+AYJfJ6sCjiBYukPgRNLgWx4ZSabPq3uyh2I92gzVX0QjbbBGFReJeqCglQw/8jVIVytI5wkZRxbW7U57CZLbUeijmJYma7WaFQ26XMZN86wU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876637; c=relaxed/simple; bh=nNcv92EqV7JtfYTPM3p1SMaWRc/TyxRJd9qC2/MPbHQ=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; 
From: Kairui Song
Date: Fri, 05 Dec 2025 03:29:20 +0800 Subject: [PATCH v4 12/19] mm, swap: use swap cache as the swap in synchronize layer Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-12-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=14399; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=k7XA6vA3z6eaz1At5HxTObmE+fBHx81TMOa8X6ZDU2c=; b=/r73yBSBbHZb0OA0OGJVs53E47Q6H9PNK4+1TyQTPPhXeZpRNFQXkWB7YdrsEBfJd/KviAoEw Wtj3Ry1Hk4ZAOP9V4qDgTRc7zVnNZ/EBWkahaeS4tSJXUXKMej+m4HK X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Current swap in synchronization mostly uses the swap_map's SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual work to swap in a folio. This has been causing many issues as it's just a poor implementation of a bit lock. Raced users have no idea what is pinning a slot, so it has to loop with a schedule_timeout_uninterruptible(1), which is ugly and causes long-tailing or other performance issues. Besides, the abuse of SWAP_HAS_CACHE has been causing many other troubles for synchronization or maintenance. This is the first step to remove this bit completely. We have just removed all swap in paths that bypass the swap cache, and now both the swap cache and swap map are protected by the cluster lock. So now we can just resolve the swap synchronization with the swap cache layer directly using the cluster lock. Whoever inserts a folio in the swap cache first does the swap in work. And because folios are locked during swap operations, other raced users will just wait on the folio lock. The SWAP_HAS_CACHE will be removed in later commit. For now, we still set it for some remaining users. But now we do the bit setting and swap cache folio adding in the same critical section, after swap cache is ready. No one will have to spin on the SWAP_HAS_CACHE bit anymore. This both simplifies the logic and should improve the performance, eliminating issues like the one solved in commit 01626a1823024 ("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"), or the "skip_if_exists" from commit a65b0e7607ccb ("zswap: make shrinking memcg-aware"), which will be removed very soon. 
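Before the diff, here is a minimal sketch of the synchronization model described above. It is illustrative only and deliberately skips memcg charging, the large-folio fallback and error handling; swap_cache_add_folio() and swap_cache_get_folio() are the helpers reworked by this patch, while swapin_race_sketch() is a made-up name.

/*
 * Whoever inserts a locked folio into the swap cache first "owns" the
 * swap in; raced users find that folio and sleep on its lock instead
 * of spinning on SWAP_HAS_CACHE.
 */
static struct folio *swapin_race_sketch(swp_entry_t entry, struct folio *new_folio)
{
	struct folio *existing;

	__folio_set_locked(new_folio);
	__folio_set_swapbacked(new_folio);
	if (!swap_cache_add_folio(new_folio, entry, NULL, false))
		return new_folio;	/* we won the race: read the swapped data into new_folio */

	folio_unlock(new_folio);
	existing = swap_cache_get_folio(entry);
	if (existing)
		folio_lock(existing);	/* we lost: wait for the winner under the folio lock */
	return existing;		/* may be NULL if the entry was freed meanwhile */
}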
Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 6 --- mm/swap.h | 15 +++++++- mm/swap_state.c | 105 ++++++++++++++++++++++++++++-------------------= ---- mm/swapfile.c | 39 ++++++++++++------- mm/vmscan.c | 1 - 5 files changed, 96 insertions(+), 70 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 936fa8f9e5f3..69025b473472 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -458,7 +458,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t en= try); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern int swap_duplicate_nr(swp_entry_t entry, int nr); -extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); @@ -518,11 +517,6 @@ static inline int swap_duplicate_nr(swp_entry_t swp, i= nt nr_pages) return 0; } =20 -static inline int swapcache_prepare(swp_entry_t swp, int nr) -{ - return 0; -} - static inline void swap_free_nr(swp_entry_t entry, int nr_pages) { } diff --git a/mm/swap.h b/mm/swap.h index e0f05babe13a..b5075a1aee04 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -234,6 +234,14 @@ static inline bool folio_matches_swap_entry(const stru= ct folio *folio, return folio_entry.val =3D=3D round_down(entry.val, nr_pages); } =20 +/* Temporary internal helpers */ +void __swapcache_set_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry); +void __swapcache_clear_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr); + /* * All swap cache helpers below require the caller to ensure the swap entr= ies * used are valid and stablize the device by any of the following ways: @@ -247,7 +255,8 @@ static inline bool folio_matches_swap_entry(const struc= t folio *folio, */ struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); -void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **s= hadow); +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadow, bool alloc); void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, @@ -413,8 +422,10 @@ static inline void *swap_cache_get_shadow(swp_entry_t = entry) return NULL; } =20 -static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t e= ntry, void **shadow) +static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t en= try, + void **shadow, bool alloc) { + return -ENOENT; } =20 static inline void swap_cache_del_folio(struct folio *folio) diff --git a/mm/swap_state.c b/mm/swap_state.c index 0c5aad537716..df7df8b75e52 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -128,34 +128,64 @@ void *swap_cache_get_shadow(swp_entry_t entry) * @entry: The swap entry corresponding to the folio. * @gfp: gfp_mask for XArray node allocation. * @shadowp: If a shadow is found, return the shadow. + * @alloc: If it's the allocator that is trying to insert a folio. Allocat= or + * sets SWAP_HAS_CACHE to pin slots before insert so skip map upda= te. * * Context: Caller must ensure @entry is valid and protect the swap device * with reference count or locks. - * The caller also needs to update the corresponding swap_map slots with - * SWAP_HAS_CACHE bit to avoid race or conflict. 
*/ -void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **s= hadowp) +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadowp, bool alloc) { + int err; void *shadow =3D NULL; + struct swap_info_struct *si; unsigned long old_tb, new_tb; struct swap_cluster_info *ci; - unsigned int ci_start, ci_off, ci_end; + unsigned int ci_start, ci_off, ci_end, offset; unsigned long nr_pages =3D folio_nr_pages(folio); =20 VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); =20 + si =3D __swap_entry_to_info(entry); new_tb =3D folio_to_swp_tb(folio); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; ci_off =3D ci_start; - ci =3D swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry)); + offset =3D swp_offset(entry); + ci =3D swap_cluster_lock(si, swp_offset(entry)); + if (unlikely(!ci->table)) { + err =3D -ENOENT; + goto failed; + } do { - old_tb =3D __swap_table_xchg(ci, ci_off, new_tb); - WARN_ON_ONCE(swp_tb_is_folio(old_tb)); + old_tb =3D __swap_table_get(ci, ci_off); + if (unlikely(swp_tb_is_folio(old_tb))) { + err =3D -EEXIST; + goto failed; + } + if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset))= )) { + err =3D -ENOENT; + goto failed; + } if (swp_tb_is_shadow(old_tb)) shadow =3D swp_tb_to_shadow(old_tb); + offset++; + } while (++ci_off < ci_end); + + ci_off =3D ci_start; + offset =3D swp_offset(entry); + do { + /* + * Still need to pin the slots with SWAP_HAS_CACHE since + * swap allocator depends on that. + */ + if (!alloc) + __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); + __swap_table_set(ci, ci_off, new_tb); + offset++; } while (++ci_off < ci_end); =20 folio_ref_add(folio, nr_pages); @@ -168,6 +198,11 @@ void swap_cache_add_folio(struct folio *folio, swp_ent= ry_t entry, void **shadowp =20 if (shadowp) *shadowp =3D shadow; + return 0; + +failed: + swap_cluster_unlock(ci); + return err; } =20 /** @@ -186,6 +221,7 @@ void swap_cache_add_folio(struct folio *folio, swp_entr= y_t entry, void **shadowp void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *fo= lio, swp_entry_t entry, void *shadow) { + struct swap_info_struct *si; unsigned long old_tb, new_tb; unsigned int ci_start, ci_off, ci_end; unsigned long nr_pages =3D folio_nr_pages(folio); @@ -195,6 +231,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *c= i, struct folio *folio, VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio); =20 + si =3D __swap_entry_to_info(entry); new_tb =3D shadow_swp_to_tb(shadow); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; @@ -210,6 +247,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *c= i, struct folio *folio, folio_clear_swapcache(folio); node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages); lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); + __swapcache_clear_cached(si, ci, entry, nr_pages); } =20 /** @@ -231,7 +269,6 @@ void swap_cache_del_folio(struct folio *folio) __swap_cache_del_folio(ci, folio, entry, NULL); swap_cluster_unlock(ci); =20 - put_swap_folio(folio, entry); folio_ref_sub(folio, folio_nr_pages(folio)); } =20 @@ -423,67 +460,37 @@ static struct folio *__swap_cache_prepare_and_add(swp= _entry_t entry, gfp_t gfp, bool charged, bool skip_if_exists) { - struct folio *swapcache; + struct folio *swapcache =3D NULL; void *shadow; 
int ret; =20 - /* - * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio - * into the swap cache. Loop with a schedule delay if raced with - * another process setting SWAP_HAS_CACHE. This hackish loop will - * be fixed very soon. - */ + __folio_set_locked(folio); + __folio_set_swapbacked(folio); for (;;) { - ret =3D swapcache_prepare(entry, folio_nr_pages(folio)); + ret =3D swap_cache_add_folio(folio, entry, &shadow, false); if (!ret) break; =20 /* - * The skip_if_exists is for protecting against a recursive - * call to this helper on the same entry waiting forever - * here because SWAP_HAS_CACHE is set but the folio is not - * in the swap cache yet. This can happen today if - * mem_cgroup_swapin_charge_folio() below triggers reclaim - * through zswap, which may call this helper again in the - * writeback path. - * - * Large order allocation also needs special handling on + * Large order allocation needs special handling on * race: if a smaller folio exists in cache, swapin needs * to fallback to order 0, and doing a swap cache lookup * might return a folio that is irrelevant to the faulting * entry because @entry is aligned down. Just return NULL. */ if (ret !=3D -EEXIST || skip_if_exists || folio_test_large(folio)) - return NULL; + goto failed; =20 - /* - * Check the swap cache again, we can only arrive - * here because swapcache_prepare returns -EEXIST. - */ swapcache =3D swap_cache_get_folio(entry); if (swapcache) - return swapcache; - - /* - * We might race against __swap_cache_del_folio(), and - * stumble across a swap_map entry whose SWAP_HAS_CACHE - * has not yet been cleared. Or race against another - * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE - * in swap_map, but not yet added its folio to swap cache. - */ - schedule_timeout_uninterruptible(1); + goto failed; } =20 - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) { - put_swap_folio(folio, entry); - folio_unlock(folio); - return NULL; + swap_cache_del_folio(folio); + goto failed; } =20 - swap_cache_add_folio(folio, entry, &shadow); memcg1_swapin(entry, folio_nr_pages(folio)); if (shadow) workingset_refault(folio, shadow); @@ -491,6 +498,10 @@ static struct folio *__swap_cache_prepare_and_add(swp_= entry_t entry, /* Caller will initiate read into locked folio */ folio_add_lru(folio); return folio; + +failed: + folio_unlock(folio); + return swapcache; } =20 /** diff --git a/mm/swapfile.c b/mm/swapfile.c index d9d943fc7b8d..f7c0a9eb5f04 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1476,7 +1476,11 @@ int folio_alloc_swap(struct folio *folio) if (!entry.val) return -ENOMEM; =20 - swap_cache_add_folio(folio, entry, NULL); + /* + * Allocator has pinned the slots with SWAP_HAS_CACHE + * so it should never fail + */ + WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); =20 return 0; =20 @@ -1582,9 +1586,8 @@ static unsigned char swap_entry_put_locked(struct swa= p_info_struct *si, * do_swap_page() * ... swapoff+swapon * swap_cache_alloc_folio() - * swapcache_prepare() - * __swap_duplicate() - * // check swap_map + * swap_cache_add_folio() + * // check swap_map * // verify PTE not changed * * In __swap_duplicate(), the swap_map need to be checked before @@ -3768,17 +3771,25 @@ int swap_duplicate_nr(swp_entry_t entry, int nr) return err; } =20 -/* - * @entry: first swap entry from which we allocate nr swap cache. 
- * - * Called when allocating swap cache for existing swap entries, - * This can return error codes. Returns 0 at success. - * -EEXIST means there is a swap cache. - * Note: return code is different from swap_duplicate(). - */ -int swapcache_prepare(swp_entry_t entry, int nr) +/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */ +void __swapcache_set_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry) +{ + WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1)); +} + +/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock = */ +void __swapcache_clear_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr) { - return __swap_duplicate(entry, SWAP_HAS_CACHE, nr); + if (swap_only_has_cache(si, swp_offset(entry), nr)) { + swap_entries_free(si, ci, entry, nr); + } else { + for (int i =3D 0; i < nr; i++, entry.val++) + swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + } } =20 /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 3b85652a42b9..9483267ebf70 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -761,7 +761,6 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, __swap_cache_del_folio(ci, folio, swap, shadow); memcg1_swapout(folio, swap); swap_cluster_unlock_irq(ci); - put_swap_folio(folio, swap); } else { void (*free_folio)(struct folio *); =20 --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pl1-f173.google.com (mail-pl1-f173.google.com [209.85.214.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 25B7434E749 for ; Thu, 4 Dec 2025 19:30:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.173 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876640; cv=none; b=LNmLqR8SV6Kil3ualJOoQ1qNIRhEIifk3nt9soOI1MaEUabSLxLUiLeH8JXWRl8OYrLtCi9he7N71OV+sa5qAlpegmgTkThFgp+LJPBK2gSz6GipJDqTPbMxNdqgox5o4EopBaicsVwTYIVKefPLdAMhVfbW+9TpURZOQCJ023A= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876640; c=relaxed/simple; bh=7V3z6QyMtwAETHD1IqGfcpC8uYH0tKWQwoxXjylZCX4=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=sy4UemeZdei7Ty8Jp1WoU4xt0WI9SCgG7qpfljNpTKWR+bCNKT09MlpehpKy7XmSSU0IbLEDI5RhpV75KRF3oEzc+Hf88EgaQ0NxMJ8ctQ4Y2Qax1EA5RQ9Ceb0f4Xc3jVFZwun69TToDzBbH7Q9DzT4Kz46ZlVl160oD4ttgCo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=AIbaxql7; arc=none smtp.client-ip=209.85.214.173 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="AIbaxql7" Received: by mail-pl1-f173.google.com with SMTP id d9443c01a7336-297d4a56f97so17988455ad.1 for ; Thu, 04 Dec 2025 11:30:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876638; x=1765481438; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id 
From: Kairui Song Date: Fri, 05 Dec 2025 03:29:21 +0800 Subject: [PATCH v4 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-13-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=7032; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=SYqPgb2o84SebrcaWLsBII5bIr04ISZ9VA6R4j7AeY4=; b=18DhOG0mhl1tF3AnvCIQZl0WLi8NH5XB5nif23XJSLd0gnQqWZPk0OeZ+x8LGsfRGVAEvDSJ
DNd9sb4BWPRDbRzc+Ht9ROQgN9O3yghC06z0xSLZZoJlEwY1lsiqO7R X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Remove the "skip if exists" check from commit a65b0e7607ccb ("zswap: make shrinking memcg-aware"). It was needed because there is a tiny time window between setting the SWAP_HAS_CACHE bit and actually adding the folio to the swap cache. If a user is trying to add the folio into the swap cache but another user was interrupted after setting SWAP_HAS_CACHE but hasn't added the folio to the swap cache yet, it might lead to a deadlock. We have moved the bit setting to the same critical section as adding the folio, so this is no longer needed. Remove it and clean it up. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swap.h | 2 +- mm/swap_state.c | 27 ++++++++++----------------- mm/zswap.c | 2 +- 3 files changed, 12 insertions(+), 19 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index b5075a1aee04..6777b2ab9d92 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -260,7 +260,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry= _t entry, void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, - bool *alloced, bool skip_if_exists); + bool *alloced); /* Below helpers require the caller to lock and pass in the swap cluster. = */ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow); diff --git a/mm/swap_state.c b/mm/swap_state.c index df7df8b75e52..1a69ba3be87f 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -445,8 +445,6 @@ void swap_update_readahead(struct folio *folio, struct = vm_area_struct *vma, * @folio: folio to be added. * @gfp: memory allocation flags for charge, can be 0 if @charged if true. * @charged: if the folio is already charged. - * @skip_if_exists: if the slot is in a cached state, return NULL. - * This is an old workaround that will be removed shortly. * * Update the swap_map and add folio as swap cache, typically before swapi= n. * All swap slots covered by the folio must have a non-zero swap count. @@ -457,8 +455,7 @@ void swap_update_readahead(struct folio *folio, struct = vm_area_struct *vma, */ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry, struct folio *folio, - gfp_t gfp, bool charged, - bool skip_if_exists) + gfp_t gfp, bool charged) { struct folio *swapcache =3D NULL; void *shadow; @@ -478,7 +475,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, * might return a folio that is irrelevant to the faulting * entry because @entry is aligned down. Just return NULL. */ - if (ret !=3D -EEXIST || skip_if_exists || folio_test_large(folio)) + if (ret !=3D -EEXIST || folio_test_large(folio)) goto failed; =20 swapcache =3D swap_cache_get_folio(entry); @@ -511,8 +508,6 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, * @mpol: NUMA memory allocation policy to be applied * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE * @new_page_allocated: sets true if allocation happened, false otherwise - * @skip_if_exists: if the slot is a partially cached state, return NULL. - * This is a workaround that would be removed shortly. * * Allocate a folio in the swap cache for one swap slot, typically before * doing IO (e.g. swap in or zswap writeback). 
The swap slot indicated by @@ -525,8 +520,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, */ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx, - bool *new_page_allocated, - bool skip_if_exists) + bool *new_page_allocated) { struct swap_info_struct *si =3D __swap_entry_to_info(entry); struct folio *folio; @@ -547,8 +541,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry,= gfp_t gfp_mask, if (!folio) return NULL; /* Try add the new folio, returns existing folio or NULL on failure. */ - result =3D __swap_cache_prepare_and_add(entry, folio, gfp_mask, - false, skip_if_exists); + result =3D __swap_cache_prepare_and_add(entry, folio, gfp_mask, false); if (result =3D=3D folio) *new_page_allocated =3D true; else @@ -577,7 +570,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct fo= lio *folio) unsigned long nr_pages =3D folio_nr_pages(folio); =20 entry =3D swp_entry(swp_type(entry), round_down(offset, nr_pages)); - swapcache =3D __swap_cache_prepare_and_add(entry, folio, 0, true, false); + swapcache =3D __swap_cache_prepare_and_add(entry, folio, 0, true); if (swapcache =3D=3D folio) swap_read_folio(folio, NULL); return swapcache; @@ -605,7 +598,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, = gfp_t gfp_mask, =20 mpol =3D get_vma_policy(vma, addr, 0, &ilx); folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); mpol_cond_put(mpol); =20 if (page_allocated) @@ -724,7 +717,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, /* Ok, do the async read-ahead now */ folio =3D swap_cache_alloc_folio( swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); if (!folio) continue; if (page_allocated) { @@ -742,7 +735,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, skip: /* The page was likely read above, so no need for plugging here */ folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); if (unlikely(page_allocated)) swap_read_folio(folio, NULL); return folio; @@ -847,7 +840,7 @@ static struct folio *swap_vma_readahead(swp_entry_t tar= g_entry, gfp_t gfp_mask, continue; } folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); if (si) put_swap_device(si); if (!folio) @@ -869,7 +862,7 @@ static struct folio *swap_vma_readahead(swp_entry_t tar= g_entry, gfp_t gfp_mask, skip: /* The folio was likely read above, so no need for plugging here */ folio =3D swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx, - &page_allocated, false); + &page_allocated); if (unlikely(page_allocated)) swap_read_folio(folio, NULL); return folio; diff --git a/mm/zswap.c b/mm/zswap.c index a7a2443912f4..d8a33db9d3cc 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1015,7 +1015,7 @@ static int zswap_writeback_entry(struct zswap_entry *= entry, =20 mpol =3D get_task_policy(current); folio =3D swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol, - NO_INTERLEAVE_INDEX, &folio_was_allocated, true); + NO_INTERLEAVE_INDEX, &folio_was_allocated); put_swap_device(si); if (!folio) return -ENOMEM; --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pf1-f179.google.com (mail-pf1-f179.google.com [209.85.210.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with 
anvcpJtVqj6nQIKsFilpEden6JLsCBgXzY11n+p5nkGJuCVOtxH3k5D6lKi2GKWe2TneWXNT6C5 ci/uODkJFsFVJs4x+LmGPDgq8wR2jR++LXAAuMqLJbYD1xdl/6sxxnR9/FnYNBPvRlB2htTKOTC ZrkWoJD2NYXQanO5z6b/4U9+NXJVLxm3jlpvJIpvbDKLlUqDQQ X-Google-Smtp-Source: AGHT+IE565fqDPPdTQMOwYKFGPPHokq9nTgxcJZqxpSET3fyTRk5FoH5c8NaAYLyV3N4jtPutVSPuQ== X-Received: by 2002:a05:6a20:12d2:b0:35d:492e:2ed0 with SMTP id adf61e73a8af0-363f5eaf2c1mr9107168637.52.1764876643299; Thu, 04 Dec 2025 11:30:43 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:42 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:22 +0800 Subject: [PATCH v4 14/19] mm, swap: cleanup swap entry management workflow Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-14-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song , linux-pm@vger.kernel.org, "Rafael J. Wysocki (Intel)" X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=28139; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=sKOLZGzQ+5PPYbhu85Y+d6++0e+6CFoLYnZyXDyDTvg=; b=TdjPoZronpfA0dpz//w1x6wz4plK2UQAWNLdYFbXJR5U+0zoc1tcc7tNWUgG0Pe7csbaKFOW/ S8YmHksyMxyByad7j7dpaEJ74VK4FbnZn1moCCSOVxy8sVi3rUAvlS5 X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song The current swap entry allocation/freeing workflow has never had a clear definition. This makes it hard to debug or add new optimizations. This commit introduces a proper definition of how swap entries would be allocated and freed. Now, most operations are folio based, so they will never exceed one swap cluster, and we now have a cleaner border between swap and the rest of mm, making it much easier to follow and debug, especially with new added sanity checks. Also making more optimization possible. Swap entry will be mostly allocated and free with a folio bound. The folio lock will be useful for resolving many swap ralated races. Now swap allocation (except hibernation) always starts with a folio in the swap cache, and gets duped/freed protected by the folio lock: - folio_alloc_swap() - The only allocation entry point now. Context: The folio must be locked. This allocates one or a set of continuous swap slots for a folio and binds them to the folio by adding the folio to the swap cache. The swap slots' swap count start with zero value. - folio_dup_swap() - Increase the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This increases the ref count of swap entries allocated to a folio. Newly allocated swap slots' count has to be increased by this helper as the folio got unmapped (and swap entries got installed). 
- folio_put_swap() - Decrease the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This decreases the ref count of swap entries allocated to a folio. Typically, swapin will decrease the swap count as the folio got installed back and the swap entry got uninstalled This won't remove the folio from the swap cache and free the slot. Lazy freeing of swap cache is helpful for reducing IO. There is already a folio_free_swap() for immediate cache reclaim. This part could be further optimized later. The above locking constraints could be further relaxed when the swap table if fully implemented. Currently dup still needs the caller to lock the swap entry container (e.g. PTL), or a concurrent zap may underflow the swap count. Some swap users need to interact with swap count without involving folio (e.g. forking/zapping the page table or mapping truncate without swapin). In such cases, the caller has to ensure there is no race condition on whatever owns the swap count and use the below helpers: - swap_put_entries_direct() - Decrease the swap count directly. Context: The caller must lock whatever is referencing the slots to avoid a race. Typically the page table zapping or shmem mapping truncate will need to free swap slots directly. If a slot is cached (has a folio bound), this will also try to release the swap cache. - swap_dup_entry_direct() - Increase the swap count directly. Context: The caller must lock whatever is referencing the entries to avoid race, and the entries must already have a swap count > 1. Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table. The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed. Hibernation subsystem is a bit different, so two special wrappers are here: - swap_alloc_hibernation_slot() - Allocate one entry from one device. - swap_free_hibernation_slot() - Free one entry allocated by the above helper. All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines. By separating the workflows, it will be possible to bind folio more tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary pin. This commit should not introduce any behavior change Cc: linux-pm@vger.kernel.org Acked-by: Rafael J. 
Wysocki (Intel) Signed-off-by: Kairui Song Suggested-by: Chris Li --- arch/s390/mm/gmap_helpers.c | 2 +- arch/s390/mm/pgtable.c | 2 +- include/linux/swap.h | 58 ++++++++--------- kernel/power/swap.c | 10 +-- mm/madvise.c | 2 +- mm/memory.c | 15 +++-- mm/rmap.c | 7 +- mm/shmem.c | 10 +-- mm/swap.h | 37 +++++++++++ mm/swapfile.c | 152 +++++++++++++++++++++++++++++++---------= ---- 10 files changed, 197 insertions(+), 98 deletions(-) diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c index 549f14ad08af..c3f56a096e8c 100644 --- a/arch/s390/mm/gmap_helpers.c +++ b/arch/s390/mm/gmap_helpers.c @@ -32,7 +32,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm,= softleaf_t entry) dec_mm_counter(mm, MM_SWAPENTS); else if (softleaf_is_migration(entry)) dec_mm_counter(mm, mm_counter(softleaf_to_folio(entry))); - free_swap_and_cache(entry); + swap_put_entries_direct(entry, 1); } =20 /** diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index d670bfb47d9b..c3fa94a6ec15 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -692,7 +692,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *m= m, softleaf_t entry) =20 dec_mm_counter(mm, mm_counter(folio)); } - free_swap_and_cache(entry); + swap_put_entries_direct(entry, 1); } =20 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr, diff --git a/include/linux/swap.h b/include/linux/swap.h index 69025b473472..ac3caa4c6999 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void) } =20 extern void si_swapinfo(struct sysinfo *); -int folio_alloc_swap(struct folio *folio); -bool folio_free_swap(struct folio *folio); void put_swap_folio(struct folio *folio, swp_entry_t entry); -extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern int swap_duplicate_nr(swp_entry_t entry, int nr); -extern void swap_free_nr(swp_entry_t entry, int nr_pages); -extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); extern unsigned int count_swap_pages(int, int); @@ -472,6 +466,29 @@ struct backing_dev_info; extern struct swap_info_struct *get_swap_device(swp_entry_t entry); sector_t swap_folio_sector(struct folio *folio); =20 +/* + * If there is an existing swap slot reference (swap entry) and the caller + * guarantees that there is no race modification of it (e.g., PTL + * protecting the swap entry in page table; shmem's cmpxchg protects t + * he swap entry in shmem mapping), these two helpers below can be used + * to put/dup the entries directly. + * + * All entries must be allocated by folio_alloc_swap(). And they must have + * a swap count > 1. See comments of folio_*_swap helpers for more info. + */ +int swap_dup_entry_direct(swp_entry_t entry); +void swap_put_entries_direct(swp_entry_t entry, int nr); + +/* + * folio_free_swap tries to free the swap entries pinned by a swap cache + * folio, it has to be here to be called by other components. 
+ */ +bool folio_free_swap(struct folio *folio); + +/* Allocate / free (hibernation) exclusive entries */ +swp_entry_t swap_alloc_hibernation_slot(int type); +void swap_free_hibernation_slot(swp_entry_t entry); + static inline void put_swap_device(struct swap_info_struct *si) { percpu_ref_put(&si->users); @@ -499,10 +516,6 @@ static inline void put_swap_device(struct swap_info_st= ruct *si) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr)); =20 -static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr) -{ -} - static inline void free_swap_cache(struct folio *folio) { } @@ -512,12 +525,12 @@ static inline int add_swap_count_continuation(swp_ent= ry_t swp, gfp_t gfp_mask) return 0; } =20 -static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages) +static inline int swap_dup_entry_direct(swp_entry_t ent) { return 0; } =20 -static inline void swap_free_nr(swp_entry_t entry, int nr_pages) +static inline void swap_put_entries_direct(swp_entry_t ent, int nr) { } =20 @@ -541,11 +554,6 @@ static inline int swp_swapcount(swp_entry_t entry) return 0; } =20 -static inline int folio_alloc_swap(struct folio *folio) -{ - return -EINVAL; -} - static inline bool folio_free_swap(struct folio *folio) { return false; @@ -558,22 +566,6 @@ static inline int add_swap_extent(struct swap_info_str= uct *sis, return -EINVAL; } #endif /* CONFIG_SWAP */ - -static inline int swap_duplicate(swp_entry_t entry) -{ - return swap_duplicate_nr(entry, 1); -} - -static inline void free_swap_and_cache(swp_entry_t entry) -{ - free_swap_and_cache_nr(entry, 1); -} - -static inline void swap_free(swp_entry_t entry) -{ - swap_free_nr(entry, 1); -} - #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/kernel/power/swap.c b/kernel/power/swap.c index 0beff7eeaaba..546a0c701970 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -179,10 +179,10 @@ sector_t alloc_swapdev_block(int swap) { unsigned long offset; =20 - offset =3D swp_offset(get_swap_page_of_type(swap)); + offset =3D swp_offset(swap_alloc_hibernation_slot(swap)); if (offset) { if (swsusp_extents_insert(offset)) - swap_free(swp_entry(swap, offset)); + swap_free_hibernation_slot(swp_entry(swap, offset)); else return swapdev_block(swap, offset); } @@ -197,6 +197,7 @@ sector_t alloc_swapdev_block(int swap) =20 void free_all_swap_pages(int swap) { + unsigned long offset; struct rb_node *node; =20 while ((node =3D swsusp_extents.rb_node)) { @@ -204,8 +205,9 @@ void free_all_swap_pages(int swap) =20 ext =3D rb_entry(node, struct swsusp_extent, node); rb_erase(node, &swsusp_extents); - swap_free_nr(swp_entry(swap, ext->start), - ext->end - ext->start + 1); + + for (offset =3D ext->start; offset < ext->end; offset++) + swap_free_hibernation_slot(swp_entry(swap, offset)); =20 kfree(ext); } diff --git a/mm/madvise.c b/mm/madvise.c index b617b1be0f53..7cd69a02ce84 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -694,7 +694,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned = long addr, max_nr =3D (end - addr) / PAGE_SIZE; nr =3D swap_pte_batch(pte, max_nr, ptent); nr_swap -=3D nr; - free_swap_and_cache_nr(entry, nr); + swap_put_entries_direct(entry, nr); clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); } else if (softleaf_is_hwpoison(entry) || softleaf_is_poison_marker(entry)) { diff --git a/mm/memory.c b/mm/memory.c index ce9f56f77ae5..d89946ad63ec 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -934,7 +934,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
mm= _struct *src_mm, struct page *page; =20 if (likely(softleaf_is_swap(entry))) { - if (swap_duplicate(entry) < 0) + if (swap_dup_entry_direct(entry) < 0) return -EIO; =20 /* make sure dst_mm is on swapoff's mmlist. */ @@ -1744,7 +1744,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gath= er *tlb, =20 nr =3D swap_pte_batch(pte, max_nr, ptent); rss[MM_SWAPENTS] -=3D nr; - free_swap_and_cache_nr(entry, nr); + swap_put_entries_direct(entry, nr); } else if (softleaf_is_migration(entry)) { struct folio *folio =3D softleaf_to_folio(entry); =20 @@ -4933,7 +4933,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* * Some architectures may have to restore extra metadata to the page * when reading from swap. This metadata may be indexed by swap entry - * so this must be called before swap_free(). + * so this must be called before folio_put_swap(). */ arch_swap_restore(folio_swap(entry, folio), folio); =20 @@ -4971,6 +4971,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (unlikely(folio !=3D swapcache)) { folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); + folio_put_swap(swapcache, NULL); } else if (!folio_test_anon(folio)) { /* * We currently only expect !anon folios that are fully @@ -4979,9 +4980,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) !=3D nr_pages, folio); VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); + folio_put_swap(folio, NULL); } else { + VM_WARN_ON_ONCE(nr_pages !=3D 1 && nr_pages !=3D folio_nr_pages(folio)); folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, - rmap_flags); + rmap_flags); + folio_put_swap(folio, nr_pages =3D=3D 1 ? page : NULL); } =20 VM_BUG_ON(!folio_test_anon(folio) || @@ -4995,7 +4999,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * Do it after mapping, so raced page faults will likely see the folio * in swap cache and wait on the folio lock. */ - swap_free_nr(entry, nr_pages); if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags)) folio_free_swap(folio); =20 @@ -5005,7 +5008,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * Hold the lock to avoid the swap entry to be reused * until we take the PT lock for the pte_same() check * (to avoid false positives from pte_same). For - * further safety release the lock after the swap_free + * further safety release the lock after the folio_put_swap * so that the swap count won't change under a * parallel locked swapcache. */ diff --git a/mm/rmap.c b/mm/rmap.c index f955f02d570e..f92c94954049 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -82,6 +82,7 @@ #include =20 #include "internal.h" +#include "swap.h" =20 static struct kmem_cache *anon_vma_cachep; static struct kmem_cache *anon_vma_chain_cachep; @@ -2148,7 +2149,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, goto discard; } =20 - if (swap_duplicate(entry) < 0) { + if (folio_dup_swap(folio, subpage) < 0) { set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } @@ -2159,7 +2160,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, * so we'll not check/care. */ if (arch_unmap_one(mm, vma, address, pteval) < 0) { - swap_free(entry); + folio_put_swap(folio, subpage); set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } @@ -2167,7 +2168,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, /* See folio_try_share_anon_rmap(): clear PTE first. 
*/ if (anon_exclusive && folio_try_share_anon_rmap_pte(folio, subpage)) { - swap_free(entry); + folio_put_swap(folio, subpage); set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } diff --git a/mm/shmem.c b/mm/shmem.c index eb9bd9241f99..56a690e93cc2 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -971,7 +971,7 @@ static long shmem_free_swap(struct address_space *mapping, old =3D xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0); if (old !=3D radswap) return 0; - free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order); + swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order); =20 return 1 << order; } @@ -1654,7 +1654,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, spin_unlock(&shmem_swaplist_lock); } =20 - swap_duplicate_nr(folio->swap, nr_pages); + folio_dup_swap(folio, NULL); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); =20 BUG_ON(folio_mapped(folio)); @@ -1675,7 +1675,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, /* Swap entry might be erased by racing shmem_free_swap() */ if (!error) { shmem_recalc_inode(inode, 0, -nr_pages); - swap_free_nr(folio->swap, nr_pages); + folio_put_swap(folio, NULL); } =20 /* @@ -2161,6 +2161,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, =20 nr_pages =3D folio_nr_pages(folio); folio_wait_writeback(folio); + folio_put_swap(folio, NULL); swap_cache_del_folio(folio); /* * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks @@ -2168,7 +2169,6 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, * in shmem_evict_inode(). */ shmem_recalc_inode(inode, -nr_pages, -nr_pages); - swap_free_nr(swap, nr_pages); } =20 static int shmem_split_large_entry(struct inode *inode, pgoff_t index, @@ -2391,9 +2391,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (sgp =3D=3D SGP_WRITE) folio_mark_accessed(folio); =20 + folio_put_swap(folio, NULL); swap_cache_del_folio(folio); folio_mark_dirty(folio); - swap_free_nr(swap, nr_pages); put_swap_device(si); =20 *foliop =3D folio; diff --git a/mm/swap.h b/mm/swap.h index 6777b2ab9d92..9ed12936b889 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -183,6 +183,28 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci) spin_unlock_irq(&ci->lock); } =20 +/* + * Below are the core routines for doing swap for a folio. + * All helpers require the folio to be locked, and a locked folio + * in the swap cache pins the swap entries / slots allocated to the + * folio; swap relies heavily on the swap cache and folio lock for + * synchronization. + * + * folio_alloc_swap(): the entry point for a folio to be swapped + * out. It allocates swap slots and pins the slots with swap cache. + * The slots start with a swap count of zero. + * + * folio_dup_swap(): increases the swap count of a folio, usually + * when it gets unmapped and a swap entry is installed to replace + * it (e.g., swap entry in page table). A swap slot with swap + * count =3D=3D 0 should only be increased by this helper. + * + * folio_put_swap(): does the opposite of folio_dup_swap().
+ */ +int folio_alloc_swap(struct folio *folio); +int folio_dup_swap(struct folio *folio, struct page *subpage); +void folio_put_swap(struct folio *folio, struct page *subpage); + /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; @@ -363,9 +385,24 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry) return NULL; } =20 +static inline int folio_alloc_swap(struct folio *folio) +{ + return -EINVAL; +} + +static inline int folio_dup_swap(struct folio *folio, struct page *page) +{ + return -EINVAL; +} + +static inline void folio_put_swap(struct folio *folio, struct page *page) +{ +} + static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { } + static inline void swap_write_unplug(struct swap_iocb *sio) { } diff --git a/mm/swapfile.c b/mm/swapfile.c index f7c0a9eb5f04..772356c38b83 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -58,6 +58,9 @@ static void swap_entries_free(struct swap_info_struct *si, swp_entry_t entry, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr); +static bool swap_entries_put_map(struct swap_info_struct *si, + swp_entry_t entry, int nr); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -1482,6 +1485,12 @@ int folio_alloc_swap(struct folio *folio) */ WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); =20 + /* + * Allocator should always allocate aligned entries so folio-based + * operations never cross more than one cluster. + */ + VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio); + return 0; =20 out_free: @@ -1489,6 +1498,66 @@ int folio_alloc_swap(struct folio *folio) return -ENOMEM; } =20 +/** + * folio_dup_swap() - Increase swap count of swap entries of a folio. + * @folio: folio with swap entries bound to it. + * @subpage: if not NULL, only increase the swap count of this subpage. + * + * Typically called when the folio is unmapped and has its swap entry + * take its place. + * + * Context: Caller must ensure the folio is locked and in the swap cache. + * NOTE: The caller also has to ensure there is no raced call to + * swap_put_entries_direct on its swap entry before this helper returns, or + * the swap map may underflow. Currently, we only accept @subpage =3D=3D NULL + * for shmem due to the limitation of swap continuation: shmem always + * duplicates the swap entry only once, so there is no such issue for it. + */ +int folio_dup_swap(struct folio *folio, struct page *subpage) +{ + int err =3D 0; + swp_entry_t entry =3D folio->swap; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); + + if (subpage) { + entry.val +=3D folio_page_idx(folio, subpage); + nr_pages =3D 1; + } + + while (!err && __swap_duplicate(entry, 1, nr_pages) =3D=3D -ENOMEM) + err =3D add_swap_count_continuation(entry, GFP_ATOMIC); + + return err; +} + +/** + * folio_put_swap() - Decrease swap count of swap entries of a folio. + * @folio: folio with swap entries bound to it, must be in swap cache and locked. + * @subpage: if not NULL, only decrease the swap count of this subpage. + * + * This won't free the swap slots even if swap count drops to zero, they are + * still pinned by the swap cache.
User may call folio_free_swap to free t= hem. + * Context: Caller must ensure the folio is locked and in the swap cache. + */ +void folio_put_swap(struct folio *folio, struct page *subpage) +{ + swp_entry_t entry =3D folio->swap; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); + + if (subpage) { + entry.val +=3D folio_page_idx(folio, subpage); + nr_pages =3D 1; + } + + swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages); +} + static struct swap_info_struct *_swap_info_get(swp_entry_t entry) { struct swap_info_struct *si; @@ -1729,28 +1798,6 @@ static void swap_entries_free(struct swap_info_struc= t *si, partial_free_cluster(si, ci); } =20 -/* - * Caller has made sure that the swap device corresponding to entry - * is still around or has not been recycled. - */ -void swap_free_nr(swp_entry_t entry, int nr_pages) -{ - int nr; - struct swap_info_struct *sis; - unsigned long offset =3D swp_offset(entry); - - sis =3D _swap_info_get(entry); - if (!sis) - return; - - while (nr_pages) { - nr =3D min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER= ); - swap_entries_put_map(sis, swp_entry(sis->type, offset), nr); - offset +=3D nr; - nr_pages -=3D nr; - } -} - /* * Called after dropping swapcache to decrease refcnt to swap entries. */ @@ -1939,16 +1986,19 @@ bool folio_free_swap(struct folio *folio) } =20 /** - * free_swap_and_cache_nr() - Release reference on range of swap entries a= nd - * reclaim their cache if no more references re= main. + * swap_put_entries_direct() - Release reference on range of swap entries = and + * reclaim their cache if no more references r= emain. * @entry: First entry of range. * @nr: Number of entries in range. * * For each swap entry in the contiguous range, release a reference. If an= y swap * entries become free, try to reclaim their underlying folios, if present= . The * offset range is defined by [entry.offset, entry.offset + nr). + * + * Context: Caller must ensure there is no race condition on the reference + * owner. e.g., locking the PTL of a PTE containing the entry being releas= ed. 
*/ -void free_swap_and_cache_nr(swp_entry_t entry, int nr) +void swap_put_entries_direct(swp_entry_t entry, int nr) { const unsigned long start_offset =3D swp_offset(entry); const unsigned long end_offset =3D start_offset + nr; @@ -1957,10 +2007,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int n= r) unsigned long offset; =20 si =3D get_swap_device(entry); - if (!si) + if (WARN_ON_ONCE(!si)) return; - - if (WARN_ON(end_offset > si->max)) + if (WARN_ON_ONCE(end_offset > si->max)) goto out; =20 /* @@ -2004,8 +2053,8 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) } =20 #ifdef CONFIG_HIBERNATION - -swp_entry_t get_swap_page_of_type(int type) +/* Allocate a slot for hibernation */ +swp_entry_t swap_alloc_hibernation_slot(int type) { struct swap_info_struct *si =3D swap_type_to_info(type); unsigned long offset; @@ -2033,6 +2082,27 @@ swp_entry_t get_swap_page_of_type(int type) return entry; } =20 +/* Free a slot allocated by swap_alloc_hibernation_slot */ +void swap_free_hibernation_slot(swp_entry_t entry) +{ + struct swap_info_struct *si; + struct swap_cluster_info *ci; + pgoff_t offset =3D swp_offset(entry); + + si =3D get_swap_device(entry); + if (WARN_ON(!si)) + return; + + ci =3D swap_cluster_lock(si, offset); + swap_entry_put_locked(si, ci, entry, 1); + WARN_ON(swap_entry_swapped(si, offset)); + swap_cluster_unlock(ci); + + /* In theory readahead might add it to the swap cache by accident */ + __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); + put_swap_device(si); +} + /* * Find the swap type that corresponds to given device (if any). * @@ -2194,7 +2264,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_= t *pmd, /* * Some architectures may have to restore extra metadata to the page * when reading from swap. This metadata may be indexed by swap entry - * so this must be called before swap_free(). + * so this must be called before folio_put_swap(). */ arch_swap_restore(folio_swap(entry, folio), folio); =20 @@ -2235,7 +2305,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_= t *pmd, new_pte =3D pte_mkuffd_wp(new_pte); setpte: set_pte_at(vma->vm_mm, addr, pte, new_pte); - swap_free(entry); + folio_put_swap(folio, page); out: if (pte) pte_unmap_unlock(pte, ptl); @@ -3745,28 +3815,22 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) return err; } =20 -/** - * swap_duplicate_nr() - Increase reference count of nr contiguous swap en= tries - * by 1. - * +/* + * swap_dup_entry_direct() - Increase reference count of a swap entry by o= ne. * @entry: first swap entry from which we want to increase the refcount. - * @nr: Number of entries in range. * * Returns 0 for success, or -ENOMEM if a swap_count_continuation is requi= red * but could not be atomically allocated. Returns 0, just as if it succee= ded, * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), wh= ich * might occur if a page table entry has got corrupted. * - * Note that we are currently not handling the case where nr > 1 and we ne= ed to - * add swap count continuation. This is OK, because no such user exists - = shmem - * is the only user that can pass nr > 1, and it never re-duplicates any s= wap - * entry it owns. + * Context: Caller must ensure there is no race condition on the reference + * owner. e.g., locking the PTL of a PTE containing the entry being increa= sed. 
*/ -int swap_duplicate_nr(swp_entry_t entry, int nr) +int swap_dup_entry_direct(swp_entry_t entry) { int err =3D 0; - - while (!err && __swap_duplicate(entry, 1, nr) =3D=3D -ENOMEM) + while (!err && __swap_duplicate(entry, 1, 1) =3D=3D -ENOMEM) err =3D add_swap_count_continuation(entry, GFP_ATOMIC); return err; } --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pf1-f169.google.com (mail-pf1-f169.google.com [209.85.210.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B293834F24A for ; Thu, 4 Dec 2025 19:30:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.169 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876651; cv=none; b=biR/rCN5LeSKGSMPemZJEMkVYELVndzuU9iLC7Drdp40KImNOXtXK7V3oz2dk2z8VCtXWErnOwBgqKlq4PUBA5jkAJ5aKqoVbRzD0qTkkqSTOxRD4cc5o1G7xKj63EXohmg+Bw3xhBe44HGLryDNKOVXcXjAiLVpU/094T38mwU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876651; c=relaxed/simple; bh=3RpPuLgzyg/h4VhNO/aK9H70fidj9OBUI5C/T9eAduo=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=AnVkwtim0V0ilqbPUBeLHwkoF1PZlmON5XF2aEvsvDOqx6SEQH9Oa7s6Hp0GV+XupbUGZDu6S3YhOWXaN8JYlpQPwfQc8JSIilAPt15pLyUWBSwzkMc/MkuVpeVQyQSXd5uWfOm6+AqgThlrcgow9ph041x6lpjEileqGbqXodI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=mwZt9D7R; arc=none smtp.client-ip=209.85.210.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="mwZt9D7R" Received: by mail-pf1-f169.google.com with SMTP id d2e1a72fcca58-7aa2170adf9so1153678b3a.0 for ; Thu, 04 Dec 2025 11:30:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876648; x=1765481448; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=8w/Sn1GWGZ3JxXkJdoutyob4Ias8G2wfwgOnm+/mbHA=; b=mwZt9D7RdAnpeZ9EEhqSxp7TaIsZweY5M1GPCHFDxvdDZiDZ0EhUQbJWdqnkW3if3d bTn0w/+bf2W2s/A8FR9w9KVjjYSECDu44maffilKRgXlgSf7myE5FcUDnr8R1QSWh1Re C49t8AnnGAoo1NvuxQBLH+tHcs/l/saN1tvEBLU9mXWfW+I+2+3PTsXZOxnTIuhABLez kNfrOWUzsY2X2Uqdnd1r3UKPrpRcmsAoSJCqroe5oqOnrWXw6Z9tgH0PhfNDIExfp2vD cOYsqUSIR2CZ1iOxKIeB3oLrZJ9uT+Qjv9hcYogbQC7MZ+BLkZkWScfCavLrkCVDgBDE rEUg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764876648; x=1765481448; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=8w/Sn1GWGZ3JxXkJdoutyob4Ias8G2wfwgOnm+/mbHA=; b=TTAK+1OWuI0JJxwy+g4DEeZb+iC6DGXhkb/bS2SAebAkBsErDtpMPZsBTd1aCTboV2 zZ70PxJlF52KrOGtIMniyEKSfskr1BAKOK1nNW4y+ygj1e5fLKDBzAt880Ae4VOCbphH PFEw3PakOMBvCRpHKAzOjbqLzlRXjPwt8ENsh/qkV/lI5XiKXF5uLNoiTZms6EapWjOT 9LRmYcl5xQSq5R4VvkRQyg2O6DbwWqFKZvQC/45Fg66GmCJzdMfhFnsIuaVlSqux1hsZ HSF+ycY8M+WAJDat9lGejiA91BHHBUKl8XGTQNOr6/NgiOzkOiKiKv5UOQXAlcnChLzQ n1nQ== X-Forwarded-Encrypted: 
i=1; AJvYcCWVrOdyPNcjyvcRLrAWt7DHEnVIsLfAoSvLgjk44TjVhe6+Ur3GDpBsFPKmROhdP5jJpdjP0479KlOdrtA=@vger.kernel.org X-Gm-Message-State: AOJu0YwacUBQj8lhZ3/ENm5xqdfoUdCvfVMmA1RQ938XT11eillDDK0W skDe79bz6Ob5veOk60oUGPFZGWP/76QFAMZMYbYLmFUF31hVrCMg+n9M X-Gm-Gg: ASbGncuwNfRRl/fpFAtdLn641hoV1vP3ZlZ2fRb821cti63al8Ckbj20u4sttmHb+8K +++GvXt2cOcQoCTE35uaIcgEBHn7ObN+EGpaNChSvM+x4GlAkCCl4Xkkuf8hQQTPQvYMOew9MH9 AJI19WqDeBSY0jYLPJ71RHf8Upmz9ROPVfBFGq9M7fQ34IaXH/TLIQtl3/I/xYK+7MS+jajVhPH S9z2J3FjW6kZvgdIiac91OEvMZZDXxuh+WTRNXjgBDXQHxHnX1c4mXRwGTwPROfcvulrldUiVmK zSfAu3ycHuzRg0O3oQUpPiYjk9IYAheMjoWW8TLYJ88cHUYde84dSd1bqxPE6T3LWElkjbj8i65 3vPtkGu6eEs1cG4R2O0aI6OQwBR3p27JGtROV+/enx/emycUHMAGsP4ydK66eBCYXwlUTAF5qN6 4vN1HseSBtMWCEiGifhGoxsV3vb3cMCPLhVS44Ce+TxhtrGdMu X-Google-Smtp-Source: AGHT+IGRyFfv1OwsxmrB4sJXk3FPbwdfaQcgh7svGxSCGbyiO4YSLHqhycFBFhQ4NQh0PM9g5RMCJA== X-Received: by 2002:a05:6300:210a:b0:34e:63bd:81b6 with SMTP id adf61e73a8af0-364038a8011mr5080964637.57.1764876647880; Thu, 04 Dec 2025 11:30:47 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:47 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:23 +0800 Subject: [PATCH v4 15/19] mm, swap: add folio to swap cache directly on allocation Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-15-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=18723; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=BxXagGN27AymwvRdBIePUrZRA0RRAtj5IVycx0gjxDs=; b=pF49cXCbStz4Hga0KWpwTB/BQpANlKQWM0Tnm+xrcnsGRhTEmaTStnNNCjaMQEixLZjY3iPvn YhMQUNk2n9VCNpFBa6GdL2groKvWAK4OCQrzTvwOX1gjmnH+iAQ1k3q X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion. This pinning usage here can be dropped by adding the folio to swap cache directly on allocation. All swap allocations are folio-based now (except for hibernation), so the swap allocator can always take the folio as the parameter. And now both swap cache (swap table) and swap map are protected by the cluster lock, scanning the map and inserting the folio can be done in the same critical section. This eliminates the time window that a slot is pinned by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock multiple times. This is both a cleanup and an optimization. 
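As a rough illustration of the resulting calling convention (the wrapper below is hypothetical and not part of this series; folio_alloc_swap(), folio_test_swapcache() and VM_WARN_ON_ONCE_FOLIO() are the interfaces used elsewhere in these patches, and the sketch assumes kernel-internal context with mm/swap.h available):

/*
 * Hypothetical caller sketch: after this change, a successful
 * folio_alloc_swap() has already inserted the folio into the swap
 * cache under the cluster lock, so there is no separate
 * swap_cache_add_folio() step and no window where the slots are
 * held only by SWAP_HAS_CACHE.
 */
static int example_swapout_prepare(struct folio *folio)
{
	/* The folio must be locked, as folio_alloc_swap() requires. */
	if (folio_alloc_swap(folio))
		return -ENOMEM;

	/* Success: folio->swap is set and the folio is in the swap cache. */
	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
	return 0;
}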
Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 5 -- mm/swap.h | 10 +--- mm/swap_state.c | 58 +++++++++++-------- mm/swapfile.c | 161 ++++++++++++++++++++++-------------------------= ---- 4 files changed, 105 insertions(+), 129 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ac3caa4c6999..4b4b81fbc6a3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void) } =20 extern void si_swapinfo(struct sysinfo *); -void put_swap_folio(struct folio *folio, swp_entry_t entry); extern int add_swap_count_continuation(swp_entry_t, gfp_t); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); @@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t= ent, int nr) { } =20 -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp) -{ -} - static inline int __swap_count(swp_entry_t entry) { return 0; diff --git a/mm/swap.h b/mm/swap.h index 9ed12936b889..ec1ef7d0c35b 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct= *si, */ struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, - void **shadow, bool alloc); void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, bool *alloced); /* Below helpers require the caller to lock and pass in the swap cluster. = */ +void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry); void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow); void __swap_cache_replace_folio(struct swap_cluster_info *ci, @@ -459,12 +459,6 @@ static inline void *swap_cache_get_shadow(swp_entry_t = entry) return NULL; } =20 -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t en= try, - void **shadow, bool alloc) -{ - return -ENOENT; -} - static inline void swap_cache_del_folio(struct folio *folio) { } diff --git a/mm/swap_state.c b/mm/swap_state.c index 1a69ba3be87f..f478a16f43e9 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -122,35 +122,56 @@ void *swap_cache_get_shadow(swp_entry_t entry) return NULL; } =20 +void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry) +{ + unsigned long new_tb; + unsigned int ci_start, ci_off, ci_end; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); + + new_tb =3D folio_to_swp_tb(folio); + ci_start =3D swp_cluster_offset(entry); + ci_off =3D ci_start; + ci_end =3D ci_start + nr_pages; + do { + VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off))); + __swap_table_set(ci, ci_off, new_tb); + } while (++ci_off < ci_end); + + folio_ref_add(folio, nr_pages); + folio_set_swapcache(folio); + folio->swap =3D entry; + + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); +} + /** * swap_cache_add_folio - Add a folio into the swap cache. * @folio: The folio to be added. * @entry: The swap entry corresponding to the folio. * @gfp: gfp_mask for XArray node allocation. 
* @shadowp: If a shadow is found, return the shadow. - * @alloc: If it's the allocator that is trying to insert a folio. Allocat= or - * sets SWAP_HAS_CACHE to pin slots before insert so skip map upda= te. * * Context: Caller must ensure @entry is valid and protect the swap device * with reference count or locks. */ -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, - void **shadowp, bool alloc) +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadowp) { int err; void *shadow =3D NULL; + unsigned long old_tb; struct swap_info_struct *si; - unsigned long old_tb, new_tb; struct swap_cluster_info *ci; unsigned int ci_start, ci_off, ci_end, offset; unsigned long nr_pages =3D folio_nr_pages(folio); =20 - VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); - VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); - VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); - si =3D __swap_entry_to_info(entry); - new_tb =3D folio_to_swp_tb(folio); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; ci_off =3D ci_start; @@ -166,7 +187,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry= _t entry, err =3D -EEXIST; goto failed; } - if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset))= )) { + if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) { err =3D -ENOENT; goto failed; } @@ -182,20 +203,11 @@ int swap_cache_add_folio(struct folio *folio, swp_ent= ry_t entry, * Still need to pin the slots with SWAP_HAS_CACHE since * swap allocator depends on that. */ - if (!alloc) - __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); - __swap_table_set(ci, ci_off, new_tb); + __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); offset++; } while (++ci_off < ci_end); - - folio_ref_add(folio, nr_pages); - folio_set_swapcache(folio); - folio->swap =3D entry; + __swap_cache_add_folio(ci, folio, entry); swap_cluster_unlock(ci); - - node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); - lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); - if (shadowp) *shadowp =3D shadow; return 0; @@ -464,7 +476,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, __folio_set_locked(folio); __folio_set_swapbacked(folio); for (;;) { - ret =3D swap_cache_add_folio(folio, entry, &shadow, false); + ret =3D swap_cache_add_folio(folio, entry, &shadow); if (!ret) break; =20 diff --git a/mm/swapfile.c b/mm/swapfile.c index 772356c38b83..aaa8790241a8 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -884,28 +884,57 @@ static void swap_cluster_assert_table_empty(struct sw= ap_cluster_info *ci, } } =20 -static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_c= luster_info *ci, - unsigned int start, unsigned char usage, - unsigned int order) +static bool cluster_alloc_range(struct swap_info_struct *si, + struct swap_cluster_info *ci, + struct folio *folio, + unsigned int offset) { - unsigned int nr_pages =3D 1 << order; + unsigned long nr_pages; + unsigned int order; =20 lockdep_assert_held(&ci->lock); =20 if (!(si->flags & SWP_WRITEOK)) return false; =20 + /* + * All mm swap allocation starts with a folio (folio_alloc_swap), + * it's also the only allocation path for large orders allocation. + * Such swap slots starts with count =3D=3D 0 and will be increased + * upon folio unmap. + * + * Else, it's a exclusive order 0 allocation for hibernation. + * The slot starts with count =3D=3D 1 and never increases. 
+ */ + if (likely(folio)) { + order =3D folio_order(folio); + nr_pages =3D 1 << order; + /* + * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries. + * This is the legacy allocation behavior, will drop it very soon. + */ + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); + __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset)); + } else if (IS_ENABLED(CONFIG_HIBERNATION)) { + order =3D 0; + nr_pages =3D 1; + WARN_ON_ONCE(si->swap_map[offset]); + si->swap_map[offset] =3D 1; + swap_cluster_assert_table_empty(ci, offset, 1); + } else { + /* Allocation without folio is only possible with hibernation */ + WARN_ON_ONCE(1); + return false; + } + /* * The first allocation in a cluster makes the * cluster exclusive to this order */ if (cluster_is_empty(ci)) ci->order =3D order; - - memset(si->swap_map + start, usage, nr_pages); - swap_cluster_assert_table_empty(ci, start, nr_pages); - swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; + swap_range_alloc(si, nr_pages); =20 return true; } @@ -913,13 +942,12 @@ static bool cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster /* Try use a new cluster for current CPU and allocate from it. */ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned int order, - unsigned char usage) + struct folio *folio, unsigned long offset) { unsigned int next =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; unsigned long start =3D ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); + unsigned int order =3D likely(folio) ? folio_order(folio) : 0; unsigned int nr_pages =3D 1 << order; bool need_reclaim, ret, usable; =20 @@ -943,7 +971,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, if (!ret) continue; } - if (!cluster_alloc_range(si, ci, offset, usage, order)) + if (!cluster_alloc_range(si, ci, folio, offset)) break; found =3D offset; offset +=3D nr_pages; @@ -965,8 +993,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, =20 static unsigned int alloc_swap_scan_list(struct swap_info_struct *si, struct list_head *list, - unsigned int order, - unsigned char usage, + struct folio *folio, bool scan_all) { unsigned int found =3D SWAP_ENTRY_INVALID; @@ -978,7 +1005,7 @@ static unsigned int alloc_swap_scan_list(struct swap_i= nfo_struct *si, if (!ci) break; offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, order, usage); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset); if (found) break; } while (scan_all); @@ -1039,10 +1066,11 @@ static void swap_reclaim_work(struct work_struct *w= ork) * Try to allocate swap entries with specified order and try set a new * cluster for current CPU too. */ -static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,= int order, - unsigned char usage) +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, + struct folio *folio) { struct swap_cluster_info *ci; + unsigned int order =3D likely(folio) ? 
folio_order(folio) : 0; unsigned int offset =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; =20 /* @@ -1064,8 +1092,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, - order, usage); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset); } else { swap_cluster_unlock(ci); } @@ -1079,22 +1106,19 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, int o * to spread out the writes. */ if (si->flags & SWP_PAGE_DISCARD) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, order, usage, - false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); if (found) goto done; } =20 if (order < PMD_ORDER) { - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], - order, usage, true); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, = true); if (found) goto done; } =20 if (!(si->flags & SWP_PAGE_DISCARD)) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, order, usage, - false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); if (found) goto done; } @@ -1110,8 +1134,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o * failure is not critical. Scanning one cluster still * keeps the list rotated and reclaimed (for HAS_CACHE). */ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], order, - usage, false); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], folio, fal= se); if (found) goto done; } @@ -1125,13 +1148,11 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, int o * Clusters here have at least one usable slots and can't fail order 0 * allocation, but reclaim may drop si->lock and race with another user. */ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], - 0, usage, true); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true); if (found) goto done; =20 - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], - 0, usage, true); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true= ); if (found) goto done; } @@ -1322,12 +1343,12 @@ static bool get_swap_device_info(struct swap_info_s= truct *si) * Fast path try to get swap entries with specified order from current * CPU's swap entry pool (a cluster). 
*/ -static bool swap_alloc_fast(swp_entry_t *entry, - int order) +static bool swap_alloc_fast(struct folio *folio) { + unsigned int order =3D folio_order(folio); struct swap_cluster_info *ci; struct swap_info_struct *si; - unsigned int offset, found =3D SWAP_ENTRY_INVALID; + unsigned int offset; =20 /* * Once allocated, swap_info_struct will never be completely freed, @@ -1342,22 +1363,18 @@ static bool swap_alloc_fast(swp_entry_t *entry, if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE); - if (found) - *entry =3D swp_entry(si->type, found); + alloc_swap_scan_cluster(si, ci, folio, offset); } else { swap_cluster_unlock(ci); } =20 put_swap_device(si); - return !!found; + return folio_test_swapcache(folio); } =20 /* Rotate the device and switch to a new cluster */ -static void swap_alloc_slow(swp_entry_t *entry, - int order) +static void swap_alloc_slow(struct folio *folio) { - unsigned long offset; struct swap_info_struct *si, *next; =20 spin_lock(&swap_avail_lock); @@ -1367,13 +1384,11 @@ static void swap_alloc_slow(swp_entry_t *entry, plist_requeue(&si->avail_list, &swap_avail_head); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - offset =3D cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE); + cluster_alloc_swap_entry(si, folio); put_swap_device(si); - if (offset) { - *entry =3D swp_entry(si->type, offset); + if (folio_test_swapcache(folio)) return; - } - if (order) + if (folio_test_large(folio)) return; } =20 @@ -1438,7 +1453,6 @@ int folio_alloc_swap(struct folio *folio) { unsigned int order =3D folio_order(folio); unsigned int size =3D 1 << order; - swp_entry_t entry =3D {}; =20 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio); @@ -1463,39 +1477,23 @@ int folio_alloc_swap(struct folio *folio) =20 again: local_lock(&percpu_swap_cluster.lock); - if (!swap_alloc_fast(&entry, order)) - swap_alloc_slow(&entry, order); + if (!swap_alloc_fast(folio)) + swap_alloc_slow(folio); local_unlock(&percpu_swap_cluster.lock); =20 - if (unlikely(!order && !entry.val)) { + if (!order && unlikely(!folio_test_swapcache(folio))) { if (swap_sync_discard()) goto again; } =20 /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */ - if (mem_cgroup_try_charge_swap(folio, entry)) - goto out_free; + if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap))) + swap_cache_del_folio(folio); =20 - if (!entry.val) + if (unlikely(!folio_test_swapcache(folio))) return -ENOMEM; =20 - /* - * Allocator has pinned the slots with SWAP_HAS_CACHE - * so it should never fail - */ - WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); - - /* - * Allocator should always allocate aligned entries so folio based - * operations never crossed more than one cluster. - */ - VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio); - return 0; - -out_free: - put_swap_folio(folio, entry); - return -ENOMEM; } =20 /** @@ -1798,29 +1796,6 @@ static void swap_entries_free(struct swap_info_struc= t *si, partial_free_cluster(si, ci); } =20 -/* - * Called after dropping swapcache to decrease refcnt to swap entries. 
- */ -void put_swap_folio(struct folio *folio, swp_entry_t entry) -{ - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long offset =3D swp_offset(entry); - int size =3D 1 << swap_entry_order(folio_order(folio)); - - si =3D _swap_info_get(entry); - if (!si) - return; - - ci =3D swap_cluster_lock(si, offset); - if (swap_only_has_cache(si, offset, size)) - swap_entries_free(si, ci, entry, size); - else - for (int i =3D 0; i < size; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); - swap_cluster_unlock(ci); -} - int __swap_count(swp_entry_t entry) { struct swap_info_struct *si =3D __swap_entry_to_info(entry); @@ -2071,7 +2046,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type) * with swap table allocation. */ local_lock(&percpu_swap_cluster.lock); - offset =3D cluster_alloc_swap_entry(si, 0, 1); + offset =3D cluster_alloc_swap_entry(si, NULL); local_unlock(&percpu_swap_cluster.lock); if (offset) entry =3D swp_entry(si->type, offset); --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pf1-f171.google.com (mail-pf1-f171.google.com [209.85.210.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6288A34F482 for ; Thu, 4 Dec 2025 19:30:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876655; cv=none; b=Q/xt+ezarMr/fH0HvwLThkkQ1sBGKkIuc3jDja3Aoo2NK5ALizmZjwHb5sKoDFofV++foy3zIEHl9lQi02FoGFjCv/ghJDGXX9D5pKvWRefhVLHj2My9QXwgVL3Tau+vKOe1IV5J+dfQBFImawI631NYPQ+3ntsyWurWruz+mbA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876655; c=relaxed/simple; bh=jCV1Q+M9Nv3xegku8uHENy+iBGpOXNdcIfiLgMZZqqk=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=Ld4xogH85l5tteJOyuBTTgCrjHmFmLStb80Tz+Y258pt+bvf5bCZcb1fRXhOHAIYP95U0P94Kc8WTsvX6xMg+jwJ/R6tfa4RQFU+4JVaTzvoilFTtXToQT8zmX+A35dlROHvPidWZ3UnszRZanG/POaR4xQY/uMmHcbdK4p2PBY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=XXO9xl16; arc=none smtp.client-ip=209.85.210.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="XXO9xl16" Received: by mail-pf1-f171.google.com with SMTP id d2e1a72fcca58-7b7828bf7bcso1511606b3a.2 for ; Thu, 04 Dec 2025 11:30:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876652; x=1765481452; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=Z+OV+xfMG8vEK+0t4KO8+hqW6O39H5Rzxm6GGnh7ZqQ=; b=XXO9xl16cFtrPVidPiiwT3AOpkTMnBSvoA7xJmLsnbaVXrItIb94NaqURYVwDMMECA gch7Qdjo8shVDmf1GeMR05FSdZ4U5LIXp3Q31CWadVQpCgRa/bINCFSp2rT+57cQyjFT wnx0w9qgw06CKY5fvkIgWo70UFs93TdRWxh1g1uvE4dhQBqU400HFGVMOQsJ0Mma+Uue robrKO3zI9KGyd96kyEXufIluxgxmwF/HZDSPBHoKMAB1uvSzGVOBbLVK6gBIV+rZL+n hGU51H41T7U2dVMJ88gtsvL8HN9WQXOIYGR/bTrRZQrbWnXWoPzyCZ31gd/6RdF9jkov 0J9A== X-Google-DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764876652; x=1765481452; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=Z+OV+xfMG8vEK+0t4KO8+hqW6O39H5Rzxm6GGnh7ZqQ=; b=p0ZT+W23QcD4Vi5SRj50oLUQ2fZv0ApDRoqP9b8srBNuhCr1fi/j9LxKKKY2MLuw4z Y6wi4YaK/1GRV0PCKgT0et6aUq7Ko2WRD7r5Z1H6YhRTE9q1YcGnZmCWo4ZZIQtK2Zo/ Y0DMjPhQLD7ZNlZrqd1mXBsz6lVY/eB5l8YG5yXg18TDxJlEAd7PKJixwJq/x1qiGkpX sLsObBAueibESpBTy/QaBOnwLr/heikQwXZqHPHpAcz55Raeek1hErwEL/929G9znypc n0DoZ5dvp0JsQUPxb8M5YnbZyIj7L30cD3Us+alK1f9JfpaSy9Jq+ORW1MTsDVT6RJ41 SPXg== X-Forwarded-Encrypted: i=1; AJvYcCUBDLv8LfBqruRq9F3bUIbGqL1Um+5ZqDHacblQ333h+7sNXYq6i2/I6X0dgiYOLmKo/glw0HZZ4RMdaHQ=@vger.kernel.org X-Gm-Message-State: AOJu0YwrHQR43ITDvggflkmR9Zt/6iQa5rejkTj1k/aTW8NKOtJwrFNt 0hz7OVnuaGQctdHbgDvCzZXYlkE+A9HBDIIKrMU+Q2hvYV2Y55c+Skfw X-Gm-Gg: ASbGncsei7Y1+ryg9KMGrRzCn+DgPRS5hbD/xDHU+iBH7YLtXsvHAiLRa0CzEShFjZS 37Amvao6C8H7aSmF4F7rqgSmue2ba7Sf9p0DWFxl6syx7/5sRgRJ4TfbZe1LldvcSgGn58xCdij 3/+OCO4Cp3fmygA4+iUv6j8PLrr/kxYF5/X87huE+dMlrPTyzuhoQXuW3ejOS7RMDNJdo5eIGWV 4BQnYkovps8t0cyULepqzGXck51LpxxGFrDg+OO38UDf9MACf1LwE97K8QLezgsqQWnwfhkxUYO T2cOlVLYzA6Ll9LwHsa1m3LtBkFZ+9Q/fm7qaBPqLWrmEl0WVSrA8ek+6F/UHrsAySWvZCY2D6E c7Qlg+ziVgABd6NL18NkyX7GTfLmPnvwulHdhf9dgGa2nD3uNuw78+UXWR6i0JO7x/nzWWdGOC6 8CLs4VeMprEg3xltYaYNC+jWhHvHqK3jIcG+0aO8p3UYDv/R4T X-Google-Smtp-Source: AGHT+IFrS4+YI1PtHu4EckGeptTeEt08RBywVMSqzRAmb+q3EmeaVDzM97Qx8rxRv/8t3S8At73Uvw== X-Received: by 2002:a05:6a20:158d:b0:35d:3bcf:e518 with SMTP id adf61e73a8af0-363f5bd10bdmr8778613637.0.1764876652406; Thu, 04 Dec 2025 11:30:52 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:51 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:24 +0800 Subject: [PATCH v4 16/19] mm, swap: check swap table directly for checking cache Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-16-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=7912; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=VV8qeY6YSf7kGi1o5LI7gaOuEu3UiPZDdvkv5SrF01I=; b=ocSLlXivLRrvpyKbP8eKxqsuGmMYp+DRkuAoS4awZfncg3+dcSmilhcrfoxNhmGF8DtGfm73a 1Jd4UXJDxdwCddAP2vi11cwknzaDppHMO2+KTFKuhwD6QBl8DZCoKuY X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song Instead of looking at the swap map, check swap table directly to tell if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE. 
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swap.h | 11 ++++++++--- mm/swap_state.c | 16 ++++++++++++++++ mm/swapfile.c | 55 +++++++++++++++++++++++++++++-----------------------= --- mm/userfaultfd.c | 10 +++------- 4 files changed, 56 insertions(+), 36 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index ec1ef7d0c35b..3692e143eeba 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *= si, * swap entries in the page table, similar to locking swap cache folio. * - See the comment of get_swap_device() for more complex usage. */ +bool swap_cache_has_folio(swp_entry_t entry); struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); void swap_cache_del_folio(struct folio *folio); @@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry,= int max_nr, =20 static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) { - struct swap_info_struct *si =3D __swap_entry_to_info(entry); - pgoff_t offset =3D swp_offset(entry); int i; =20 /* @@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry= , int max_nr) * be in conflict with the folio in swap cache. */ for (i =3D 0; i < max_nr; i++) { - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE)) + if (swap_cache_has_folio(entry)) return i; + entry.val++; } =20 return i; @@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio, return 0; } =20 +static inline bool swap_cache_has_folio(swp_entry_t entry) +{ + return false; +} + static inline struct folio *swap_cache_get_folio(swp_entry_t entry) { return NULL; diff --git a/mm/swap_state.c b/mm/swap_state.c index f478a16f43e9..6bf7556ca408 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry) return NULL; } =20 +/** + * swap_cache_has_folio - Check if a swap slot has cache. + * @entry: swap entry indicating the slot. + * + * Context: Caller must ensure @entry is valid and protect the swap + * device with reference count or locks. + */ +bool swap_cache_has_folio(swp_entry_t entry) +{ + unsigned long swp_tb; + + swp_tb =3D swap_table_get(__swap_entry_to_cluster(entry), + swp_cluster_offset(entry)); + return swp_tb_is_folio(swp_tb); +} + /** * swap_cache_get_shadow - Looks up a shadow in the swap cache. * @entry: swap entry used for the lookup. diff --git a/mm/swapfile.c b/mm/swapfile.c index aaa8790241a8..2cb3bfef3234 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -792,23 +792,18 @@ static bool cluster_reclaim_range(struct swap_info_st= ruct *si, unsigned int nr_pages =3D 1 << order; unsigned long offset =3D start, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - int nr_reclaim; + unsigned long swp_tb; =20 spin_unlock(&ci->lock); do { - switch (READ_ONCE(map[offset])) { - case 0: + if (swap_count(READ_ONCE(map[offset]))) break; - case SWAP_HAS_CACHE: - nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); - if (nr_reclaim < 0) - goto out; - break; - default: - goto out; + swp_tb =3D swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swp_tb_is_folio(swp_tb)) { + if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0) + break; } } while (++offset < end); -out: spin_lock(&ci->lock); =20 /* @@ -829,37 +824,41 @@ static bool cluster_reclaim_range(struct swap_info_st= ruct *si, * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. 
*/ - for (offset =3D start; offset < end; offset++) - if (READ_ONCE(map[offset])) + for (offset =3D start; offset < end; offset++) { + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb)) return false; + } =20 return true; } =20 static bool cluster_scan_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages, + unsigned long offset, unsigned int nr_pages, bool *need_reclaim) { - unsigned long offset, end =3D start + nr_pages; + unsigned long end =3D offset + nr_pages; unsigned char *map =3D si->swap_map; + unsigned long swp_tb; =20 if (cluster_is_empty(ci)) return true; =20 - for (offset =3D start; offset < end; offset++) { - switch (READ_ONCE(map[offset])) { - case 0: - continue; - case SWAP_HAS_CACHE: + do { + if (swap_count(map[offset])) + return false; + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swp_tb_is_folio(swp_tb)) { + WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE)); if (!vm_swap_full()) return false; *need_reclaim =3D true; - continue; - default: - return false; + } else { + /* A entry with no count and no cache must be null */ + VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); } - } + } while (++offset < end); =20 return true; } @@ -1030,7 +1029,8 @@ static void swap_reclaim_full_clusters(struct swap_in= fo_struct *si, bool force) to_scan--; =20 while (offset < end) { - if (READ_ONCE(map[offset]) =3D=3D SWAP_HAS_CACHE) { + if (!swap_count(READ_ONCE(map[offset])) && + swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); @@ -1980,6 +1980,7 @@ void swap_put_entries_direct(swp_entry_t entry, int n= r) struct swap_info_struct *si; bool any_only_cache =3D false; unsigned long offset; + unsigned long swp_tb; =20 si =3D get_swap_device(entry); if (WARN_ON_ONCE(!si)) @@ -2004,7 +2005,9 @@ void swap_put_entries_direct(swp_entry_t entry, int n= r) */ for (offset =3D start_offset; offset < end_offset; offset +=3D nr) { nr =3D 1; - if (READ_ONCE(si->swap_map[offset]) =3D=3D SWAP_HAS_CACHE) { + swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, offset), + offset % SWAPFILE_CLUSTER); + if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_= tb)) { /* * Folios are always naturally aligned in swap so * advance forward to the next boundary. Zero means no diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e6dfd5f28acd..3f28aa319988 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1190,17 +1190,13 @@ static int move_swap_pte(struct mm_struct *mm, stru= ct vm_area_struct *dst_vma, * Check if the swap entry is cached after acquiring the src_pte * lock. Otherwise, we might miss a newly loaded swap cache folio. * - * Check swap_map directly to minimize overhead, READ_ONCE is sufficient. * We are trying to catch newly added swap cache, the only possible case= is * when a folio is swapped in and out again staying in swap cache, using= the * same entry before the PTE check above. The PTL is acquired and releas= ed - * twice, each time after updating the swap_map's flag. So holding - * the PTL here ensures we see the updated value. False positive is poss= ible, - * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the - * cache, or during the tiny synchronization window between swap cache a= nd - * swap_map, but it will be gone very quickly, worst result is retry jit= ters. + * twice, each time after updating the swap table. 
So holding + * the PTL here ensures we see the updated value. */ - if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) { + if (swap_cache_has_folio(entry)) { double_pt_unlock(dst_ptl, src_ptl); return -EAGAIN; } --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pf1-f171.google.com (mail-pf1-f171.google.com [209.85.210.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C28C034FF4D for ; Thu, 4 Dec 2025 19:30:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876660; cv=none; b=kA1hAAQc9aT5uY2kx2lVwbODCH+M5svl2XjkXC2u2YouddpsKBI/1jqAeDNhOSvkYr/RPdnTGGkoZxve1x0696nsKfXlEgVRPX2oym6NC70RVcIx14YXEweyhxuUkNlRHPZy2056HRQAel3aQ7Eqyj7l75UJREL4qIzplZJ7iW8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876660; c=relaxed/simple; bh=8p+i2qxvOkiX01G6y0bwdR9yWLVWTCZNGOGGD3IB7gk=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=s3KMtWhXmQJ/Ieuho+DAb/gU5qD0CuDFJfVKMGcdl59jOH4NtQxakAVAcDoRUY0wE8I2t4gUhVk3C8P0SwP31hLL7s7SR2UsDd2ZUbABgo60zOpjUIg5MKSWzeXJ/k1ilsEBEpMGWvO9a6CANGCMIUqyntU0A6cEbXlGTtC+VY8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=Wqx/Pwvm; arc=none smtp.client-ip=209.85.210.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Wqx/Pwvm" Received: by mail-pf1-f171.google.com with SMTP id d2e1a72fcca58-7bb710d1d1dso2056465b3a.1 for ; Thu, 04 Dec 2025 11:30:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764876657; x=1765481457; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=I2hHCOKTmVEJWu9vcQFd3L/+raWQ+BsHL/zgl/R7UbQ=; b=Wqx/PwvmZLoo295LH7SAMuCXfmlbUo3Qa3C/RWs4/ikZ4hYPZ3YCVxt9O3qZufcP+2 HUcECV2FTwG/QV18urHDaaYZEy23GqLDfiCS7YkEQidFyme/bvrRDix7ZQTmmJo/YKal FbHL+3vTAv5PhmtfsOOgP2eIZsp+d8vkdhlq4v/enKTSnGraB6aH2augAljrdEaXmOYS DxW5LC8Cgse0hmIc6mzvMy/n9HuYo/hYSSsKH1p1uJphLYOoKxTLoiTY2e4e99QmCz6d QXygxMKUvtpVfKOfhjjcvQGBFKreYqvhUyS1HRaMLxD5pZI5iNIJcGZnn+ejI00QIsTb 7KJQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764876657; x=1765481457; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=I2hHCOKTmVEJWu9vcQFd3L/+raWQ+BsHL/zgl/R7UbQ=; b=P9v5dv0RU49VDN2ILPLDzELil2f8uqeY1dunwcqA5zW7t3ACsVpNzG9gjSz4YZrTL4 /Cc7vSvg9EbrXivVNwumkxZrTdLg46GS8UtyGVGw0I17wrPMhVImlk7OhqQ79q19CzQO fgMFB+RQItDxTv3vRAla6RTVkIT0PLpPprQkEHJHlzwD1nxrdSnI6UCbFBHZylVMLFYn whNPHv4p0BWiVbyNr2lz3SRQ86pqj4C9nqcDDCsTf/io4HBnIf12U+KGAZKCUyfG+Ulc dybYlgCIAEbhUMi+SNns+a/uZs+Tzrik+tTMhkK0VE/qU05o3ownGNaWzT/6fYwD7qVr IkTg== X-Forwarded-Encrypted: i=1; 
AJvYcCWiPNNkMruXmk2QAjZnD24vyddtL5nO2UsncKQjNb2RZWLUPDLEKYoX2u9s1FqfeiVyHG+mIrVYSQBGwCM=@vger.kernel.org X-Gm-Message-State: AOJu0YzXZCrrwTaiyfK9nb7dUuLsBH5kJ/XZ28fR5gR7qqaFJ7haBI+I fy/2+gMydiacmrjLOigy0m3OtJu8rYvC4XEVb95fo1/D+D9R2eZxSgnP X-Gm-Gg: ASbGncv7c6C0YiNyxtHTJmvBqHZ5/QicvGS240UtOq7+/AuUcfPpgEC1qNWlCuC4ndZ 6y5XBWiQUC26YNQaCnTuz262RYfnB5eDeXtKGFDfyhA5obClZPlID6m9i7+AD/5FCq6h+fFBfTQ OxGAxFu91p2JMNls6XaIpByXjWU1AiA6h1rxncZjCsNyQl6ydzSmP3OCWdUofwdWtbQHdPAcN0t txDljfjW4+J4z7JkUUT2fsQQhzJ57uIqlO5RYm7gUywgXDKsPTPuJ9E3alRXi+ohwwdpp4HnuZL wz0X5zp+uB9e4cG09hMSE700ai5bDyRuaGA28tPRcZLiP6zzM35PmepcrIoab5loiT0bhh1Gfbl 30u0G90ZkN7zB5LzJLPL6ZDiFHTTqCWvzBC1wO6fpm/wNf8hFHFfmAhrbYvZefHfMzxQV1ApBUZ fahzqO4ZViEy1lh6DbofiCPxd4wi0UQIyl1FJPmignlVanbjZS X-Google-Smtp-Source: AGHT+IHRYBU2rcfuoo+jDuYxCOKhAYKI5obhnSagjEUUdYlBxzbom4/kWKQt1ICAZfdTv2/tSDI42g== X-Received: by 2002:a05:6300:210f:b0:35e:fce6:46e7 with SMTP id adf61e73a8af0-36403764175mr5062454637.5.1764876657054; Thu, 04 Dec 2025 11:30:57 -0800 (PST) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bf686b3b5a9sm2552926a12.9.2025.12.04.11.30.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 11:30:56 -0800 (PST) From: Kairui Song Date: Fri, 05 Dec 2025 03:29:25 +0800 Subject: [PATCH v4 17/19] mm, swap: clean up and improve swap entries freeing Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251205-swap-table-p2-v4-17-cb7e28a26a40@tencent.com> References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Yosry Ahmed , David Hildenbrand , Johannes Weiner , Youngjun Park , Hugh Dickins , Baolin Wang , Ying Huang , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1764876574; l=13631; i=kasong@tencent.com; s=kasong-sign-tencent; h=from:subject:message-id; bh=7RXtD1FvXy+eheLFQr5n4KvdqXdBT1d3xCF6452la6M=; b=i5fheAkj3sEsqTEkvqEBaZdZnbJmynpFZ+dO+qcQpzqzJ6jamSNO9id2I9ZFnfMFM0UgV7w4b ulNnmBskNANBEST0e8/3yR/9KAVldoehEUUG7gFBmh/0dX64LRs57cK X-Developer-Key: i=kasong@tencent.com; a=ed25519; pk=kCdoBuwrYph+KrkJnrr7Sm1pwwhGDdZKcKrqiK8Y1mI= From: Kairui Song There are a few problems with the current freeing of swap entries. When freeing a set of swap entries directly (swap_put_entries_direct, typically from zapping the page table), it scans the whole swap region multiple times. First, it scans the whole region to check if it can be batch freed and if there is any cached folio. Then do a batch free only if the whole region's swap count equals 1. And if any entry is cached, even if only one, it will have to walk the whole region again to clean up the cache. And if any entry is not in a consistent status with other entries, it will fall back to order 0 freeing. For example, if only one of them is cached, the batch free will fall back. And the current batch freeing workflow relies on the swap map's SWAP_HAS_CACHE bit for both continuous checking and batch freeing, which isn't compatible with the swap table design. Tidy this up, introduce a new cluster scoped helper for all swap entry freeing job. 
It will batch frees all continuous entries, and just start a new batch if any inconsistent entry is found. This may improve the batch size when the clusters are fragmented. This should also be more robust with more sanity checks, and make it clear that a slot pinned by swap cache will be cleared upon cache reclaim. And the cache reclaim scan is also now limited to each cluster. If a cluster has any clean swap cache left after putting the swap count, reclaim the cluster only instead of the whole region. And since a folio's entries are always in the same cluster, putting swap entries from a folio can also use the new helper directly. This should be both an optimization and a cleanup, and the new helper is adapted to the swap table. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 238 +++++++++++++++++++++++-------------------------------= ---- 1 file changed, 96 insertions(+), 142 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 2cb3bfef3234..979f0c562115 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -55,12 +55,14 @@ static bool swap_count_continued(struct swap_info_struc= t *, pgoff_t, static void free_swap_count_continuations(struct swap_info_struct *); static void swap_entries_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages); + unsigned long start, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); -static bool swap_entries_put_map(struct swap_info_struct *si, - swp_entry_t entry, int nr); +static void swap_put_entry_locked(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -197,25 +199,6 @@ static bool swap_only_has_cache(struct swap_info_struc= t *si, return true; } =20 -static bool swap_is_last_map(struct swap_info_struct *si, - unsigned long offset, int nr_pages, bool *has_cache) -{ - unsigned char *map =3D si->swap_map + offset; - unsigned char *map_end =3D map + nr_pages; - unsigned char count =3D *map; - - if (swap_count(count) !=3D 1) - return false; - - while (++map < map_end) { - if (*map !=3D count) - return false; - } - - *has_cache =3D !!(count & SWAP_HAS_CACHE); - return true; -} - /* * returns number of pages in the folio that backs the swap entry. If posi= tive, * the folio was reclaimed. If negative, the folio was not reclaimed. If 0= , no @@ -1439,6 +1422,76 @@ static bool swap_sync_discard(void) return false; } =20 +/** + * swap_put_entries_cluster - Decrease the swap count of a set of slots. + * @si: The swap device. + * @start: start offset of slots. + * @nr: number of slots. + * @reclaim_cache: if true, also reclaim the swap cache. + * + * This helper decreases the swap count of a set of slots and tries to + * batch free them. Also reclaims the swap cache if @reclaim_cache is true. + * Context: The caller must ensure that all slots belong to the same + * cluster and their swap count doesn't go underflow. 
+ */ +static void swap_put_entries_cluster(struct swap_info_struct *si, + unsigned long start, int nr, + bool reclaim_cache) +{ + unsigned long offset =3D start, end =3D start + nr; + unsigned long batch_start =3D SWAP_ENTRY_INVALID; + struct swap_cluster_info *ci; + bool need_reclaim =3D false; + unsigned int nr_reclaimed; + unsigned long swp_tb; + unsigned int count; + + ci =3D swap_cluster_lock(si, offset); + do { + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + count =3D si->swap_map[offset]; + VM_WARN_ON(swap_count(count) < 1 || count =3D=3D SWAP_MAP_BAD); + if (swap_count(count) =3D=3D 1) { + /* count =3D=3D 1 and non-cached slots will be batch freed. */ + if (!swp_tb_is_folio(swp_tb)) { + if (!batch_start) + batch_start =3D offset; + continue; + } + /* count will be 0 after put, slot can be reclaimed */ + VM_WARN_ON(!(count & SWAP_HAS_CACHE)); + need_reclaim =3D true; + } + /* + * A count !=3D 1 or cached slot can't be freed. Put its swap + * count and then free the interrupted pending batch. Cached + * slots will be freed when folio is removed from swap cache + * (__swap_cache_del_folio). + */ + swap_put_entry_locked(si, ci, offset, 1); + if (batch_start) { + swap_entries_free(si, ci, batch_start, offset - batch_start); + batch_start =3D SWAP_ENTRY_INVALID; + } + } while (++offset < end); + + if (batch_start) + swap_entries_free(si, ci, batch_start, offset - batch_start); + swap_cluster_unlock(ci); + + if (!need_reclaim || !reclaim_cache) + return; + + offset =3D start; + do { + nr_reclaimed =3D __try_to_reclaim_swap(si, offset, + TTRS_UNMAPPED | TTRS_FULL); + offset++; + if (nr_reclaimed) + offset =3D round_up(offset, abs(nr_reclaimed)); + } while (offset < end); +} + /** * folio_alloc_swap - allocate swap space for a folio * @folio: folio we want to move to swap @@ -1544,6 +1597,7 @@ void folio_put_swap(struct folio *folio, struct page = *subpage) { swp_entry_t entry =3D folio->swap; unsigned long nr_pages =3D folio_nr_pages(folio); + struct swap_info_struct *si =3D __swap_entry_to_info(entry); =20 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); @@ -1553,7 +1607,7 @@ void folio_put_swap(struct folio *folio, struct page = *subpage) nr_pages =3D 1; } =20 - swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages); + swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false); } =20 static struct swap_info_struct *_swap_info_get(swp_entry_t entry) @@ -1590,12 +1644,11 @@ static struct swap_info_struct *_swap_info_get(swp_= entry_t entry) return NULL; } =20 -static unsigned char swap_entry_put_locked(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, - unsigned char usage) +static void swap_put_entry_locked(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage) { - unsigned long offset =3D swp_offset(entry); unsigned char count; unsigned char has_cache; =20 @@ -1621,9 +1674,7 @@ static unsigned char swap_entry_put_locked(struct swa= p_info_struct *si, if (usage) WRITE_ONCE(si->swap_map[offset], usage); else - swap_entries_free(si, ci, entry, 1); - - return usage; + swap_entries_free(si, ci, offset, 1); } =20 /* @@ -1691,70 +1742,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t= entry) return NULL; } =20 -static bool swap_entries_put_map(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - unsigned long offset =3D swp_offset(entry); - struct swap_cluster_info *ci; - bool has_cache =3D false; - 
unsigned char count; - int i; - - if (nr <=3D 1) - goto fallback; - count =3D swap_count(data_race(si->swap_map[offset])); - if (count !=3D 1) - goto fallback; - - ci =3D swap_cluster_lock(si, offset); - if (!swap_is_last_map(si, offset, nr, &has_cache)) { - goto locked_fallback; - } - if (!has_cache) - swap_entries_free(si, ci, entry, nr); - else - for (i =3D 0; i < nr; i++) - WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); - swap_cluster_unlock(ci); - - return has_cache; - -fallback: - ci =3D swap_cluster_lock(si, offset); -locked_fallback: - for (i =3D 0; i < nr; i++, entry.val++) { - count =3D swap_entry_put_locked(si, ci, entry, 1); - if (count =3D=3D SWAP_HAS_CACHE) - has_cache =3D true; - } - swap_cluster_unlock(ci); - return has_cache; -} - -/* - * Only functions with "_nr" suffix are able to free entries spanning - * cross multi clusters, so ensure the range is within a single cluster - * when freeing entries with functions without "_nr" suffix. - */ -static bool swap_entries_put_map_nr(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - int cluster_nr, cluster_rest; - unsigned long offset =3D swp_offset(entry); - bool has_cache =3D false; - - cluster_rest =3D SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER; - while (nr) { - cluster_nr =3D min(nr, cluster_rest); - has_cache |=3D swap_entries_put_map(si, entry, cluster_nr); - cluster_rest =3D SWAPFILE_CLUSTER; - nr -=3D cluster_nr; - entry.val +=3D cluster_nr; - } - - return has_cache; -} - /* * Check if it's the last ref of swap entry in the freeing path. */ @@ -1769,9 +1756,9 @@ static inline bool __maybe_unused swap_is_last_ref(un= signed char count) */ static void swap_entries_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages) + unsigned long offset, unsigned int nr_pages) { - unsigned long offset =3D swp_offset(entry); + swp_entry_t entry =3D swp_entry(si->type, offset); unsigned char *map =3D si->swap_map + offset; unsigned char *map_end =3D map + nr_pages; =20 @@ -1977,10 +1964,8 @@ void swap_put_entries_direct(swp_entry_t entry, int = nr) { const unsigned long start_offset =3D swp_offset(entry); const unsigned long end_offset =3D start_offset + nr; + unsigned long offset, cluster_end; struct swap_info_struct *si; - bool any_only_cache =3D false; - unsigned long offset; - unsigned long swp_tb; =20 si =3D get_swap_device(entry); if (WARN_ON_ONCE(!si)) @@ -1988,44 +1973,13 @@ void swap_put_entries_direct(swp_entry_t entry, int= nr) if (WARN_ON_ONCE(end_offset > si->max)) goto out; =20 - /* - * First free all entries in the range. - */ - any_only_cache =3D swap_entries_put_map_nr(si, entry, nr); - - /* - * Short-circuit the below loop if none of the entries had their - * reference drop to zero. - */ - if (!any_only_cache) - goto out; - - /* - * Now go back over the range trying to reclaim the swap cache. - */ - for (offset =3D start_offset; offset < end_offset; offset +=3D nr) { - nr =3D 1; - swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, offset), - offset % SWAPFILE_CLUSTER); - if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_= tb)) { - /* - * Folios are always naturally aligned in swap so - * advance forward to the next boundary. Zero means no - * folio was found for the swap entry, so advance by 1 - * in this case. Negative value means folio was found - * but could not be reclaimed. Here we can still advance - * to the next boundary. 
- */ - nr =3D __try_to_reclaim_swap(si, offset, - TTRS_UNMAPPED | TTRS_FULL); - if (nr =3D=3D 0) - nr =3D 1; - else if (nr < 0) - nr =3D -nr; - nr =3D ALIGN(offset + 1, nr) - offset; - } - } - + /* Put entries and reclaim cache in each cluster */ + offset =3D start_offset; + do { + cluster_end =3D min(round_up(offset + 1, SWAPFILE_CLUSTER), end_offset); + swap_put_entries_cluster(si, offset, cluster_end - offset, true); + offset =3D cluster_end; + } while (offset < end_offset); out: put_swap_device(si); } @@ -2072,7 +2026,7 @@ void swap_free_hibernation_slot(swp_entry_t entry) return; =20 ci =3D swap_cluster_lock(si, offset); - swap_entry_put_locked(si, ci, entry, 1); + swap_put_entry_locked(si, ci, offset, 1); WARN_ON(swap_entry_swapped(si, offset)); swap_cluster_unlock(ci); =20 @@ -3827,10 +3781,10 @@ void __swapcache_clear_cached(struct swap_info_stru= ct *si, swp_entry_t entry, unsigned int nr) { if (swap_only_has_cache(si, swp_offset(entry), nr)) { - swap_entries_free(si, ci, entry, nr); + swap_entries_free(si, ci, swp_offset(entry), nr); } else { for (int i =3D 0; i < nr; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE); } } =20 @@ -3951,7 +3905,7 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) * into, carry if so, or else fail until a new continuation page is alloca= ted; * when the original swap_map count is decremented from 0 with continuatio= n, * borrow from the continuation and report whether it still holds more. - * Called while __swap_duplicate() or caller of swap_entry_put_locked() + * Called while __swap_duplicate() or caller of swap_put_entry_locked() * holds cluster lock. */ static bool swap_count_continued(struct swap_info_struct *si, --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pf1-f172.google.com (mail-pf1-f172.google.com [209.85.210.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id ADCB434F46C for ; Thu, 4 Dec 2025 19:31:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.172 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876665; cv=none; b=db8/JN6nD16DMeCr79JM1+xBcAWYHMwrV7vzmzWV/4ugvbhf68WVDV7DyJtgFwrl4s/Wp4r6tgY/8vq/6e9O11mi4vaECcvzYNn4leZ+Nyw2WhHWqxpe6zY1b646VqIR9yQTz4u1vKRdIPc3LB8EAfRvmlKLU0IF/tg2mRZjnMw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876665; c=relaxed/simple; bh=MKHpT4fJM9GxIHwl26NrXn8F4Z1s5o5YNZTQXsy6vzw=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=rV/OGKXAKaYzONfWdRXDIn6BZDnfnvq1B3yoXQPcZ0e3s4/OmxucaizzgDedC1Bj+CLMAG/VQge/aCxw91J82ZDESevkE1kEcDh9J52+0DJd7R1At2U1lrb8g/ejF7vKF5Z60YBY4vOuc9w7/i09Tpre9D0L8V2rXKI5fsAPprg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=GNMKBheC; arc=none smtp.client-ip=209.85.210.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="GNMKBheC" Received: by mail-pf1-f172.google.com with 
SMTP id d2e1a72fcca58-7a9c64dfa8aso1121241b3a.3; Thu, 04 Dec 2025 11:31:02 -0800 (PST)
From: Kairui Song
Date: Fri, 05 Dec 2025 03:29:26 +0800
Subject: [PATCH v4 18/19] mm, swap: drop the SWAP_HAS_CACHE flag
Message-Id: <20251205-swap-table-p2-v4-18-cb7e28a26a40@tencent.com>
References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

Now that the swap cache is managed by the swap table, all swap cache users
already check the swap table directly for the swap cache state.
SWAP_HAS_CACHE is only left as a temporary pin in two places: before the
first 0 -> 1 increase of a slot's swap count (swap_dup_entries), and before
the final free of slots pinned by a folio in the swap cache
(put_swap_folio). Drop both usages.

For the first dup, the SWAP_HAS_CACHE pin was hard to remove because the
flag used to carry multiple meanings beyond "this slot is cached". That has
been simplified: the first dup is now always done with the folio locked in
the swap cache (folio_dup_swap), so it can check the swap cache (the swap
table) directly.

As for freeing, just let the swap cache free all swap entries of a folio
whose swap count is zero directly upon folio removal. Freeing has also just
been cleaned up to cover the swap cache usage in the swap table: a slot
that still has swap cache will not be freed until its cache is gone. With
folio removal and slot freeing now done in the same critical section, this
should improve performance and gets rid of the SWAP_HAS_CACHE pin.

After these two changes, SWAP_HAS_CACHE has no users left. Remove all
related logic and helpers. swap_map now only tracks the count, so its users
can read it directly instead of going through the swap_count helper, which
previously existed to mask out the SWAP_HAS_CACHE bit. A minimal model of
the resulting free condition is sketched below.

The idea of dropping SWAP_HAS_CACHE and using the swap table directly came
from Chris's idea of merging all swap metadata into one place.
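A minimal userspace model of the resulting rule (illustration only; struct
slot_state and the helper names below are invented for this sketch, not the
kernel's): with SWAP_HAS_CACHE gone, the swap map holds nothing but the
count, and a slot becomes free exactly when its count is zero and its swap
table slot holds no folio.

#include <stdio.h>
#include <stdbool.h>

struct slot_state {
	unsigned char count;	/* swap_map: mapping count only */
	bool has_folio;		/* swap table: folio present in swap cache */
};

/* A slot can be released only when both kinds of reference are gone. */
static bool slot_can_be_freed(const struct slot_state *s)
{
	return s->count == 0 && !s->has_folio;
}

/* Dropping the last map reference: free now, or wait for cache removal. */
static void put_map_ref(struct slot_state *s)
{
	if (s->count)
		s->count--;
	if (slot_can_be_freed(s))
		printf("slot freed on last unmap\n");
	else if (!s->count && s->has_folio)
		printf("count is 0, slot kept until swap cache removal\n");
}

/* Removing the folio from swap cache frees any slot with count == 0. */
static void remove_from_swap_cache(struct slot_state *s)
{
	s->has_folio = false;
	if (slot_can_be_freed(s))
		printf("slot freed on swap cache removal\n");
}

int main(void)
{
	struct slot_state s = { .count = 1, .has_folio = true };

	put_map_ref(&s);		/* count -> 0, still cached */
	remove_from_swap_cache(&s);	/* now actually freed */
	return 0;
}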
Suggested-by: Chris Li Signed-off-by: Kairui Song --- include/linux/swap.h | 1 - mm/swap.h | 13 ++-- mm/swap_state.c | 28 +++++---- mm/swapfile.c | 168 +++++++++++++++++------------------------------= ---- 4 files changed, 77 insertions(+), 133 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 4b4b81fbc6a3..dcb1760e36c3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -224,7 +224,6 @@ enum { #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX =20 /* Bit flag in swap_map */ -#define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ #define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count = */ =20 /* Special value in first swap_map */ diff --git a/mm/swap.h b/mm/swap.h index 3692e143eeba..b2d83e661132 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -205,6 +205,11 @@ int folio_alloc_swap(struct folio *folio); int folio_dup_swap(struct folio *folio, struct page *subpage); void folio_put_swap(struct folio *folio, struct page *subpage); =20 +/* For internal use */ +extern void swap_entries_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, unsigned int nr_pages); + /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; @@ -256,14 +261,6 @@ static inline bool folio_matches_swap_entry(const stru= ct folio *folio, return folio_entry.val =3D=3D round_down(entry.val, nr_pages); } =20 -/* Temporary internal helpers */ -void __swapcache_set_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry); -void __swapcache_clear_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr); - /* * All swap cache helpers below require the caller to ensure the swap entr= ies * used are valid and stablize the device by any of the following ways: diff --git a/mm/swap_state.c b/mm/swap_state.c index 6bf7556ca408..ed921cef222d 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -211,17 +211,6 @@ static int swap_cache_add_folio(struct folio *folio, s= wp_entry_t entry, shadow =3D swp_tb_to_shadow(old_tb); offset++; } while (++ci_off < ci_end); - - ci_off =3D ci_start; - offset =3D swp_offset(entry); - do { - /* - * Still need to pin the slots with SWAP_HAS_CACHE since - * swap allocator depends on that. 
- */ - __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); - offset++; - } while (++ci_off < ci_end); __swap_cache_add_folio(ci, folio, entry); swap_cluster_unlock(ci); if (shadowp) @@ -252,6 +241,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *c= i, struct folio *folio, struct swap_info_struct *si; unsigned long old_tb, new_tb; unsigned int ci_start, ci_off, ci_end; + bool folio_swapped =3D false, need_free =3D false; unsigned long nr_pages =3D folio_nr_pages(folio); =20 VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) !=3D ci); @@ -269,13 +259,27 @@ void __swap_cache_del_folio(struct swap_cluster_info = *ci, struct folio *folio, old_tb =3D __swap_table_xchg(ci, ci_off, new_tb); WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) !=3D folio); + if (__swap_count(swp_entry(si->type, + swp_offset(entry) + ci_off - ci_start))) + folio_swapped =3D true; + else + need_free =3D true; } while (++ci_off < ci_end); =20 folio->swap.val =3D 0; folio_clear_swapcache(folio); node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages); lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); - __swapcache_clear_cached(si, ci, entry, nr_pages); + + if (!folio_swapped) { + swap_entries_free(si, ci, swp_offset(entry), nr_pages); + } else if (need_free) { + do { + if (!__swap_count(entry)) + swap_entries_free(si, ci, swp_offset(entry), 1); + entry.val++; + } while (--nr_pages); + } } =20 /** diff --git a/mm/swapfile.c b/mm/swapfile.c index 979f0c562115..50ed2d7f5b85 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -48,21 +48,18 @@ #include #include "swap_table.h" #include "internal.h" +#include "swap_table.h" #include "swap.h" =20 static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); -static void swap_entries_free(struct swap_info_struct *si, - struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); static void swap_put_entry_locked(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned char usage); + unsigned long offset); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -149,11 +146,6 @@ static struct swap_info_struct *swap_entry_to_info(swp= _entry_t entry) return swap_type_to_info(swp_type(entry)); } =20 -static inline unsigned char swap_count(unsigned char ent) -{ - return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ -} - /* * Use the second highest bit of inuse_pages counter as the indicator * if one swap device is on the available plist, so the atomic can @@ -185,15 +177,20 @@ static long swap_usage_in_pages(struct swap_info_stru= ct *si) #define TTRS_FULL 0x4 =20 static bool swap_only_has_cache(struct swap_info_struct *si, - unsigned long offset, int nr_pages) + struct swap_cluster_info *ci, + unsigned long offset, int nr_pages) { + unsigned int ci_off =3D offset % SWAPFILE_CLUSTER; unsigned char *map =3D si->swap_map + offset; unsigned char *map_end =3D map + nr_pages; + unsigned long swp_tb; =20 do { - VM_BUG_ON(!(*map & SWAP_HAS_CACHE)); - if (*map !=3D SWAP_HAS_CACHE) + swp_tb =3D __swap_table_get(ci, ci_off); + VM_WARN_ON_ONCE(!swp_tb_is_folio(swp_tb)); + if (*map) return false; + ++ci_off; } while (++map < 
map_end); =20 return true; @@ -248,12 +245,12 @@ static int __try_to_reclaim_swap(struct swap_info_str= uct *si, goto out_unlock; =20 /* - * It's safe to delete the folio from swap cache only if the folio's - * swap_map is HAS_CACHE only, which means the slots have no page table + * It's safe to delete the folio from swap cache only if the folio + * is in swap cache with swap count =3D=3D 0. The slots have no page table * reference or pending writeback, and can't be allocated to others. */ ci =3D swap_cluster_lock(si, offset); - need_reclaim =3D swap_only_has_cache(si, offset, nr_pages); + need_reclaim =3D swap_only_has_cache(si, ci, offset, nr_pages); swap_cluster_unlock(ci); if (!need_reclaim) goto out_unlock; @@ -779,7 +776,7 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, =20 spin_unlock(&ci->lock); do { - if (swap_count(READ_ONCE(map[offset]))) + if (READ_ONCE(map[offset])) break; swp_tb =3D swap_table_get(ci, offset % SWAPFILE_CLUSTER); if (swp_tb_is_folio(swp_tb)) { @@ -809,7 +806,7 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, */ for (offset =3D start; offset < end; offset++) { swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); - if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb)) + if (map[offset] || !swp_tb_is_null(swp_tb)) return false; } =20 @@ -829,11 +826,10 @@ static bool cluster_scan_range(struct swap_info_struc= t *si, return true; =20 do { - if (swap_count(map[offset])) + if (map[offset]) return false; swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); if (swp_tb_is_folio(swp_tb)) { - WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE)); if (!vm_swap_full()) return false; *need_reclaim =3D true; @@ -891,11 +887,6 @@ static bool cluster_alloc_range(struct swap_info_struc= t *si, if (likely(folio)) { order =3D folio_order(folio); nr_pages =3D 1 << order; - /* - * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries. - * This is the legacy allocation behavior, will drop it very soon. - */ - memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset)); } else if (IS_ENABLED(CONFIG_HIBERNATION)) { order =3D 0; @@ -1012,8 +1003,8 @@ static void swap_reclaim_full_clusters(struct swap_in= fo_struct *si, bool force) to_scan--; =20 while (offset < end) { - if (!swap_count(READ_ONCE(map[offset])) && - swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { + if (!READ_ONCE(map[offset]) && + swp_tb_is_folio(swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); @@ -1115,7 +1106,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, * Scan only one fragment cluster is good enough. Order 0 * allocation will surely success, and large allocation * failure is not critical. Scanning one cluster still - * keeps the list rotated and reclaimed (for HAS_CACHE). + * keeps the list rotated and reclaimed (for clean swap cache). */ found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], folio, fal= se); if (found) @@ -1450,8 +1441,8 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, do { swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); count =3D si->swap_map[offset]; - VM_WARN_ON(swap_count(count) < 1 || count =3D=3D SWAP_MAP_BAD); - if (swap_count(count) =3D=3D 1) { + VM_WARN_ON(count < 1 || count =3D=3D SWAP_MAP_BAD); + if (count =3D=3D 1) { /* count =3D=3D 1 and non-cached slots will be batch freed. 
*/ if (!swp_tb_is_folio(swp_tb)) { if (!batch_start) @@ -1459,7 +1450,6 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, continue; } /* count will be 0 after put, slot can be reclaimed */ - VM_WARN_ON(!(count & SWAP_HAS_CACHE)); need_reclaim =3D true; } /* @@ -1468,7 +1458,7 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, * slots will be freed when folio is removed from swap cache * (__swap_cache_del_folio). */ - swap_put_entry_locked(si, ci, offset, 1); + swap_put_entry_locked(si, ci, offset); if (batch_start) { swap_entries_free(si, ci, batch_start, offset - batch_start); batch_start =3D SWAP_ENTRY_INVALID; @@ -1625,7 +1615,8 @@ static struct swap_info_struct *_swap_info_get(swp_en= try_t entry) offset =3D swp_offset(entry); if (offset >=3D si->max) goto bad_offset; - if (data_race(!si->swap_map[swp_offset(entry)])) + if (data_race(!si->swap_map[swp_offset(entry)]) && + !swap_cache_has_folio(entry)) goto bad_free; return si; =20 @@ -1646,21 +1637,12 @@ static struct swap_info_struct *_swap_info_get(swp_= entry_t entry) =20 static void swap_put_entry_locked(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned char usage) + unsigned long offset) { unsigned char count; - unsigned char has_cache; =20 count =3D si->swap_map[offset]; - - has_cache =3D count & SWAP_HAS_CACHE; - count &=3D ~SWAP_HAS_CACHE; - - if (usage =3D=3D SWAP_HAS_CACHE) { - VM_BUG_ON(!has_cache); - has_cache =3D 0; - } else if ((count & ~COUNT_CONTINUED) <=3D SWAP_MAP_MAX) { + if ((count & ~COUNT_CONTINUED) <=3D SWAP_MAP_MAX) { if (count =3D=3D COUNT_CONTINUED) { if (swap_count_continued(si, offset, count)) count =3D SWAP_MAP_MAX | COUNT_CONTINUED; @@ -1670,10 +1652,8 @@ static void swap_put_entry_locked(struct swap_info_s= truct *si, count--; } =20 - usage =3D count | has_cache; - if (usage) - WRITE_ONCE(si->swap_map[offset], usage); - else + WRITE_ONCE(si->swap_map[offset], count); + if (!count && !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLU= STER))) swap_entries_free(si, ci, offset, 1); } =20 @@ -1742,21 +1722,13 @@ struct swap_info_struct *get_swap_device(swp_entry_= t entry) return NULL; } =20 -/* - * Check if it's the last ref of swap entry in the freeing path. - */ -static inline bool __maybe_unused swap_is_last_ref(unsigned char count) -{ - return (count =3D=3D SWAP_HAS_CACHE) || (count =3D=3D 1); -} - /* * Drop the last ref of swap entries, caller have to ensure all entries * belong to the same cgroup and cluster. 
*/ -static void swap_entries_free(struct swap_info_struct *si, - struct swap_cluster_info *ci, - unsigned long offset, unsigned int nr_pages) +void swap_entries_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, unsigned int nr_pages) { swp_entry_t entry =3D swp_entry(si->type, offset); unsigned char *map =3D si->swap_map + offset; @@ -1769,7 +1741,7 @@ static void swap_entries_free(struct swap_info_struct= *si, =20 ci->count -=3D nr_pages; do { - VM_BUG_ON(!swap_is_last_ref(*map)); + VM_WARN_ON(*map > 1); *map =3D 0; } while (++map < map_end); =20 @@ -1788,7 +1760,7 @@ int __swap_count(swp_entry_t entry) struct swap_info_struct *si =3D __swap_entry_to_info(entry); pgoff_t offset =3D swp_offset(entry); =20 - return swap_count(si->swap_map[offset]); + return si->swap_map[offset]; } =20 /** @@ -1802,7 +1774,7 @@ bool swap_entry_swapped(struct swap_info_struct *si, = unsigned long offset) int count; =20 ci =3D swap_cluster_lock(si, offset); - count =3D swap_count(si->swap_map[offset]); + count =3D si->swap_map[offset]; swap_cluster_unlock(ci); =20 return count && count !=3D SWAP_MAP_BAD; @@ -1829,7 +1801,7 @@ int swp_swapcount(swp_entry_t entry) =20 ci =3D swap_cluster_lock(si, offset); =20 - count =3D swap_count(si->swap_map[offset]); + count =3D si->swap_map[offset]; if (!(count & COUNT_CONTINUED)) goto out; =20 @@ -1867,12 +1839,12 @@ static bool swap_page_trans_huge_swapped(struct swa= p_info_struct *si, =20 ci =3D swap_cluster_lock(si, offset); if (nr_pages =3D=3D 1) { - if (swap_count(map[roffset])) + if (map[roffset]) ret =3D true; goto unlock_out; } for (i =3D 0; i < nr_pages; i++) { - if (swap_count(map[offset + i])) { + if (map[offset + i]) { ret =3D true; break; } @@ -2026,7 +1998,7 @@ void swap_free_hibernation_slot(swp_entry_t entry) return; =20 ci =3D swap_cluster_lock(si, offset); - swap_put_entry_locked(si, ci, offset, 1); + swap_put_entry_locked(si, ci, offset); WARN_ON(swap_entry_swapped(si, offset)); swap_cluster_unlock(ci); =20 @@ -2432,6 +2404,7 @@ static unsigned int find_next_to_unuse(struct swap_in= fo_struct *si, unsigned int prev) { unsigned int i; + unsigned long swp_tb; unsigned char count; =20 /* @@ -2442,7 +2415,11 @@ static unsigned int find_next_to_unuse(struct swap_i= nfo_struct *si, */ for (i =3D prev + 1; i < si->max; i++) { count =3D READ_ONCE(si->swap_map[i]); - if (count && swap_count(count) !=3D SWAP_MAP_BAD) + swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, i), + i % SWAPFILE_CLUSTER); + if (count =3D=3D SWAP_MAP_BAD) + continue; + if (count || swp_tb_is_folio(swp_tb)) break; if ((i % LATENCY_LIMIT) =3D=3D 0) cond_resched(); @@ -3667,8 +3644,7 @@ void si_swapinfo(struct sysinfo *val) * Returns error code in following case. * - success -> 0 * - swp_entry is invalid -> EINVAL - * - swap-cache reference is requested but there is already one. -> EEXIST - * - swap-cache reference is requested but the entry is not used. -> ENOENT + * - swap-mapped reference is requested but the entry is not used. -> ENOE= NT * - swap-mapped reference requested but needs continued swap count. -> EN= OMEM */ static int swap_dup_entries(struct swap_info_struct *si, @@ -3677,39 +3653,28 @@ static int swap_dup_entries(struct swap_info_struct= *si, unsigned char usage, int nr) { int i; - unsigned char count, has_cache; + unsigned char count; =20 for (i =3D 0; i < nr; i++) { count =3D si->swap_map[offset + i]; - /* * Allocator never allocates bad slots, and readahead is guarded * by swap_entry_swapped. 
*/ - if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) + if (WARN_ON(count =3D=3D SWAP_MAP_BAD)) return -ENOENT; - - has_cache =3D count & SWAP_HAS_CACHE; - count &=3D ~SWAP_HAS_CACHE; - - if (!count && !has_cache) { + /* + * Swap count duplication must be guarded by either locked swap cache + * folio (from folio_dup_swap) or external lock (from swap_dup_entry_dir= ect). + */ + if (WARN_ON(!count && + !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER)))) return -ENOENT; - } else if (usage =3D=3D SWAP_HAS_CACHE) { - if (has_cache) - return -EEXIST; - } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { - return -EINVAL; - } } =20 for (i =3D 0; i < nr; i++) { count =3D si->swap_map[offset + i]; - has_cache =3D count & SWAP_HAS_CACHE; - count &=3D ~SWAP_HAS_CACHE; - - if (usage =3D=3D SWAP_HAS_CACHE) - has_cache =3D SWAP_HAS_CACHE; - else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) count +=3D usage; else if (swap_count_continued(si, offset + i, count)) count =3D COUNT_CONTINUED; @@ -3721,7 +3686,7 @@ static int swap_dup_entries(struct swap_info_struct *= si, return -ENOMEM; } =20 - WRITE_ONCE(si->swap_map[offset + i], count | has_cache); + WRITE_ONCE(si->swap_map[offset + i], count); } =20 return 0; @@ -3767,27 +3732,6 @@ int swap_dup_entry_direct(swp_entry_t entry) return err; } =20 -/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */ -void __swapcache_set_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry) -{ - WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1)); -} - -/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock = */ -void __swapcache_clear_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr) -{ - if (swap_only_has_cache(si, swp_offset(entry), nr)) { - swap_entries_free(si, ci, swp_offset(entry), nr); - } else { - for (int i =3D 0; i < nr; i++, entry.val++) - swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE); - } -} - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entr= y's @@ -3833,7 +3777,7 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) =20 ci =3D swap_cluster_lock(si, offset); =20 - count =3D swap_count(si->swap_map[offset]); + count =3D si->swap_map[offset]; =20 if ((count & ~COUNT_CONTINUED) !=3D SWAP_MAP_MAX) { /* --=20 2.52.0 From nobody Tue Dec 16 05:36:42 2025 Received: from mail-pj1-f49.google.com (mail-pj1-f49.google.com [209.85.216.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F20AC34EF12 for ; Thu, 4 Dec 2025 19:31:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.49 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876668; cv=none; b=FP/HIcBw+BeyNt19DNg7vWEnWH7UqdsPZX9yx54vgylM1HbwSJfTUF3N0JIv98YPu4I2TJeVeyegTFP19/vZF3rhkqMxNzC+lKDkvxDgUxP0AXPUbEJHrN+82lR0aJWzf9he2aWNTCNXfiL2XDTxIG4HXUGtrxuQaVF1zHyDS68= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764876668; c=relaxed/simple; bh=ZNrDIOvCVsDS0380qZ6vhztkRDw4nDreckVx6O6xHyo=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; 
From: Kairui Song
Date: Fri, 05 Dec 2025 03:29:27 +0800
Subject: [PATCH v4 19/19] mm, swap: remove no longer needed _swap_info_get
Message-Id: <20251205-swap-table-p2-v4-19-cb7e28a26a40@tencent.com>
References: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
In-Reply-To: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

After consolidating its callers, _swap_info_get has only two users left:
folio_free_swap and swp_swapcount.

folio_free_swap already holds the folio lock and the folio must be in the
swap cache, so _swap_info_get is redundant there. swp_swapcount should use
get_swap_device instead, which takes a reference on the device and is
therefore a bit safer; its only current caller is the smaps walk, so the
performance difference is tiny.

With those two converted, _swap_info_get is no longer used and can be
safely removed.
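The swp_swapcount change follows the usual get/put pattern: pin the device
for the duration of the lookup instead of relying on a raw pointer check.
A standalone toy model of that pattern (struct swap_device, si_tryget and
si_put are invented names for this sketch, not kernel API):

#include <stdio.h>
#include <stdbool.h>

struct swap_device {
	int refcount;
	bool online;
};

/* Pin the device across the lookup; the caller must call si_put(). */
static struct swap_device *si_tryget(struct swap_device *si)
{
	if (!si || !si->online)
		return NULL;
	si->refcount++;	/* the kernel uses percpu refs/RCU, not a plain int */
	return si;
}

static void si_put(struct swap_device *si)
{
	si->refcount--;
}

/* Read some per-entry state while holding a reference on the device. */
static int read_swap_count(struct swap_device *si)
{
	int count;

	si = si_tryget(si);
	if (!si)
		return 0;
	count = 3;	/* stand-in for walking swap_map / count continuations */
	si_put(si);	/* only now may the device go away (e.g. swapoff) */
	return count;
}

int main(void)
{
	struct swap_device dev = { .refcount = 0, .online = true };

	printf("swap count = %d\n", read_swap_count(&dev));
	return 0;
}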
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 47 ++++++----------------------------------------- 1 file changed, 6 insertions(+), 41 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 50ed2d7f5b85..ce2a34858fa1 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -83,9 +83,7 @@ bool swap_migration_ad_supported; #endif /* CONFIG_MIGRATION */ =20 static const char Bad_file[] =3D "Bad swap file entry "; -static const char Unused_file[] =3D "Unused swap file entry "; static const char Bad_offset[] =3D "Bad swap offset entry "; -static const char Unused_offset[] =3D "Unused swap offset entry "; =20 /* * all active swap_info_structs @@ -1600,41 +1598,6 @@ void folio_put_swap(struct folio *folio, struct page= *subpage) swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false); } =20 -static struct swap_info_struct *_swap_info_get(swp_entry_t entry) -{ - struct swap_info_struct *si; - unsigned long offset; - - if (!entry.val) - goto out; - si =3D swap_entry_to_info(entry); - if (!si) - goto bad_nofile; - if (data_race(!(si->flags & SWP_USED))) - goto bad_device; - offset =3D swp_offset(entry); - if (offset >=3D si->max) - goto bad_offset; - if (data_race(!si->swap_map[swp_offset(entry)]) && - !swap_cache_has_folio(entry)) - goto bad_free; - return si; - -bad_free: - pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val); - goto out; -bad_offset: - pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); - goto out; -bad_device: - pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val); - goto out; -bad_nofile: - pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val); -out: - return NULL; -} - static void swap_put_entry_locked(struct swap_info_struct *si, struct swap_cluster_info *ci, unsigned long offset) @@ -1793,7 +1756,7 @@ int swp_swapcount(swp_entry_t entry) pgoff_t offset; unsigned char *map; =20 - si =3D _swap_info_get(entry); + si =3D get_swap_device(entry); if (!si) return 0; =20 @@ -1823,6 +1786,7 @@ int swp_swapcount(swp_entry_t entry) } while (tmp_count & COUNT_CONTINUED); out: swap_cluster_unlock(ci); + put_swap_device(si); return count; } =20 @@ -1857,11 +1821,12 @@ static bool swap_page_trans_huge_swapped(struct swa= p_info_struct *si, static bool folio_swapped(struct folio *folio) { swp_entry_t entry =3D folio->swap; - struct swap_info_struct *si =3D _swap_info_get(entry); + struct swap_info_struct *si; =20 - if (!si) - return false; + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); =20 + si =3D __swap_entry_to_info(entry); if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio))) return swap_entry_swapped(si, swp_offset(entry)); =20 --=20 2.52.0