From nobody Sun Feb 8 17:03:32 2026
From: Kairui Song
Date: Wed, 29 Oct 2025 23:58:27 +0800
Subject: [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio
X-Mailing-List: linux-kernel@vger.kernel.org
Message-Id: <20251029-swap-table-p2-v1-1-3d43f3b6ec32@tencent.com>
References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
 Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
 Hugh Dickins, Baolin Wang, "Huang, Ying", Kemeng Shi, Lorenzo Stoakes,
 "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song
X-Mailer: b4 0.14.3

From: Kairui Song

__read_swap_cache_async is widely used to allocate a folio and ensure it
is in the swap cache, or to return the existing folio if one is already
there. It is not async, and it does not do any read. Rename it to better
reflect its usage, and prepare it to be reworked as part of the new swap
cache APIs.

Also, add some comments for the function. Note that the skip_if_exists
argument is a long-standing workaround that will be dropped soon.

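Illustration only, not part of the patch: a minimal userspace sketch of
the "return the cached folio, or allocate one and report that we did"
contract described above. Every name here (toy_folio, toy_cache,
toy_swap_count, toy_cache_alloc) is a hypothetical stand-in rather than
a kernel API, and locking, memcg charging and the skip_if_exists
workaround are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
#define NR_SLOTS 8

struct toy_folio {
	size_t slot;
};

static struct toy_folio *toy_cache[NR_SLOTS];	/* stand-in for the swap cache */
static unsigned char toy_swap_count[NR_SLOTS];	/* stand-in for the swap count */

/*
 * Return the cached object for @slot if one exists. Otherwise allocate
 * one, insert it into the cache and set *alloced. Returns NULL if @slot
 * is out of range, unused (count of zero) or the allocation fails.
 * *alloced is true only when this call performed the allocation.
 */
static struct toy_folio *toy_cache_alloc(size_t slot, bool *alloced)
{
	struct toy_folio *folio;

	assert(alloced != NULL);
	*alloced = false;

	if (slot >= NR_SLOTS)
		return NULL;

	/* Already cached: hand back the existing object. */
	if (toy_cache[slot])
		return toy_cache[slot];

	/* Nothing is swapped out in this slot, so nothing to allocate. */
	if (!toy_swap_count[slot])
		return NULL;

	folio = malloc(sizeof(*folio));
	if (!folio)
		return NULL;

	folio->slot = slot;
	toy_cache[slot] = folio;
	*alloced = true;
	return folio;
}
```

A caller in this sketch would start IO only when *alloced is true, which
mirrors how the callers in the diff below test page_allocated before
calling swap_read_folio().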
Signed-off-by: Kairui Song
Reviewed-by: Yosry Ahmed
Suggested-by: Chris Li
---
 mm/swap.h       |  6 +++---
 mm/swap_state.c | 49 ++++++++++++++++++++++++++++++++-----------------
 mm/swapfile.c   |  2 +-
 mm/zswap.c      |  4 ++--
 4 files changed, 38 insertions(+), 23 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index d034c13d8dd2..0fff92e42cfe 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -249,6 +249,9 @@ struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
 void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
 void swap_cache_del_folio(struct folio *folio);
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
+				     struct mempolicy *mpol, pgoff_t ilx,
+				     bool *alloced, bool skip_if_exists);
 /* Below helpers require the caller to lock and pass in the swap cluster. */
 void __swap_cache_del_folio(struct swap_cluster_info *ci,
 			    struct folio *folio, swp_entry_t entry, void *shadow);
@@ -261,9 +264,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug);
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
-		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
-		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b13e9c4baa90..7765b9474632 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,9 +402,28 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 	}
 }
 
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
-		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
-		bool skip_if_exists)
+/**
+ * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
+ * @entry: the swapped out swap entry to be binded to the folio.
+ * @gfp_mask: memory allocation flags
+ * @mpol: NUMA memory allocation policy to be applied
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @new_page_allocated: sets true if allocation happened, false otherwise
+ * @skip_if_exists: if the slot is a partially cached state, return NULL.
+ *                  This is a workaround that would be removed shortly.
+ *
+ * Allocate a folio in the swap cache for one swap slot, typically before
+ * doing IO (swap in or swap out). The swap slot indicated by @entry must
+ * have a non-zero swap count (swapped out). Currently only supports order 0.
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the existing folio if @entry is cached already. Returns
+ * NULL if failed due to -ENOMEM or @entry have a swap count < 1.
+ */
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
+				     struct mempolicy *mpol, pgoff_t ilx,
+				     bool *new_page_allocated,
+				     bool skip_if_exists)
 {
 	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct folio *folio;
@@ -452,12 +471,12 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			goto put_and_return;
 
 		/*
-		 * Protect against a recursive call to __read_swap_cache_async()
+		 * Protect against a recursive call to swap_cache_alloc_folio()
 		 * on the same entry waiting forever here because SWAP_HAS_CACHE
 		 * is set but the folio is not the swap cache yet. This can
 		 * happen today if mem_cgroup_swapin_charge_folio() below
 		 * triggers reclaim through zswap, which may call
-		 * __read_swap_cache_async() in the writeback path.
+		 * swap_cache_alloc_folio() in the writeback path.
 		 */
 		if (skip_if_exists)
 			goto put_and_return;
@@ -466,7 +485,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * We might race against __swap_cache_del_folio(), and
 		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
 		 * has not yet been cleared. Or race against another
-		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
+		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
 		 * in swap_map, but not yet added its folio to swap cache.
 		 */
 		schedule_timeout_uninterruptible(1);
@@ -509,10 +528,6 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
  * and reading the disk if it is not already cached.
  * A failure return means that either the page allocation failed or that
  * the swap entry is no longer in use.
- *
- * get/put_swap_device() aren't needed to call this function, because
- * __read_swap_cache_async() call them and swap_read_folio() holds the
- * swap cache folio lock.
  */
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
@@ -529,7 +544,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		return NULL;
 
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
-	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
 	mpol_cond_put(mpol);
 
@@ -647,9 +662,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
-		folio = __read_swap_cache_async(
-				swp_entry(swp_type(entry), offset),
-				gfp_mask, mpol, ilx, &page_allocated, false);
+		folio = swap_cache_alloc_folio(
+			swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
+			&page_allocated, false);
 		if (!folio)
 			continue;
 		if (page_allocated) {
@@ -666,7 +681,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
 	/* The page was likely read above, so no need for plugging here */
-	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
 	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
@@ -761,7 +776,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 			continue;
 		pte_unmap(pte);
 		pte = NULL;
-		folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+		folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
 						&page_allocated, false);
 		if (!folio)
 			continue;
@@ -781,7 +796,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	lru_add_drain();
 skip:
 	/* The folio was likely read above, so no need for plugging here */
-	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
+	folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
 					&page_allocated, false);
 	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c35bb8593f50..849be32377d9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1573,7 +1573,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
 	 * CPU1				CPU2
 	 * do_swap_page()
 	 *   ...				swapoff+swapon
-	 *   __read_swap_cache_async()
+	 *   swap_cache_alloc_folio()
 	 *     swapcache_prepare()
 	 *       __swap_duplicate()
 	 *         // check swap_map
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958..a7a2443912f4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1014,8 +1014,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 		return -EEXIST;
 
 	mpol = get_task_policy(current);
-	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
-			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+	folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
+			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
 	put_swap_device(si);
 	if (!folio)
 		return -ENOMEM;
-- 
2.51.1

From nobody Sun Feb 8 17:03:32 2026
From: Kairui Song
Date: Wed, 29 Oct 2025 23:58:28 +0800
Subject: [PATCH 02/19] mm, swap: split swap cache preparation loop into a standalone helper
X-Mailing-List: linux-kernel@vger.kernel.org
Message-Id: <20251029-swap-table-p2-v1-2-3d43f3b6ec32@tencent.com>
References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
 Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
 Hugh Dickins, Baolin Wang, "Huang, Ying", Kemeng Shi, Lorenzo Stoakes,
 "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song
X-Mailer: b4 0.14.3

From:
Kairui Song

To prepare for the removal of swap cache bypass swapin, introduce a new
helper that accepts an allocated and charged fresh folio, prepares the
folio and the swap map, and then adds the folio to the swap cache.

This doesn't change how the swap cache works yet; we still depend on
SWAP_HAS_CACHE in the swap map for synchronization. But all the
synchronization hacks now live in this single helper.

No feature change.

Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/swap_state.c | 197 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 109 insertions(+), 88 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7765b9474632..d18ca765c04f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,6 +402,97 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 	}
 }
 
+/**
+ * __swap_cache_prepare_and_add - Prepare the folio and add it to swap cache.
+ * @entry: swap entry to be bound to the folio.
+ * @folio: folio to be added.
+ * @gfp: memory allocation flags for charge, can be 0 if @charged if true.
+ * @charged: if the folio is already charged.
+ * @skip_if_exists: if the slot is in a cached state, return NULL.
+ *                  This is an old workaround that will be removed shortly.
+ *
+ * Update the swap_map and add folio as swap cache, typically before swapin.
+ * All swap slots covered by the folio must have a non-zero swap count.
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the folio being added on success. Returns the existing
+ * folio if @entry is cached. Returns NULL if raced with swapin or swapoff.
+ */
+static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
+						  struct folio *folio,
+						  gfp_t gfp, bool charged,
+						  bool skip_if_exists)
+{
+	struct folio *swapcache;
+	void *shadow;
+	int ret;
+
+	/*
+	 * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio
+	 * into the swap cache. Loop with a schedule delay if raced with
+	 * another process setting SWAP_HAS_CACHE. This hackish loop will
+	 * be fixed very soon.
+	 */
+	for (;;) {
+		ret = swapcache_prepare(entry, folio_nr_pages(folio));
+		if (!ret)
+			break;
+
+		/*
+		 * The skip_if_exists is for protecting against a recursive
+		 * call to this helper on the same entry waiting forever
+		 * here because SWAP_HAS_CACHE is set but the folio is not
+		 * in the swap cache yet. This can happen today if
+		 * mem_cgroup_swapin_charge_folio() below triggers reclaim
+		 * through zswap, which may call this helper again in the
+		 * writeback path.
+		 *
+		 * Large order allocation also needs special handling on
+		 * race: if a smaller folio exists in cache, swapin needs
+		 * to fallback to order 0, and doing a swap cache lookup
+		 * might return a folio that is irrelevant to the faulting
+		 * entry because @entry is aligned down. Just return NULL.
+		 */
+		if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
+			return NULL;
+
+		/*
+		 * Check the swap cache again, we can only arrive
+		 * here because swapcache_prepare returns -EEXIST.
+		 */
+		swapcache = swap_cache_get_folio(entry);
+		if (swapcache)
+			return swapcache;
+
+		/*
+		 * We might race against __swap_cache_del_folio(), and
+		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
+		 * has not yet been cleared. Or race against another
+		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
+		 * in swap_map, but not yet added its folio to swap cache.
+		 */
+		schedule_timeout_uninterruptible(1);
+	}
+
+	__folio_set_locked(folio);
+	__folio_set_swapbacked(folio);
+
+	if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
+		put_swap_folio(folio, entry);
+		folio_unlock(folio);
+		return NULL;
+	}
+
+	swap_cache_add_folio(folio, entry, &shadow);
+	memcg1_swapin(entry, folio_nr_pages(folio));
+	if (shadow)
+		workingset_refault(folio, shadow);
+
+	/* Caller will initiate read into locked folio */
+	folio_add_lru(folio);
+	return folio;
+}
+
 /**
  * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
  * @entry: the swapped out swap entry to be binded to the folio.
@@ -427,99 +518,29 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
 {
 	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct folio *folio;
-	struct folio *new_folio = NULL;
 	struct folio *result = NULL;
-	void *shadow = NULL;
 
 	*new_page_allocated = false;
-	for (;;) {
-		int err;
-
-		/*
-		 * Check the swap cache first, if a cached folio is found,
-		 * return it unlocked. The caller will lock and check it.
-		 */
-		folio = swap_cache_get_folio(entry);
-		if (folio)
-			goto got_folio;
-
-		/*
-		 * Just skip read ahead for unused swap slot.
-		 */
-		if (!swap_entry_swapped(si, entry))
-			goto put_and_return;
-
-		/*
-		 * Get a new folio to read into from swap. Allocate it now if
-		 * new_folio not exist, before marking swap_map SWAP_HAS_CACHE,
-		 * when -EEXIST will cause any racers to loop around until we
-		 * add it to cache.
-		 */
-		if (!new_folio) {
-			new_folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
-			if (!new_folio)
-				goto put_and_return;
-		}
-
-		/*
-		 * Swap entry may have been freed since our caller observed it.
-		 */
-		err = swapcache_prepare(entry, 1);
-		if (!err)
-			break;
-		else if (err != -EEXIST)
-			goto put_and_return;
-
-		/*
-		 * Protect against a recursive call to swap_cache_alloc_folio()
-		 * on the same entry waiting forever here because SWAP_HAS_CACHE
-		 * is set but the folio is not the swap cache yet. This can
-		 * happen today if mem_cgroup_swapin_charge_folio() below
-		 * triggers reclaim through zswap, which may call
-		 * swap_cache_alloc_folio() in the writeback path.
-		 */
-		if (skip_if_exists)
-			goto put_and_return;
+	/* Check the swap cache again for readahead path. */
+	folio = swap_cache_get_folio(entry);
+	if (folio)
+		return folio;
 
-		/*
-		 * We might race against __swap_cache_del_folio(), and
-		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
-		 * has not yet been cleared. Or race against another
-		 * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
-		 * in swap_map, but not yet added its folio to swap cache.
-		 */
-		schedule_timeout_uninterruptible(1);
-	}
-
-	/*
-	 * The swap entry is ours to swap in. Prepare the new folio.
-	 */
-	__folio_set_locked(new_folio);
-	__folio_set_swapbacked(new_folio);
-
-	if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
-		goto fail_unlock;
-
-	swap_cache_add_folio(new_folio, entry, &shadow);
-	memcg1_swapin(entry, 1);
+	/* Skip allocation for unused swap slot for readahead path. */
+	if (!swap_entry_swapped(si, entry))
+		return NULL;
 
-	if (shadow)
-		workingset_refault(new_folio, shadow);
-
-	/* Caller will initiate read into locked new_folio */
-	folio_add_lru(new_folio);
-	*new_page_allocated = true;
-	folio = new_folio;
-got_folio:
-	result = folio;
-	goto put_and_return;
-
-fail_unlock:
-	put_swap_folio(new_folio, entry);
-	folio_unlock(new_folio);
-put_and_return:
-	if (!(*new_page_allocated) && new_folio)
-		folio_put(new_folio);
+	/* Allocate a new folio to be added into the swap cache. */
+	folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+	if (!folio)
+		return NULL;
+	/* Try add the new folio, returns existing folio or NULL on failure. */
+	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
+					      false, skip_if_exists);
+	if (result == folio)
+		*new_page_allocated = true;
+	else
+		folio_put(folio);
 	return result;
 }
 
-- 
2.51.1

From nobody Sun Feb 8 17:03:32 2026
From: Kairui Song
Date: Wed, 29 Oct 2025 23:58:29 +0800
Subject: [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
X-Mailing-List: linux-kernel@vger.kernel.org
Message-Id: <20251029-swap-table-p2-v1-3-3d43f3b6ec32@tencent.com>
References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
 Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
 Hugh Dickins, Baolin Wang, "Huang, Ying", Kemeng Shi, Lorenzo Stoakes,
 "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song
X-Mailer: b4 0.14.3

From: Kairui Song

Now that the overhead of the swap cache is trivial, bypassing the swap
cache is no longer a valid optimization. So unify the swapin path using
the swap cache.

This changes the swapin behavior in multiple ways:

We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
the indicator to bypass both the swap cache and readahead. The swap
count check is not a good indicator for readahead. It existed because
the previous swap design made readahead strictly coupled with swap
cache bypassing. We actually want to always bypass readahead for
SWP_SYNCHRONOUS_IO devices even if the swap count is > 1, but bypassing
the swap cache in that case would cause redundant IO.

That limitation is now gone: with the newly introduced helpers and
design, we always use the swap cache, so this check can be simplified
to test SWP_SYNCHRONOUS_IO only. This effectively disables readahead
for all SWP_SYNCHRONOUS_IO cases, which is a huge win for many
workloads.

The second change is that this enables large swapin for all swap
entries on SWP_SYNCHRONOUS_IO devices. Previously, large swapin was
also coupled with swap cache bypassing, so the count check side effect
also made large swapin less effective. This is now fixed as well: large
swapin is always supported for all SWP_SYNCHRONOUS_IO cases.

To catch potential issues with large swapin, especially around page
exclusiveness and the swap cache, more debug sanity checks and comments
are added. Overall, though, the code is simpler. The new helper and
routines will also be used by other components in later commits. It is
now also possible to rely on the swap cache layer to resolve
synchronization issues, which a later commit will do.

Note that for a large folio workload, this may cause more serious
thrashing. That is not a problem with this commit, but a generic large
folio issue. For a 4K workload, this commit improves performance.

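Illustration only, not part of the patch: the change to the
readahead-skipping condition described above can be sketched as two
small predicates. TOY_SYNCHRONOUS_IO and both function names are
hypothetical stand-ins (the real code tests si->flags and
__swap_count(entry) inside do_swap_page()), so treat this as a reading
aid under those assumptions rather than the actual implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the SWP_SYNCHRONOUS_IO device flag. */
#define TOY_SYNCHRONOUS_IO 0x1u

/*
 * Old behavior: readahead (and the swap cache) was skipped only for a
 * synchronous device AND an entry with a swap count of exactly 1.
 */
static bool toy_old_skips_readahead(unsigned int flags, int swap_count)
{
	return (flags & TOY_SYNCHRONOUS_IO) && swap_count == 1;
}

/*
 * New behavior: readahead is skipped for every entry on a synchronous
 * device, whatever its swap count; the swap cache is always used.
 */
static bool toy_new_skips_readahead(unsigned int flags, int swap_count)
{
	(void)swap_count;	/* the count no longer affects the decision */
	return (flags & TOY_SYNCHRONOUS_IO) != 0;
}
```

The only input for which the two predicates differ is a synchronous
device with a swap count above 1, which is the case the message above
says previously fell back to readahead.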
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 136 +++++++++++++++++++++-------------------------------= ---- mm/swap.h | 6 +++ mm/swap_state.c | 27 +++++++++++ 3 files changed, 84 insertions(+), 85 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..9a43d4811781 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4613,7 +4613,15 @@ static struct folio *alloc_swap_folio(struct vm_faul= t *vmf) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ =20 -static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq); +/* Sanity check that a folio is fully exclusive */ +static void check_swap_exclusive(struct folio *folio, swp_entry_t entry, + unsigned int nr_pages) +{ + do { + VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) !=3D 1, folio); + entry.val++; + } while (--nr_pages); +} =20 /* * We enter with non-exclusive mmap_lock (to exclude vma changes, @@ -4626,17 +4634,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq); vm_fault_t do_swap_page(struct vm_fault *vmf) { struct vm_area_struct *vma =3D vmf->vma; - struct folio *swapcache, *folio =3D NULL; - DECLARE_WAITQUEUE(wait, current); + struct folio *swapcache =3D NULL, *folio; struct page *page; struct swap_info_struct *si =3D NULL; rmap_t rmap_flags =3D RMAP_NONE; - bool need_clear_cache =3D false; bool exclusive =3D false; swp_entry_t entry; pte_t pte; vm_fault_t ret =3D 0; - void *shadow =3D NULL; int nr_pages; unsigned long page_idx; unsigned long address; @@ -4707,57 +4712,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio =3D swap_cache_get_folio(entry); if (folio) swap_update_readahead(folio, vma, vmf->address); - swapcache =3D folio; - if (!folio) { - if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && - __swap_count(entry) =3D=3D 1) { - /* skip swapcache */ + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { folio =3D alloc_swap_folio(vmf); if (folio) { - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - - nr_pages =3D folio_nr_pages(folio); - if (folio_test_large(folio)) - entry.val =3D 
ALIGN_DOWN(entry.val, nr_pages); /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread - * may finish swapin first, free the entry, and - * swapout reusing the same entry. It's - * undetectable as pte_same() returns true due - * to entry reuse. + * folio is charged, so swapin can only fail due + * to raced swapin and return NULL. */ - if (swapcache_prepare(entry, nr_pages)) { - /* - * Relax a bit to prevent rapid - * repeated page faults. - */ - add_wait_queue(&swapcache_wq, &wait); - schedule_timeout_uninterruptible(1); - remove_wait_queue(&swapcache_wq, &wait); - goto out_page; - } - need_clear_cache =3D true; - - memcg1_swapin(entry, nr_pages); - - shadow =3D swap_cache_get_shadow(entry); - if (shadow) - workingset_refault(folio, shadow); - - folio_add_lru(folio); - - /* To provide entry to swap_read_folio() */ - folio->swap =3D entry; - swap_read_folio(folio, NULL); - folio->private =3D NULL; + swapcache =3D swapin_folio(entry, folio); + if (swapcache !=3D folio) + folio_put(folio); + folio =3D swapcache; } } else { - folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, - vmf); - swapcache =3D folio; + folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); } =20 if (!folio) { @@ -4779,6 +4748,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); } =20 + swapcache =3D folio; ret |=3D folio_lock_or_retry(folio, vmf); if (ret & VM_FAULT_RETRY) goto out_release; @@ -4848,24 +4818,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } =20 - /* allocated large folios for SWP_SYNCHRONOUS_IO */ - if (folio_test_large(folio) && !folio_test_swapcache(folio)) { - unsigned long nr =3D folio_nr_pages(folio); - unsigned long folio_start =3D ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); - unsigned long idx =3D (vmf->address - folio_start) / PAGE_SIZE; - pte_t *folio_ptep =3D vmf->pte - idx; - pte_t folio_pte =3D ptep_get(folio_ptep); - - if (!pte_same(folio_pte, 
pte_move_swp_offset(vmf->orig_pte, -idx)) || - swap_pte_batch(folio_ptep, nr, folio_pte) !=3D nr) - goto out_nomap; - - page_idx =3D idx; - address =3D folio_start; - ptep =3D folio_ptep; - goto check_folio; - } - nr_pages =3D 1; page_idx =3D 0; address =3D vmf->address; @@ -4909,12 +4861,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio)); BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page)); =20 + /* + * If a large folio already belongs to anon mapping, then we + * can just go on and map it partially. + * If not, with the large swapin check above failing, the page table + * have changed, so sub pages might got charged to the wrong cgroup, + * or even should be shmem. So we have to free it and fallback. + * Nothing should have touched it, both anon and shmem checks if a + * large folio is fully appliable before use. + * + * This will be removed once we unify folio allocation in the swap cache + * layer, where allocation of a folio stabilizes the swap entries. + */ + if (!folio_test_anon(folio) && folio_test_large(folio) && + nr_pages !=3D folio_nr_pages(folio)) { + if (!WARN_ON_ONCE(folio_test_dirty(folio))) + swap_cache_del_folio(folio); + goto out_nomap; + } + /* * Check under PT lock (to protect against concurrent fork() sharing * the swap entry concurrently) for certainly exclusive pages. */ if (!folio_test_ksm(folio)) { + /* + * The can_swapin_thp check above ensures all PTE have + * same exclusivenss, only check one PTE is fine. 
+ */ exclusive =3D pte_swp_exclusive(vmf->orig_pte); + if (exclusive) + check_swap_exclusive(folio, entry, nr_pages); if (folio !=3D swapcache) { /* * We have a fresh page that is not exposed to the @@ -4992,18 +4969,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) vmf->orig_pte =3D pte_advance_pfn(pte, page_idx); =20 /* ksm created a completely new copy */ - if (unlikely(folio !=3D swapcache && swapcache)) { + if (unlikely(folio !=3D swapcache)) { folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios which are either - * fully exclusive or fully shared, or new allocated large - * folios which are fully exclusive. If we ever get large - * folios within swapcache here, we have to be careful. + * We currently only expect !anon folios that are fully + * mappable. See the comment after can_swapin_thp above. */ - VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); - VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) !=3D nr_pages, folio); + VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, @@ -5043,12 +5018,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out: - /* Clear the swap cache pin for direct swapin after PTL unlock */ - if (need_clear_cache) { - swapcache_clear(si, entry, nr_pages); - if (waitqueue_active(&swapcache_wq)) - wake_up(&swapcache_wq); - } if (si) put_swap_device(si); return ret; @@ -5056,6 +5025,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out_page: + if (folio_test_swapcache(folio)) + folio_free_swap(folio); folio_unlock(folio); out_release: folio_put(folio); @@ -5063,11 +5034,6 @@ vm_fault_t do_swap_page(struct 
vm_fault *vmf) folio_unlock(swapcache); folio_put(swapcache); } - if (need_clear_cache) { - swapcache_clear(si, entry, nr_pages); - if (waitqueue_active(&swapcache_wq)) - wake_up(&swapcache_wq); - } if (si) put_swap_device(si); return ret; diff --git a/mm/swap.h b/mm/swap.h index 0fff92e42cfe..214e7d041030 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -268,6 +268,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t flag, struct mempolicy *mpol, pgoff_t ilx); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf); +struct folio *swapin_folio(swp_entry_t entry, struct folio *folio); void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr); =20 @@ -386,6 +387,11 @@ static inline struct folio *swapin_readahead(swp_entry= _t swp, gfp_t gfp_mask, return NULL; } =20 +static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *= folio) +{ + return NULL; +} + static inline void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr) { diff --git a/mm/swap_state.c b/mm/swap_state.c index d18ca765c04f..b3737c60aad9 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -544,6 +544,33 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry= , gfp_t gfp_mask, return result; } =20 +/** + * swapin_folio - swap-in one or multiple entries skipping readahead. + * @entry: starting swap entry to swap in + * @folio: a new allocated and charged folio + * + * Reads @entry into @folio, @folio will be added to the swap cache. + * If @folio is a large folio, the @entry will be rounded down to align + * with the folio size. + * + * Return: returns pointer to @folio on success. If folio is a large folio + * and this raced with another swapin, NULL will be returned. Else, if + * another folio was already added to the swap cache, return that swap + * cache folio instead. 
+ */ +struct folio *swapin_folio(swp_entry_t entry, struct folio *folio) +{ + struct folio *swapcache; + pgoff_t offset =3D swp_offset(entry); + unsigned long nr_pages =3D folio_nr_pages(folio); + + entry =3D swp_entry(swp_type(entry), round_down(offset, nr_pages)); + swapcache =3D __swap_cache_prepare_and_add(entry, folio, 0, true, false); + if (swapcache =3D=3D folio) + swap_read_folio(folio, NULL); + return swapcache; +} + /* * Locate a page of swap in physical memory, reserving swap cache space * and reading the disk if it is not already cached. --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pj1-f50.google.com (mail-pj1-f50.google.com [209.85.216.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8595033A01E for ; Wed, 29 Oct 2025 15:59:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.50 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753559; cv=none; b=DRJtjZAUNDTEpc2m8NTpI2P/9nSgug+eAwbskeMX2i2VCmfz09sMX/BDTM6JwDLZlYmQxdC3GDkKngawS1X3+UMITc3NHAHhYb2uoUEk/RGiQG/IScTamlVkBmJAul70oHO94QYD7Y1d+jZ105s8WZ+IFD/LZ1Wfe5JX7c62sj0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753559; c=relaxed/simple; bh=zJznav8+/tOGs8R4Q/7+iRpOZtTAakwuGpzIz4iTHL8=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=O83H60nrWMZaP+LLbZStDNgYFBLi8yoV7YXk95oA1yvHWni5cxhfkcK+edjTiJ3AUoj4iL4fIHGLFSAShNV9pAt9wO5F9eXmiHlYFkYAWt6PXCg3OUle0YtKAPA9tzkSI0LKQDfKhAkeqPmU1Q3EOdO56EvAxYjwDHJ7uTZVq64= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=kihRZnoc; arc=none smtp.client-ip=209.85.216.50 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="kihRZnoc" Received: by mail-pj1-f50.google.com with SMTP id 98e67ed59e1d1-33bbc4e81dfso97122a91.1 for ; Wed, 29 Oct 2025 08:59:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753557; x=1762358357; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=7sAbt8Aq11hD5HKF7vZsq1nYOhoSCTs6kYCBAgTfydA=; b=kihRZnoco38POIF6SoHj5zL4x5NrRylcWxbGQBVc5FvciLtEfVO1HEHI+SfG6+J6sS AYv1bfR6RjEJvz3L5zfCuRvEnxmIXxN4+nCWdFjjoAUH1CnZEGHF1m+Ib6bm5Ur6hB/E WiuPU1hD/8P9Kac9Snq6PejYKcRu+88Ov0VzkNda2uP4z643CiEP1+yQgvBjunoDY58W w7gBmQxLeJoF7KbQ6DaEly50wU4b70eOZokIphm+w5v7rtTif4D1mV9FbabUR8mCUjvQ zlf8t/88yiOGExQYnx2Z01IOZtCD21eQ85Lk9RVpjuCSWvwVRLnIfPz3QgIBEb1R+kg9 Qq7w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753557; x=1762358357; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=7sAbt8Aq11hD5HKF7vZsq1nYOhoSCTs6kYCBAgTfydA=; b=gsKsYkZ6WmQ1uWjbZsH2eEFoyGwf8QelWbxER46Dt94NLKyLlliSDy5kNOBSlGTZY4 Hnlp0NLMATNoqs++t62j6H3XmGNE58jmMxJjpY+kTUp14uoz860cDEFZXDe7Up34Mknj kGBLzygAz2dijb+sBhnP/uHWZ9v2Kn5miHXd2lvjBRgVKtHZrKf7Bbz+WIaqWEwFMm1j OieNf5+XUk1oldSJCquqICXX8f9nFEo+prE+E7zLUjrT7HLp+CMaCSOuDlV8LXFeGujU xMHwM8u1nmdquxzCoFcaU/yv45PpIFeUxerijk3s2rPnzGU5QVuuiedJNAI2W9ncJ9fO yBpw== X-Forwarded-Encrypted: i=1; AJvYcCU5wy5NIwQGn1mLqBsX2Ixz+ougmBdVxvSzLnkxDGx/H3yrmxlZU9nUWwC2g/larFwW9l7kUciVRw+JY04=@vger.kernel.org X-Gm-Message-State: AOJu0YyrDbbiqHk5W/dJvXm6tkdf9SZCei83y3iWFvGq3kD/HpvzuS1Z 
TCUzfvzadozXtxrTKcFpKDEMdfAKs6ZMBGs7bjGB3NyztF/JY8i6jMww X-Gm-Gg: ASbGncsws7gfPfGENB3kKNLZd/aih8eiWr/9oqnbPDsqZ1UOXqPP9W6rLVGl6XSKEzX S7MU3cOjqC2VBCQQMYm1w6ojgdwBaZZoTtJ4cW5DeIqDau5stCju02otdYJhYrSLLnmIO6g7uUZ QWmMqoCQOGM6jonXm6siITpXPzcVjIPMzXM4mZuEGXjEYWbXG1TWrAvt0EdtnlX1gZ0C7sGGVse Vk/OLwPRNC43HCLfzmr8MRHUatpwKKPd1G5nK7fd2QUz1U7zY2OiUlCzL9trZ+FZ63B+0MeuNGw 0fdPWWJ2/bcaO7rQ2WDpDVrAMnxwRaqpwuoisVJs73qQJfYWMMZ6OE80H0goMTacUKxSsOQkIj9 CEBpQCiSg3cKxsfpYMWqzXuuTMhQsRU1B6RJENmQzFdgBSrLXIND+P8pX+rA0SZ9aeI62zq8YID QhJg+n8/tTZ5bIdHNg5i6Q X-Google-Smtp-Source: AGHT+IFcx/tNH+mGfenFLGtMHSLOQFDVYtyEct3+kZwXnL2FV+gWAZHMp+QdJZXOLxNVix81OWQEkA== X-Received: by 2002:a17:90b:3c4f:b0:32e:5cba:ae26 with SMTP id 98e67ed59e1d1-3403a282201mr4126488a91.23.1761753556776; Wed, 29 Oct 2025 08:59:16 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:16 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:30 +0800 Subject: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-4-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song Now SWP_SYNCHRONOUS_IO devices are also using swap cache. 
One side effect is that a folio may stay in swap cache for a longer time due to lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios are being swapped out very frequently right after swapin, hence improving the performance. But the long pinning of swap slots also increases the fragmentation rate of the swap device significantly, and currently, all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also causes the backing memory to be pinned, increasing the memory pressure. So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices after swapin finishes. Swap cache has served its role as a synchronization layer to prevent any parallel swapin from wasting CPU or memory allocation, and the redundant IO is not a major concern for SWP_SYNCHRONOUS_IO devices. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 9a43d4811781..78457347ae60 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(str= uct vm_fault *vmf) return 0; } =20 -static inline bool should_try_to_free_swap(struct folio *folio, +static inline bool should_try_to_free_swap(struct swap_info_struct *si, + struct folio *folio, struct vm_area_struct *vma, unsigned int fault_flags) { if (!folio_test_swapcache(folio)) return false; + /* + * Try to free swap cache for SWP_SYNCHRONOUS_IO devices. + * Redundant IO is unlikely to be an issue for them, but a + * slot being pinned by swap cache may cause more fragmentation + * and delayed freeing of swap metadata. + */ + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) + return true; if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) || folio_test_mlocked(folio)) return true; @@ -4935,7 +4944,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * yet. 
*/ swap_free_nr(entry, nr_pages); - if (should_try_to_free_swap(folio, vma, vmf->flags)) + if (should_try_to_free_swap(si, folio, vma, vmf->flags)) folio_free_swap(folio); =20 add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pj1-f52.google.com (mail-pj1-f52.google.com [209.85.216.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2A15334B42B for ; Wed, 29 Oct 2025 15:59:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.52 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753563; cv=none; b=Y10EXmAhH9rEWIxDfuyCO8f45kvafvwj3PTz6tFzN2YTvpfZ/8s6cMvOP4UCnFsz8ls3kQWEgQNsclUmWBhpFlmg/oQHMDd7FWVmntDu1SgXDRLk1pdqfxPRI5di6st5WQH9kQMmIV60gPgoPIieZYWZS72UEJ7AHBxUOKIWToY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753563; c=relaxed/simple; bh=Y/8DN/bVrI1N0QSXJW/59K8K2Df8jsrwORtU+9uAPmE=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=q3Nks/hRWTEAG6ROReh9I68b/69NyuoMegTNRlFsrSaNEbiFHbd0KniVMuhivJWOjDA3tYtxiUQZWUYeRIZ2eG18IQvL5ubk2PR4rZQRatguvABzAr4suW0Ojt3CzkgLKYtd6UV1DCnLc/M2lge0/rNmzbNOlTBXtNVA4sB86vo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=DWAkDcQA; arc=none smtp.client-ip=209.85.216.52 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="DWAkDcQA" Received: by mail-pj1-f52.google.com with SMTP id 
98e67ed59e1d1-34029c5beabso89487a91.1 for ; Wed, 29 Oct 2025 08:59:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753561; x=1762358361; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=ndK9rgLMnW0lXi4W8q46Dx/IRDzQ0optlFYL4qNTikM=; b=DWAkDcQAQJ37dTyCuA70uJMwJlo+V5ZUDe4FJ+oB0zCcbzdcvXwgc5rTqOqpqu8qkO 8lJc72jGDG8az0L47RBpyT4DKXrm0hUAbU2xa/JQxqoKuo4JFBK0F8KZnyOdjhi7niiH Q2XmBvDcXcIwJM5NjcBj2YOKwwHwZIcTjTkf0ZhA/EjoPEslwV+0Rxy4bri73R1+dmHh K2NiDZIW4RRBhg503h7kRqUuSOswdBH5zN8H6hup9usyLHauDWHp5R1Bw5NfYhEPPuv6 XooiW/3xSxur3xk4f+T7KB3wvtfRx+BTxhdsS7gBktefNWrVPLNUdp6j4BkRWOBh3p/9 qDgQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753561; x=1762358361; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=ndK9rgLMnW0lXi4W8q46Dx/IRDzQ0optlFYL4qNTikM=; b=pbXZca9c3ehUkGpMEPvaKQg2JO+dT2oequBV16UVy2M9vPWuFZ311uIKkQ6yqw0609 gFAaUYt9x1TugYMYKFivlsFysj3De6EILxnb82hRbfSG6S1uHqwCrEI1y8hACJNAeLdg qv+ZdQHTQ3RNpWKnu7rBv9k5WlcHZF/rrnZXmwYt8mnGszBMUh6ReK6fV0p/egvhLjcS CmJKXcfT76CIEh0mQHJL14ZgfH/iTXBmAs8+1NML0JlljzDzIXe8x86LY4BD8e0z8dcy 1HkqZRR3Nj8iFx5mZABEql0GcPkOvHGFS97BDiije0S16lqnckc4TJAPVlZyMT9WqTkz zKvQ== X-Forwarded-Encrypted: i=1; AJvYcCX+OlgxfZBbWFkGDbI6gAZSXYg0Q9RLjFGg4GzlwSwncZDcBl1wuXXJeIlkfZO47bsED/ZYv6K2h++fpE0=@vger.kernel.org X-Gm-Message-State: AOJu0YxvuoKdocTR0Yhm0zRJ1yPdDRrEJo6Tsn+otvaQnh0Ioe5/cAG6 ALIZt/6Eav2GfldidneNoBsne4Buu8tB2TIL/yNc++d3rDSFb6BuGgZh X-Gm-Gg: ASbGnctrmrhuJW1Q9DCmXgUqO2+5taiF58pdzM4d3/EFmCJPFLmcJwu3H8iOFela69k dTi7IEmvMT9/Le4NXrN6zTIKdE120ZqgqAfi3q/0OS8HtqQwGd3SaL9TTg8uWb6Jjl8gCeUWiGw ng7pE3JjgXmy1gPVOaFTNUb/IAr2r9gpTYxCHuGu93TMzwzqhERfipv1j7ywBuQBwt32zHVkLau 
9onj/2NF/XWMMjRVIe7kvZyO+XyycWR+oTzqa6S1lh+KveaT2pxn7Rlv0n8tCsDULT59Rp+eEhs guN7lrbWO+/+WU+Sa7u7zolLoYwZgkM+z+4ihTEvIRU7CTexGhSyQnb9UvSa+8m3PeetkesfaZU sosxtgD/UR2mBrrwKRTx8IiOvaW0lmfMUkejUoex7KmVpz/kxz1QQrrZUUZWwZHDvxYEwGeBA+n ks4bj1Jr23CFrMlsQ9WPJS X-Google-Smtp-Source: AGHT+IFLkRpwM32hzKfCMcykzpH7VmCSQ2IyTOtMyj93WrL2JrbkqW/t9fuG8FuM15oE3b/6CKBLRw== X-Received: by 2002:a17:90b:288b:b0:338:3156:fc43 with SMTP id 98e67ed59e1d1-3403a15ab0emr4047870a91.11.1761753561487; Wed, 29 Oct 2025 08:59:21 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:21 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:31 +0800 Subject: [PATCH 05/19] mm, swap: simplify the code and reduce indention Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-5-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song Now swap cache is always used, multiple swap cache checks are no longer useful, remove them and reduce the code indention. No behavior change. 
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 89 +++++++++++++++++++++++++++++----------------------------= ---- 1 file changed, 43 insertions(+), 46 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 78457347ae60..6c5cd86c4a66 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4763,55 +4763,52 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_release; =20 page =3D folio_file_page(folio, swp_offset(entry)); - if (swapcache) { - /* - * Make sure folio_free_swap() or swapoff did not release the - * swapcache from under us. The page pin, and pte_same test - * below, are not enough to exclude that. Even if it is still - * swapcache, we need to check that the page's swap has not - * changed. - */ - if (unlikely(!folio_matches_swap_entry(folio, entry))) - goto out_page; - - if (unlikely(PageHWPoison(page))) { - /* - * hwpoisoned dirty swapcache pages are kept for killing - * owner processes (which may be unknown at hwpoison time) - */ - ret =3D VM_FAULT_HWPOISON; - goto out_page; - } - - /* - * KSM sometimes has to copy on read faults, for example, if - * folio->index of non-ksm folios would be nonlinear inside the - * anon VMA -- the ksm flag is lost on actual swapout. - */ - folio =3D ksm_might_need_to_copy(folio, vma, vmf->address); - if (unlikely(!folio)) { - ret =3D VM_FAULT_OOM; - folio =3D swapcache; - goto out_page; - } else if (unlikely(folio =3D=3D ERR_PTR(-EHWPOISON))) { - ret =3D VM_FAULT_HWPOISON; - folio =3D swapcache; - goto out_page; - } - if (folio !=3D swapcache) - page =3D folio_page(folio, 0); + /* + * Make sure folio_free_swap() or swapoff did not release the + * swapcache from under us. The page pin, and pte_same test + * below, are not enough to exclude that. Even if it is still + * swapcache, we need to check that the page's swap has not + * changed. 
+ */ + if (unlikely(!folio_matches_swap_entry(folio, entry))) + goto out_page; =20 + if (unlikely(PageHWPoison(page))) { /* - * If we want to map a page that's in the swapcache writable, we - * have to detect via the refcount if we're really the exclusive - * owner. Try removing the extra reference from the local LRU - * caches if required. + * hwpoisoned dirty swapcache pages are kept for killing + * owner processes (which may be unknown at hwpoison time) */ - if ((vmf->flags & FAULT_FLAG_WRITE) && folio =3D=3D swapcache && - !folio_test_ksm(folio) && !folio_test_lru(folio)) - lru_add_drain(); + ret =3D VM_FAULT_HWPOISON; + goto out_page; } =20 + /* + * KSM sometimes has to copy on read faults, for example, if + * folio->index of non-ksm folios would be nonlinear inside the + * anon VMA -- the ksm flag is lost on actual swapout. + */ + folio =3D ksm_might_need_to_copy(folio, vma, vmf->address); + if (unlikely(!folio)) { + ret =3D VM_FAULT_OOM; + folio =3D swapcache; + goto out_page; + } else if (unlikely(folio =3D=3D ERR_PTR(-EHWPOISON))) { + ret =3D VM_FAULT_HWPOISON; + folio =3D swapcache; + goto out_page; + } else if (folio !=3D swapcache) + page =3D folio_page(folio, 0); + + /* + * If we want to map a page that's in the swapcache writable, we + * have to detect via the refcount if we're really the exclusive + * owner. Try removing the extra reference from the local LRU + * caches if required. 
+ */ + if ((vmf->flags & FAULT_FLAG_WRITE) && + !folio_test_ksm(folio) && !folio_test_lru(folio)) + lru_add_drain(); + folio_throttle_swaprate(folio, GFP_KERNEL); =20 /* @@ -5001,7 +4998,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) pte, pte, nr_pages); =20 folio_unlock(folio); - if (folio !=3D swapcache && swapcache) { + if (unlikely(folio !=3D swapcache)) { /* * Hold the lock to avoid the swap entry to be reused * until we take the PT lock for the pte_same() check @@ -5039,7 +5036,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_unlock(folio); out_release: folio_put(folio); - if (folio !=3D swapcache && swapcache) { + if (folio !=3D swapcache) { folio_unlock(swapcache); folio_put(swapcache); } --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pj1-f48.google.com (mail-pj1-f48.google.com [209.85.216.48]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DED4733F8BE for ; Wed, 29 Oct 2025 15:59:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.48 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753568; cv=none; b=cPZZ+xHG6GSpWSdJnfU6LXRER61UD9tx9wJPBlvIoMhYyliNYPiBBbZiSr47YI2Os5MN6PUxqdZQeRfeBinGmBa/XLV7LH8F1lxFgNEXjXA9D8d93sYW1kVOy81iwuvVm6YJ/Uc9AD7D/lLuBi9YAYLEeue4ZcNbnLrV5sAaMN8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753568; c=relaxed/simple; bh=Dt4+BCkV7tFGOuDRH+P9tmZPbjgxHwLtkraqYpKF+PA=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=MzxhNBFcZtnCCKTovEGKAFBbCup+s8B7sE39tS8ztemQu5gdTt1LdsS6BDXHR6gkrCUCwy+665Ubu7DbWPmP6vke8SCJqM7c9dHVlxoNR3me0GlTfhVsQ7ej0ORswPFuD1IFZjoVd0AZCRuiusCmhSFNZakKjZLmkeuw/PpA38M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass 
(2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=Cnm0v+KA; arc=none smtp.client-ip=209.85.216.48 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Cnm0v+KA" Received: by mail-pj1-f48.google.com with SMTP id 98e67ed59e1d1-33292adb180so68425a91.3 for ; Wed, 29 Oct 2025 08:59:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753566; x=1762358366; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=c9Lc3r90zvCZ/SIB1R7dCgmTv8+FwGBKVqNJ6km7fiU=; b=Cnm0v+KAvlSLC5ra4SDhZNEtoX96+TeJ4rQoglwFbBot1ZKvt+SPj7OxQRiPt+WWpl +/eL0gbIEN+IR5D+G2U8gLXrEX/yqLaTnZV+Sf0QzEyD1yj7FnL4c2Is/EjdKGMhT5un KJwKxZwa0q1EACYMCgELp5HT0rAcsDp9RkJT12L/AEXORtED8XNgGV/qJBKcuuIZuHIR VlrJygfetUqwjUMj2dlAJ32mzln5yFDfzWs9Zvqm8zE80n0HyL5IWcl2jEVKUzE+eCtK 6eISe0uLy81Kl6BNhU1mdPQTvtNJaMPEr5kYpEIVMfOkD7ITpqOI5Q5LvCi0v7URHPbp 6VPA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753566; x=1762358366; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=c9Lc3r90zvCZ/SIB1R7dCgmTv8+FwGBKVqNJ6km7fiU=; b=m93KUHhDMKIrcXDtIXRAgs/aWXeMyBg/NJCISmm+ypoucUvMgfK6iuAGL6FjQuWozC siQKJ2ii7NFtiMo9O/ylCBOl6tp0fsnX9HVvd4UKpxEzG3Mp9MEUOvd6+8BDwj+LI1G/ YGHlR99ATjXb2ZSX96p9QJFzRQL61EIZdvdd/kytkOMmalulTDYuAHYMErCwGJOn4ifD Zm7AwjAl+6n49+FwM2b66dwr7idiyfFTESVb7rj1epZaFXXyqdf970XbU4hz5JJ5mo4o mBreMKBxo9diS6E9M/KMKI3kznMXPBJweKTcrru2Sj2aVKDsX+79ZURnzXVi1Irs9uTm dijg== X-Forwarded-Encrypted: i=1; 
AJvYcCXyomVaRYbqQlOp47WFXybconGMgaG0X/C1BKeXKf2frhRTwLeyXhKTYTw0IzQAnTS2tsUDc+BSzVHQRMQ=@vger.kernel.org X-Gm-Message-State: AOJu0YzPMpOWnkzr9GHfTWdnrh9v0XCNvwxbuod9ZKf27uKJUXRbQFRG lNsKs7nL6/BTRIOgszRQZZ7CJwsJTVRe6oLHsamznJ+J+Yr1swEnBgz6 X-Gm-Gg: ASbGncvF/vlpmcO6+9EFam2dSH41n3Xedhoat/dhj9C76e3NSU5wduPXwBWVSTvJDnU /3omG0MsIMT0yKxHRHXFb4mm0DuCdQo0p6m7OLGztJU5/cTtjvkbdrVzJuZXLs9B0fmLnuQ8+b6 olR5YJRDrrslY3TFzV68bZnu4vDgkEBfX2SOSyaGqDLmMX0dZqvLfuGnSabNbxZUowVZhHkRLsc CLztJcMXHyM/WsERWPu0rtc0pusxZzsYqF0JHRTuYkAtZ6izvkaH0LcgNX/5CB7JQlGB0dhH/07 JGW1eQ7M/zKBKG0SyZKLKlOkip+OsxczVzKEm5jRNnPEh+L1vCu1rItMege4YGM/1jK6U2L8u0o +vrkSpFvRYYTwSNJ59Hg+tp89rhtXjZSVQ58GH2+b/+VLqPgplQzu2ge6hxrpUZ/HsnnTn4e9T/ bjIH1w78SnTVV3URIf4GUo X-Google-Smtp-Source: AGHT+IE3X/h8/q0SwVnk59/fKVbT587Z49epkOdhBeEqGIhXexJoe/eYZfVuMy++nBU8/UnEmEl0uw== X-Received: by 2002:a17:90b:288a:b0:335:28ee:eeaf with SMTP id 98e67ed59e1d1-3403a291ac5mr3323678a91.29.1761753566161; Wed, 29 Oct 2025 08:59:26 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:25 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:32 +0800 Subject: [PATCH 06/19] mm, swap: free the swap cache after folio is mapped Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-6-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , 
"Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song To prevent repeated faults of parallel swapin of the same PTE, remove the folio from the swap cache after the folio is mapped. So any user faulting from the swap PTE should see the folio in the swap cache and wait on it. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/memory.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 6c5cd86c4a66..589d6fc3d424 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struc= t vm_fault *vmf) static inline bool should_try_to_free_swap(struct swap_info_struct *si, struct folio *folio, struct vm_area_struct *vma, + unsigned int extra_refs, unsigned int fault_flags) { if (!folio_test_swapcache(folio)) @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swa= p_info_struct *si, * reference only in case it's likely that we'll be the exclusive user. */ return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) && - folio_ref_count(folio) =3D=3D (1 + folio_nr_pages(folio)); + folio_ref_count(folio) =3D=3D (extra_refs + folio_nr_pages(folio)); } =20 static vm_fault_t pte_marker_clear(struct vm_fault *vmf) @@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ arch_swap_restore(folio_swap(entry, folio), folio); =20 - /* - * Remove the swap entry and conditionally try to free up the swapcache. - * We're already holding a reference on the page but haven't mapped it - * yet. 
- */ - swap_free_nr(entry, nr_pages); - if (should_try_to_free_swap(si, folio, vma, vmf->flags)) - folio_free_swap(folio); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); pte =3D mk_pte(page, vma->vm_page_prot); @@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) arch_do_swap_page_nr(vma->vm_mm, vma, address, pte, pte, nr_pages); =20 + /* + * Remove the swap entry and conditionally try to free up the + * swapcache. Do it after mapping so any raced page fault will + * see the folio in swap cache and wait for us. + */ + swap_free_nr(entry, nr_pages); + if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags)) + folio_free_swap(folio); + folio_unlock(folio); if (unlikely(folio !=3D swapcache)) { /* --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pf1-f175.google.com (mail-pf1-f175.google.com [209.85.210.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A1E7E34DB59 for ; Wed, 29 Oct 2025 15:59:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.175 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753573; cv=none; b=eIJaPMsXCFaSYYQdbc9+4kEr3iXe4L7nY9ln3DI/leN543Wb+tbWyYJ+jjVXCaxgB9rAYkbOmjSNxy8E5RGRIubq72ZJMd6kEHz3huHETYotqLFPRoer69czlBHXR8uq55zKtHjPL1B8Wk8szpMYnvd3g1assZv6VJU/LEe//yI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753573; c=relaxed/simple; bh=NNs4g+7G79cPTsm+6/8/54t0N8X7gATkIrTuv/UFieU=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=nWsX+lxkgBgAk6tgWzXO4nnV5ca1mXH7/DC5IZf/DvWt9pUkYQFMwsnq4FY8Ph7ufmEc8raP23CM+1KtA341sNiT6ep4KdFDFUeDTbWVYgS5sg8f17yHWlABForCVxwXNq++wfzFUFTowvk+LhL36neERDGNVh24LsYYbe67k40= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none 
dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=eMpEI4/S; arc=none smtp.client-ip=209.85.210.175 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="eMpEI4/S" Received: by mail-pf1-f175.google.com with SMTP id d2e1a72fcca58-7a4176547bfso70082b3a.2 for ; Wed, 29 Oct 2025 08:59:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753571; x=1762358371; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=//x7l/ZVG5GFT0i4YAgVrYYCeiDcBTswgh555TzkaWQ=; b=eMpEI4/S3Uni5g/k3OCoBN8TNZKOsqRCpM0uubnLfSKiPlYQt455K2aKmLj/tGxtDO csnZrYDzrz2GMT1vY30DySyt5V0wVfRuyWj5Bgf+3dCA55pO96cWXhYYi5Th/b/kDwqz 7pgIBoKOJX82xoNeCJj2kPKTQwi1srqnvn2QxXLG1s8kBVlhFhleAHnR/y450l9RFHuQ 75cm01adrcTlgiMGJrtucycZMi1Tw7IM66T9HqfqOnKTqTpIM2bcHOLT4cxhF2mR2RPx LzC3B9kj3ceQt/yFtQRxBtL0RprRcKPU2gypQRxWWscrAu+cz1+8LxTPRwylkbHPf6nP 80ww== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753571; x=1762358371; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=//x7l/ZVG5GFT0i4YAgVrYYCeiDcBTswgh555TzkaWQ=; b=XSO3ANDKv14sM9OYv0am+b08quEjRjKyLibzwIZ20A4i796pS6qPxM/XoXaPzP9RBS 0T8bB68OEWM8z8i//Oesq4YoO8WfxByhwxSGPrhp88h/gVALFrQ1ZWojiTN1ID83d/zf E5u+/mwwMbst4ffth6XSkAqytRwFATbCNqyY9FUb5fb4d0jkMhCv/EQLByKmvydQpb0R JL1MJODRHrK+ydpBLEZMSgEVwAZX4TvFp+m+VsvuPzahpyYiLWfZzz5P9Iq9nsUa1r2z gHrTNCcnjHrQHl99hlxJCIvuE/kaiU41fwXNYQH0w67JRxO55AD0Yf0xH4wRK5lLDQJh av1w== 
X-Forwarded-Encrypted: i=1; AJvYcCXyjjX6XPQfxyD/Y+T1Vi0UOY/E3wMhAsRb2nPilgMMMZJ508/shzie0KUXSycKlZ8TCOnTxQ9tBu0ITSg=@vger.kernel.org X-Gm-Message-State: AOJu0YzEE2y9tUlq86Fk4v5Ga/G7ewvbZdlWBhmE1j2cDhOKxWEYufjO 7fpiKDNyteIjcTRRAGvnASIrgydKdEaTVd3U9DImhKS8Wl24LWFE4cKh X-Gm-Gg: ASbGnct1t8gFR0JsBXGmMWgT1jO9pjWQyAvA0HPimCSVA6lvziQiZtNnfezI1pJLOSS +QOYW2tnFkkkpI6nYOJpeiVy1cl99sbIZToOAaUpgctYPYgfxwa5Vx0JyHFOcDbgrlplIpPTUA5 /FAGm2SeArAa5M/9c4MYogVyVi5DvECO0aNgQA6DMvIhUUUQWoHAt05ARKbCtE2DoKqkwDhM0Id RRNsRreLPs61Er3e5yz1seyIrB/hTVXQuvrwwBIHWwTHhuAomymz/cwj/YfhdbivcqOd0+dIOO8 hokwh+YaK9gsfPPZEFovPTfwNrmmsQu3aJi5GRo+Jyw/Y+ThaBiA1ANnSRsGUrqFeik4tVrsaC+ RbVqyqc2UiyQbfZz0adLQiTlESZPr8tRTh+WA7RoYBrYGgxUGD7kumQbU+KjXZNV1lVymcdLWD/ zC0e7DhMKP4vGUeZgcut+U X-Google-Smtp-Source: AGHT+IEWtY10yvvY4Cp0RzOTU6HW9O3cstRaZVUbKF1PyT4VABvIddsgFxCghFW1F8bVuzsAePb2qg== X-Received: by 2002:a17:902:ec81:b0:268:1034:ac8b with SMTP id d9443c01a7336-294dee334b3mr41546155ad.26.1761753570876; Wed, 29 Oct 2025 08:59:30 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.26 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:30 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:33 +0800 Subject: [PATCH 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-7-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, 
Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song Now the overhead of the swap cache is trivial to none, bypassing the swap cache is no longer a valid optimization. We have removed the cache bypass swapin for anon memory, now do the same for shmem. Many helpers and functions can be dropped now. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/shmem.c | 65 +++++++++++++++++--------------------------------------= ---- mm/swap.h | 4 ---- mm/swapfile.c | 35 +++++++++----------------------- 3 files changed, 27 insertions(+), 77 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index 6580f3cd24bb..759981435953 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2012,10 +2012,9 @@ static struct folio *shmem_swap_alloc_folio(struct i= node *inode, swp_entry_t entry, int order, gfp_t gfp) { struct shmem_inode_info *info =3D SHMEM_I(inode); + struct folio *new, *swapcache; int nr_pages =3D 1 << order; - struct folio *new; gfp_t alloc_gfp; - void *shadow; =20 /* * We have arrived here because our zones are constrained, so don't @@ -2055,34 +2054,19 @@ static struct folio *shmem_swap_alloc_folio(struct = inode *inode, goto fallback; } =20 - /* - * Prevent parallel swapin from proceeding with the swap cache flag. - * - * Of course there is another possible concurrent scenario as well, - * that is to say, the swap cache flag of a large folio has already - * been set by swapcache_prepare(), while another thread may have - * already split the large swap entry stored in the shmem mapping. - * In this case, shmem_add_to_page_cache() will help identify the - * concurrent swapin and return -EEXIST. 
- */ - if (swapcache_prepare(entry, nr_pages)) { + swapcache =3D swapin_folio(entry, new); + if (swapcache !=3D new) { folio_put(new); - new =3D ERR_PTR(-EEXIST); - /* Try smaller folio to avoid cache conflict */ - goto fallback; + if (!swapcache) { + /* + * The new folio is charged already, swapin can + * only fail due to another raced swapin. + */ + new =3D ERR_PTR(-EEXIST); + goto fallback; + } } - - __folio_set_locked(new); - __folio_set_swapbacked(new); - new->swap =3D entry; - - memcg1_swapin(entry, nr_pages); - shadow =3D swap_cache_get_shadow(entry); - if (shadow) - workingset_refault(new, shadow); - folio_add_lru(new); - swap_read_folio(new, NULL); - return new; + return swapcache; fallback: /* Order 0 swapin failed, nothing to fallback to, abort */ if (!order) @@ -2172,8 +2156,7 @@ static int shmem_replace_folio(struct folio **foliop,= gfp_t gfp, } =20 static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t inde= x, - struct folio *folio, swp_entry_t swap, - bool skip_swapcache) + struct folio *folio, swp_entry_t swap) { struct address_space *mapping =3D inode->i_mapping; swp_entry_t swapin_error; @@ -2189,8 +2172,7 @@ static void shmem_set_folio_swapin_error(struct inode= *inode, pgoff_t index, =20 nr_pages =3D folio_nr_pages(folio); folio_wait_writeback(folio); - if (!skip_swapcache) - swap_cache_del_folio(folio); + swap_cache_del_folio(folio); /* * Don't treat swapin error folio as alloced. 
Otherwise inode->i_blocks * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks) @@ -2289,7 +2271,6 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, swp_entry_t swap, index_entry; struct swap_info_struct *si; struct folio *folio =3D NULL; - bool skip_swapcache =3D false; int error, nr_pages, order; pgoff_t offset; =20 @@ -2332,7 +2313,6 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, folio =3D NULL; goto failed; } - skip_swapcache =3D true; } else { /* Cached swapin only supports order 0 folio */ folio =3D shmem_swapin_cluster(swap, gfp, info, index); @@ -2388,9 +2368,8 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, * and swap cache folios are never partially freed. */ folio_lock(folio); - if ((!skip_swapcache && !folio_test_swapcache(folio)) || - shmem_confirm_swap(mapping, index, swap) < 0 || - folio->swap.val !=3D swap.val) { + if (!folio_matches_swap_entry(folio, swap) || + shmem_confirm_swap(mapping, index, swap) < 0) { error =3D -EEXIST; goto unlock; } @@ -2422,12 +2401,7 @@ static int shmem_swapin_folio(struct inode *inode, p= goff_t index, if (sgp =3D=3D SGP_WRITE) folio_mark_accessed(folio); =20 - if (skip_swapcache) { - folio->swap.val =3D 0; - swapcache_clear(si, swap, nr_pages); - } else { - swap_cache_del_folio(folio); - } + swap_cache_del_folio(folio); folio_mark_dirty(folio); swap_free_nr(swap, nr_pages); put_swap_device(si); @@ -2438,14 +2412,11 @@ static int shmem_swapin_folio(struct inode *inode, = pgoff_t index, if (shmem_confirm_swap(mapping, index, swap) < 0) error =3D -EEXIST; if (error =3D=3D -EIO) - shmem_set_folio_swapin_error(inode, index, folio, swap, - skip_swapcache); + shmem_set_folio_swapin_error(inode, index, folio, swap); unlock: if (folio) folio_unlock(folio); failed_nolock: - if (skip_swapcache) - swapcache_clear(si, folio->swap, folio_nr_pages(folio)); if (folio) folio_put(folio); put_swap_device(si); diff --git a/mm/swap.h b/mm/swap.h index 
214e7d041030..e0f05babe13a 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -403,10 +403,6 @@ static inline int swap_writeout(struct folio *folio, return 0; } =20 -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_= t entry, int nr) -{ -} - static inline struct folio *swap_cache_get_folio(swp_entry_t entry) { return NULL; diff --git a/mm/swapfile.c b/mm/swapfile.c index 849be32377d9..3898c3a2be62 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1613,22 +1613,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t= entry) return NULL; } =20 -static void swap_entries_put_cache(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - unsigned long offset =3D swp_offset(entry); - struct swap_cluster_info *ci; - - ci =3D swap_cluster_lock(si, offset); - if (swap_only_has_cache(si, offset, nr)) { - swap_entries_free(si, ci, entry, nr); - } else { - for (int i =3D 0; i < nr; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); - } - swap_cluster_unlock(ci); -} - static bool swap_entries_put_map(struct swap_info_struct *si, swp_entry_t entry, int nr) { @@ -1764,13 +1748,21 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) void put_swap_folio(struct folio *folio, swp_entry_t entry) { struct swap_info_struct *si; + struct swap_cluster_info *ci; + unsigned long offset =3D swp_offset(entry); int size =3D 1 << swap_entry_order(folio_order(folio)); =20 si =3D _swap_info_get(entry); if (!si) return; =20 - swap_entries_put_cache(si, entry, size); + ci =3D swap_cluster_lock(si, offset); + if (swap_only_has_cache(si, offset, size)) + swap_entries_free(si, ci, entry, size); + else + for (int i =3D 0; i < size; i++, entry.val++) + swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + swap_cluster_unlock(ci); } =20 int __swap_count(swp_entry_t entry) @@ -3778,15 +3770,6 @@ int swapcache_prepare(swp_entry_t entry, int nr) return __swap_duplicate(entry, SWAP_HAS_CACHE, nr); } =20 -/* - * Caller should ensure entries belong to the same 
folio so - * the entries won't span cross cluster boundary. - */ -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int n= r) -{ - swap_entries_put_cache(si, entry, nr); -} - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entr= y's --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pj1-f52.google.com (mail-pj1-f52.google.com [209.85.216.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8506134DCF5 for ; Wed, 29 Oct 2025 15:59:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.52 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753578; cv=none; b=bbmlXnkEkLTNCBIkaokqnqYpU9VJRYzgV7FDjDGDcG4xvqR1lJkKewBArSNWLnNk8d1/pLpd+dHV14AJb9zr7sq2+9lPZASZhJwrdn2S5FoigyzyQnCNP4T+uxertzWnyMkrVl/fOMjbS/664icc3Q6TyUuBEem/geYrCxaGxq0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753578; c=relaxed/simple; bh=ktAJtCBnfIA5dlLd5QM2wr4zYggg7zfGW4XFUwzGcBM=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=O/VBiEO3v088TXyT97JtxRxC8tpTp8brhQFZMF1yzAzmLTnVF/fNWj47jL51Wy2ZGE97TvGgPh7Pl2MSr1ZBhgmHBx2WpNHnHMfAGas8k6dwBkxIIzXdTI0zqoYSsbVIcbPNgVHUqCGLk0vJcIhmmXKypepmfSvR8ZrBRBNF6gE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=ncJ1Kjv5; arc=none smtp.client-ip=209.85.216.52 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) 
header.d=gmail.com header.i=@gmail.com header.b="ncJ1Kjv5" Received: by mail-pj1-f52.google.com with SMTP id 98e67ed59e1d1-33067909400so58092a91.2 for ; Wed, 29 Oct 2025 08:59:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753576; x=1762358376; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=1touJCt8Ij13wXVlFKC93n8YCKM/wZXGXqqqq+QSOpw=; b=ncJ1Kjv5R0B8WfzF7t1fPXBpGyCKS5uiX+giQD72s64Da4IOj/xJV+h1izqAdyd/aA 2sLA0rZUpmZiTNNK4t7EVvGLHEmrWVKXXglUfydn38IphkxcvlwporjQVQenxf5ahqoo NVvnuiw3BJ9V1TR+JoWD5ErUa+H1PAGCIvD/KiEkhxQpJKvGekz92cygVe/B+Dy+X0Be TpXpuqM5/RWj2mQ2pt7Q4Qs+WMCgpX+88jUuHND9caPSMYb17ycg9kHlmHjQLUWtWgmN GOxEZVcQyebid8FTMVw9rlzx1jfK72mCbR1BOQehrLCGFQkpoJRW8AyWkWjBWXiU80Mp s1Cg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753576; x=1762358376; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=1touJCt8Ij13wXVlFKC93n8YCKM/wZXGXqqqq+QSOpw=; b=VRshzL/sGdRtQYNkxmC0ZySjjO4LvrPLpHXevuioXa6bFeFiWKDMrEhcFsUZ4DcLST 1KmCdGFOw0tg5zheQ2WiAMZoq0NnvmxBag8bkPYWUSS60xmZTwhVU17QL3GMLRN3hBDQ CK+XLbbn6qRWKYIdz12jCRpnGHE0/xx9ux0XkFHKgw6edjP2yDkKHbdFAAee4HQ+IukI dFToxcsf2AmyRsm5MAEL+Zfz5WWN4DIv5hB3paxTFd4br1luY0z8M5c2oksxzB4Avvau uNEQ5A1JlCWWBVcKQpcvzpRtHUcPxIHi06Ucndtp7lPnXq3p/N1Y2n7t7hPHblASzagR gQ/A== X-Forwarded-Encrypted: i=1; AJvYcCX8qk/m7IzB5BHoJ4RUEpx+ZbprIni4PIpY/dQgAkHa3Z3dn+4vd7zDdmtcAUnI0HWGo3QkGFEsJXiHZcw=@vger.kernel.org X-Gm-Message-State: AOJu0YxfRehGeFVWtvGMsbLaA+jLL0Cdujsg25bS1TgI9Krof9czBvQa XP3KRGgYWwLP3zZxXBxBItpfgdwkWZhTIMmFmWz63IB1PP+9mSO2C92M X-Gm-Gg: ASbGncvdWNO3mWlUEOyEgTvI8ICuqhmhQfTBGBHfLU/C0UvEN+sKfbrW3Icall1m7Ic YU95QU4A6Wyw6glesxXIW9TRrSX5u0EU3pBKlfKoE+vbv/aG4OdZRuYMDuwkGOsFKLFnvW9QqOs 
HYNqUtZozNG+iDUVhYuWukr0qxDlLzTfjQEnoa5D5bFfE45ICBV2Bi7qQ4+iom2CdS9mUvaa+rG 9+zYlXBjOXlQoa88d/pVhkVu4NH+NYVe3Xsw46qixlKS6A81DsAH3ewcDTg/UfGGhOmiYAZQeWI NJW+XNw3TwWmrfm0y/bF5qIIS3MvkiKbSnp7YoLzy8FJ9SEhmfyP/1D+sBQS/0dR/MEXNzUiZOs JnQDIfwpgVO4M/d3pjgmsFymSarDByXbwnOmQAT8KzsoR1eAq0PiPlHkb9bHwCm23bpshDygGNR aeTF6DOwjSR9f6W8XZ8Niq X-Google-Smtp-Source: AGHT+IFHuFprzRWbK9qS6JvIV2N1AbXu0GbLXktN8sr1Qqk543SOL3MlPU4u1zXhTQjIeIb0517XQQ== X-Received: by 2002:a17:902:ced2:b0:290:dc5d:c0d0 with SMTP id d9443c01a7336-294def36255mr40105735ad.49.1761753575586; Wed, 29 Oct 2025 08:59:35 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.31 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:35 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:34 +0800 Subject: [PATCH 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-8-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Nhat Pham The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry belongs to shmem during swapoff. However, swapoff has since been rewritten in the commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). 
Now having swap count =3D=3D SWAP_MAP_SHMEM value is basically the same as having swap count =3D=3D 1, and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only difference of note is that swap_shmem_alloc() does not check for -ENOMEM returned from __swap_duplicate(), but it is OK because shmem never re-duplicates any swap entry it owns. This will still be safe if we use (batched) swap_duplicate() instead. This commit adds swap_duplicate_nr(), the batched variant of swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated swap_shmem_alloc() helper to simplify the state machine (both mentally and in terms of actual code). We will also have an extra state/special value that can be repurposed (for swap entries that never get re-duplicated). Signed-off-by: Nhat Pham Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 15 +++++++-------- mm/shmem.c | 2 +- mm/swapfile.c | 42 +++++++++++++++++------------------------- 3 files changed, 25 insertions(+), 34 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 38ca3df68716..bf72b548a96d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -230,7 +230,6 @@ enum { /* Special value in first swap_map */ #define SWAP_MAP_MAX 0x3e /* Max count */ #define SWAP_MAP_BAD 0x3f /* Note page is bad */ -#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */ =20 /* Special value in each swap_map continuation */ #define SWAP_CONT_MAX 0x7f /* Max count */ @@ -458,8 +457,7 @@ bool folio_free_swap(struct folio *folio); void put_swap_folio(struct folio *folio, swp_entry_t entry); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern void swap_shmem_alloc(swp_entry_t, int); -extern int swap_duplicate(swp_entry_t); +extern int swap_duplicate_nr(swp_entry_t entry, int nr); extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void 
free_swap_and_cache_nr(swp_entry_t entry, int nr); @@ -514,11 +512,7 @@ static inline int add_swap_count_continuation(swp_entr= y_t swp, gfp_t gfp_mask) return 0; } =20 -static inline void swap_shmem_alloc(swp_entry_t swp, int nr) -{ -} - -static inline int swap_duplicate(swp_entry_t swp) +static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages) { return 0; } @@ -569,6 +563,11 @@ static inline int add_swap_extent(struct swap_info_str= uct *sis, } #endif /* CONFIG_SWAP */ =20 +static inline int swap_duplicate(swp_entry_t entry) +{ + return swap_duplicate_nr(entry, 1); +} + static inline void free_swap_and_cache(swp_entry_t entry) { free_swap_and_cache_nr(entry, 1); diff --git a/mm/shmem.c b/mm/shmem.c index 759981435953..46d54a1288fd 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1665,7 +1665,7 @@ int shmem_writeout(struct folio *folio, struct swap_i= ocb **plug, spin_unlock(&shmem_swaplist_lock); } =20 - swap_shmem_alloc(folio->swap, nr_pages); + swap_duplicate_nr(folio->swap, nr_pages); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); =20 BUG_ON(folio_mapped(folio)); diff --git a/mm/swapfile.c b/mm/swapfile.c index 3898c3a2be62..55362bb2a781 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -201,7 +201,7 @@ static bool swap_is_last_map(struct swap_info_struct *s= i, unsigned char *map_end =3D map + nr_pages; unsigned char count =3D *map; =20 - if (swap_count(count) !=3D 1 && swap_count(count) !=3D SWAP_MAP_SHMEM) + if (swap_count(count) !=3D 1) return false; =20 while (++map < map_end) { @@ -1522,12 +1522,6 @@ static unsigned char swap_entry_put_locked(struct sw= ap_info_struct *si, if (usage =3D=3D SWAP_HAS_CACHE) { VM_BUG_ON(!has_cache); has_cache =3D 0; - } else if (count =3D=3D SWAP_MAP_SHMEM) { - /* - * Or we could insist on shmem.c using a special - * swap_shmem_free() and free_shmem_swap_and_cache()... 
- */ - count =3D 0; } else if ((count & ~COUNT_CONTINUED) <=3D SWAP_MAP_MAX) { if (count =3D=3D COUNT_CONTINUED) { if (swap_count_continued(si, offset, count)) @@ -1625,7 +1619,7 @@ static bool swap_entries_put_map(struct swap_info_str= uct *si, if (nr <=3D 1) goto fallback; count =3D swap_count(data_race(si->swap_map[offset])); - if (count !=3D 1 && count !=3D SWAP_MAP_SHMEM) + if (count !=3D 1) goto fallback; =20 ci =3D swap_cluster_lock(si, offset); @@ -1679,12 +1673,10 @@ static bool swap_entries_put_map_nr(struct swap_inf= o_struct *si, =20 /* * Check if it's the last ref of swap entry in the freeing path. - * Qualified value includes 1, SWAP_HAS_CACHE or SWAP_MAP_SHMEM. */ static inline bool __maybe_unused swap_is_last_ref(unsigned char count) { - return (count =3D=3D SWAP_HAS_CACHE) || (count =3D=3D 1) || - (count =3D=3D SWAP_MAP_SHMEM); + return (count =3D=3D SWAP_HAS_CACHE) || (count =3D=3D 1); } =20 /* @@ -3672,7 +3664,6 @@ static int __swap_duplicate(swp_entry_t entry, unsign= ed char usage, int nr) =20 offset =3D swp_offset(entry); VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - VM_WARN_ON(usage =3D=3D 1 && nr > 1); ci =3D swap_cluster_lock(si, offset); =20 err =3D 0; @@ -3732,27 +3723,28 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) return err; } =20 -/* - * Help swapoff by noting that swap entry belongs to shmem/tmpfs - * (in which case its reference count is never incremented). - */ -void swap_shmem_alloc(swp_entry_t entry, int nr) -{ - __swap_duplicate(entry, SWAP_MAP_SHMEM, nr); -} - -/* - * Increase reference count of swap entry by 1. +/** + * swap_duplicate_nr() - Increase reference count of nr contiguous swap en= tries + * by 1. + * + * @entry: first swap entry from which we want to increase the refcount. + * @nr: Number of entries in range. + * * Returns 0 for success, or -ENOMEM if a swap_count_continuation is requi= red * but could not be atomically allocated. 
Returns 0, just as if it succee= ded, * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), wh= ich * might occur if a page table entry has got corrupted. + * + * Note that we are currently not handling the case where nr > 1 and we ne= ed to + * add swap count continuation. This is OK, because no such user exists - = shmem + * is the only user that can pass nr > 1, and it never re-duplicates any s= wap + * entry it owns. */ -int swap_duplicate(swp_entry_t entry) +int swap_duplicate_nr(swp_entry_t entry, int nr) { int err =3D 0; =20 - while (!err && __swap_duplicate(entry, 1, 1) =3D=3D -ENOMEM) + while (!err && __swap_duplicate(entry, 1, nr) =3D=3D -ENOMEM) err =3D add_swap_count_continuation(entry, GFP_ATOMIC); return err; } --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pl1-f182.google.com (mail-pl1-f182.google.com [209.85.214.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7A6B2346A0F for ; Wed, 29 Oct 2025 15:59:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753583; cv=none; b=AIM56M1lAVcp2vR4Z4Z/hRhG5DgYG60deSCii7GMXTB418DGi32NT/nGG4Na/I/DMy8eyfqQdxPfse3A+M/Vp0HAhYkycIMQZ+fVF4zmMAx9OfrTbbha2I2XaEnbVo8+QiiW2B9kP6YwX+gwao/da453FqnTLHqO4l8//IQ7ENI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753583; c=relaxed/simple; bh=p29r+QUfe/siyZxfbOGJwDMz+sqbsOaa5fULOpNDmmE=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=TqDJQiAkeWrXyctqRbCU2X6AfutSYun6I1jHlMxax2T2aGQ7f2Nb1Du2Zb7gvwV50x2zOLopkLqEv6Cn5oC+nQQIR5Wwbov9+Q6LxrJmpwy1JXoAOGpK1H2dij9KDkRzgnpH5aYIeDpoxPgzkLQf0XruONxRTn4BHa1sbvdjZFc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass 
smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=c1AOEayW; arc=none smtp.client-ip=209.85.214.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="c1AOEayW" Received: by mail-pl1-f182.google.com with SMTP id d9443c01a7336-2947d345949so65399225ad.3 for ; Wed, 29 Oct 2025 08:59:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753580; x=1762358380; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=s0PKlbwwkMgYNvXaDf4Er6Bj6mcJgZLUDsODPjxx62Q=; b=c1AOEayW34LRz3hirUk9exmqpYyVdX84xZw9H2A3+dFKd0MozVO7ydRB/JZ4XHrPza xvz2i9QbIDXVQEouejqOrK3BbIDPoFG7nzgN65FNy1EVkzgYGv/gqac1cFTvRkXSXYWG P5cDnoZJkQHXr9DP8zIqQ/JAP6+p1ZkAWHfXzaGW+DI7n3qfd/gyM6iu/EWhEbJ0Lm4c qC/cnQw5YQbguv51Y6cL9xeSfJgWkxGsM9Kx28FeCwlQS8rhcmK45IU1N8p38lCG1GZ9 avYorD7PM5Lz6nGUh1hj30esLT7EJ+Bvhr13XYEALXaiBU1hma6HJAdvAboWziPl9lIR 0HVg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753580; x=1762358380; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=s0PKlbwwkMgYNvXaDf4Er6Bj6mcJgZLUDsODPjxx62Q=; b=AenTPpyLxROI9Nvhx7NcU2RfJ3MIlc3/70juMC6bkwhYellsAb+FzGSORjPuJFnI+K xMnmCXcrT6m65fwEG1e/dHAKL/luv/EvofAoIZ7w3PyqitDAG/Xyb0t7RV9PllxAJxDd 3ihmUyjnW+jl8/vQRdBo0Uw6kGETEdJIv7eS2AmJOfrrqhNCK5vTat+hvYllXY2VMYDa JkRnITG46ywJxZkUczix1p3XKO0P0xUshe8zscCZyyl2ag8vCJZ+IwM7zocvstNoCOFk aaGS5qQRiYZ9uD0cU9ItgrhiietwPl0oiCrLfGMI3rrwZcKbPHvCcQzvLn9X5xlctHge MYQQ== X-Forwarded-Encrypted: i=1; 
AJvYcCVBKhp4dfTvKjL4tXfLRLr9W5l4K5o4MOZewUlvWAM6TR1ZOfAVJxloLBv/jZwELrR7Q/l69YPW5nYpSys=@vger.kernel.org X-Gm-Message-State: AOJu0YxCOb8qxD/pCsjMC3gQB4AJEcx1wzGUOFipmWXj/x7qPSela5Fs nifPDCyOTYHXxx8Z8YOY30U0ON9NDQeg0UgXlEayS0xAftFbYFMnpuTm X-Gm-Gg: ASbGnct//eqMsq5N4XJx5WAcpENp9zlDj2ZP6Shf5aDH0nXii6zxKqo+EwqZNYZO6JM Lnw5STmFiVX8OkSTXihN1HH4NsarppBERIrKKlypxvfaIsvtpOBlC+6KkpPeSRKb/4cEyvQBBee N++CY+jmRDBsdhwmdS5qBdcD2HXiUgscGwZmH8VOKPVxrmYPp2XvIuJlNGO6tyUGRgjyzsrACBm s+CWK1I6eljQIK9CNqXNZd9GNt/pHT5gARWUSVq7NPcYyw23mpy5J9lzN7ymsQheU0o+bLJcXJo fX6xTzw+MXnhtWHeWMRjYSW13V+m1ccOBvQABXP+8UShsqrlQqxlsee7Sy0eXh+D+mH9Mh+m6t3 esYwa4F4/tGGcJtL16Sg1lo3bzdx5R3AzduozyKc0mggPRqrq3YUbJOsP6YEr7dgJWwNnz3JGLx 4GKu9D/v4t2ogAC+YmAO21 X-Google-Smtp-Source: AGHT+IGCBgmMdgFmMsePvkB0PmSb+aFjRIQoDbF7xmYjaqCG2xJ9PpuIvij1g6RTJtvdlQmxXngKtQ== X-Received: by 2002:a17:902:e88b:b0:262:79a:93fb with SMTP id d9443c01a7336-294deea316fmr43936795ad.32.1761753580273; Wed, 29 Oct 2025 08:59:40 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:39 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:35 +0800 Subject: [PATCH 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-9-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi 
, Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song When checking if a swap entry is swapped out, we simply check if the bitwise result of the count value is larger than 0. But SWAP_MAP_BAD will also be considered as a swap count value larger than 0. SWAP_MAP_BAD being considered as a count value larger than 0 is useful for the swap allocator: bad slots will be seen as used slots, so the allocator will skip them. But for the swapped out check, this isn't correct. There is currently no observable issue. The swapped out check is only useful for readahead and the folio swapped-out status check. For readahead, the swap cache layer will abort upon checking and updating the swap map. For the folio swapped-out status check, the swap allocator will never allocate an entry of a bad slot to a folio, so that part is fine too. The worst that could happen now is redundant allocation/freeing of folios and wasted CPU time. This also makes it easier to get rid of swap map checking and update during folio insertion in the swap cache layer.
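To illustrate the difference described above, here is a minimal userspace sketch (not kernel code). The constant values are assumed to match include/linux/swap.h at the time of this series, and the names swapped_old() and swapped_new() are hypothetical stand-ins for the check before and after this patch; the cluster locking is omitted.

```c
#include <stdbool.h>

/* Assumed to mirror include/linux/swap.h; illustrative values only. */
#define SWAP_HAS_CACHE	0x40	/* flag: slot has a swap cache folio */
#define SWAP_MAP_BAD	0x3f	/* marker: slot is unusable */

/* Strip the cache flag, leaving the count bits (modelled on swap_count()). */
static unsigned char swap_count(unsigned char ent)
{
	return ent & ~SWAP_HAS_CACHE;
}

/* Hypothetical name: the old check, any non-zero count means "swapped". */
static bool swapped_old(unsigned char ent)
{
	return !!swap_count(ent);
}

/* Hypothetical name: the new check, bad slots are no longer "swapped". */
static bool swapped_new(unsigned char ent)
{
	unsigned char count = swap_count(ent);

	return count && count != SWAP_MAP_BAD;
}
```

With these definitions, a slot holding SWAP_MAP_BAD is reported as swapped by swapped_old() but not by swapped_new(), while a slot with a count of 1 is reported as swapped by both.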
Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 6 ++++-- mm/swap_state.c | 4 ++-- mm/swapfile.c | 22 +++++++++++----------- 3 files changed, 17 insertions(+), 15 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index bf72b548a96d..936fa8f9e5f3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -466,7 +466,8 @@ int find_first_swap(dev_t *device); extern unsigned int count_swap_pages(int, int); extern sector_t swapdev_block(int, pgoff_t); extern int __swap_count(swp_entry_t entry); -extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t en= try); +extern bool swap_entry_swapped(struct swap_info_struct *si, + unsigned long offset); extern int swp_swapcount(swp_entry_t entry); struct backing_dev_info; extern struct swap_info_struct *get_swap_device(swp_entry_t entry); @@ -535,7 +536,8 @@ static inline int __swap_count(swp_entry_t entry) return 0; } =20 -static inline bool swap_entry_swapped(struct swap_info_struct *si, swp_ent= ry_t entry) +static inline bool swap_entry_swapped(struct swap_info_struct *si, + unsigned long offset) { return false; } diff --git a/mm/swap_state.c b/mm/swap_state.c index b3737c60aad9..aaf8d202434d 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -526,8 +526,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry,= gfp_t gfp_mask, if (folio) return folio; =20 - /* Skip allocation for unused swap slot for readahead path. */ - if (!swap_entry_swapped(si, entry)) + /* Skip allocation for unused and bad swap slot for readahead. */ + if (!swap_entry_swapped(si, swp_offset(entry))) return NULL; =20 /* Allocate a new folio to be added into the swap cache. */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 55362bb2a781..d66141f1c452 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1765,21 +1765,21 @@ int __swap_count(swp_entry_t entry) return swap_count(si->swap_map[offset]); } =20 -/* - * How many references to @entry are currently swapped out? 
- * This does not give an exact answer when swap count is continued, - * but does include the high COUNT_CONTINUED flag to allow for that. +/** + * swap_entry_swapped - Check if the swap entry at @offset is swapped. + * @si: the swap device. + * @offset: offset of the swap entry. */ -bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry) +bool swap_entry_swapped(struct swap_info_struct *si, unsigned long offset) { - pgoff_t offset =3D swp_offset(entry); struct swap_cluster_info *ci; int count; =20 ci =3D swap_cluster_lock(si, offset); count =3D swap_count(si->swap_map[offset]); swap_cluster_unlock(ci); - return !!count; + + return count && count !=3D SWAP_MAP_BAD; } =20 /* @@ -1865,7 +1865,7 @@ static bool folio_swapped(struct folio *folio) return false; =20 if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio))) - return swap_entry_swapped(si, entry); + return swap_entry_swapped(si, swp_offset(entry)); =20 return swap_page_trans_huge_swapped(si, entry, folio_order(folio)); } @@ -3671,10 +3671,10 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) count =3D si->swap_map[offset + i]; =20 /* - * swapin_readahead() doesn't check if a swap entry is valid, so the - * swap entry could be SWAP_MAP_BAD. Check here with lock held. + * Allocator never allocates bad slots, and readahead is guarded + * by swap_entry_swapped. 
*/ - if (unlikely(swap_count(count) =3D=3D SWAP_MAP_BAD)) { + if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) { err =3D -ENOENT; goto unlock_out; } --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pj1-f41.google.com (mail-pj1-f41.google.com [209.85.216.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1A04A34F485 for ; Wed, 29 Oct 2025 15:59:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.41 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753587; cv=none; b=dFRdkaICNtk/D10oHSiOpRH3hOkcEf8gLPdh21vYbUgGpk8wiws7BsU9bcn3BzB/69qP9uVMRGK28p0zzuvyLoTeri8r+rTsNhgcXluxUBkuZL8JZECdm35VYXZLscwWWvtTCTEXNhP0jzRN1ERIBZ+t0LA/PowNO60EPHM1JPA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753587; c=relaxed/simple; bh=q7eu28LCPIzbssMIEYStlUI1LdlwlH8d7M3kS1M9/04=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=oceF2meG/gN7wKL8LS+DchrRDCUkIpT/wVTrH16ykldLVTEpvCP3PtvyL7XS5E+QXIJMxZSe1nBExUDym9nSWe2Q6iMFdUwpPjV//Rf548mKnflVqbeax59NQsgR/TgVlELyNpZsEHHiU7THsk2BGIv9eswzvDwjpQjJsqOpl2Q= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=AJxrLYD7; arc=none smtp.client-ip=209.85.216.41 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="AJxrLYD7" Received: by mail-pj1-f41.google.com with SMTP id 98e67ed59e1d1-34003f73a05so110136a91.1 for ; Wed, 29 Oct 2025 08:59:45 -0700 (PDT) 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753585; x=1762358385; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=tq6PoOm8fNGczgugQnE/jPj2Q0Y+cXCuawf9c8Nqzkk=; b=AJxrLYD7fJMw2/ko5ab3zJXitbkW2D0ABzNCz38XkH7gFk0E6x1cB0sZ50vzpnmvgK LSqDiZD65FDH6JKMzekYLyZdKi+i/sO4mhBl2+yyK4XMlmE+PefG7ImVTOoTgxUgFjXu KobHtJOWbZoXa1x+3xs5TIJypMPpIxTj59NTfYFVytiSDwf7iAtjXu5C9ExqUF8r+7WQ pEYz9QQ1Cf67J7FEugFhvn2yrMY4tFqi6+yEBDH5BD/8C4fbe2WpQMZC+S9SVzygEQq7 BLMkCdTgg0tyKVM7iGC/oTwkitvd0AqVfAR3XhUQmgHtu1pysrMuVAE/RiaLEqFdcXmB ROyQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753585; x=1762358385; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=tq6PoOm8fNGczgugQnE/jPj2Q0Y+cXCuawf9c8Nqzkk=; b=XrPiKIC1Zab62vGJzicwN3ia2Mbfvpu+sUPA9cBrc9a1+AoM3ZjAf6KcugPtN6oUus J93oby5z8sFMIvc3yrCnfh1fON6xtQmxakirxsJbIQI7BW7vRCx8OYgTi8sBiAO8X27T iOdPHv/pXryrHe2S6/As45HeS6dzwoyiv7SKPbq/uJFd6fG7+v7jLc4H9MM+bYd/j4g4 SHlXUEEV33Q6p5n7bN95B3cqgrHwOZLWmiacBMitRf17HAFO3QMCvEZcmkOra1teUoS3 DlSz3VErxN3Vl288Qrj+WGQXVO5++PHD2CkVC016vrfE69gp8u/ma2nfy+YGO+2jQUfu GPUA== X-Forwarded-Encrypted: i=1; AJvYcCVPHXH9d1crQq/hlOWGeHrC1MDekpcg0UixtkfPcodb2CF6R3olpmDj//y9nAay209wu+AzzNzD4j7Wgt0=@vger.kernel.org X-Gm-Message-State: AOJu0Yxp4iXdK1KXDjDjx1PSOEUdlrxdx5YlhGW0Jw24oZDrR9dHGP0o 6WoLyw8Pi9/ERgRPoN02PYZ7kegVSAzHQyRDvVH8O2+yRdusPHrJ4Fqq X-Gm-Gg: ASbGncsk16o4UgkTzA8CjxSsBLz6pJLmnjUSTgvl9sxBsEVXGpSwsvTGW1VnwmdA2PE TugzeQB0YElMlFENMo9wso3rizmpPXQQgrRr8J7Gt5zfBJPIS0+baO2Ucf/foio+6mxwFdYqBQd Evz7AGhxrE9MIdRKWJkFj19yGnNa2msSyqxEaulC2QttqXsuuyGlRDYnetpfV2CIi11PbeD87aW d/2ss5Skyy1mriZ1vPt+8MkDjzIBNxL1ZZ+EpyxhMk/7A5JqxfLa/JJ9A7zyDq34zpWrHVyKIdC 
+C8tIOVR+fGJHvQtMEBmria39/eMm2sMqV9ORUzLbCV0eL24ClMaK8lUHBULLIld2EXsmqvjyGF S91o9Skw+6USWcVJBg3ZYEaKpt0CwNb+SLI8Zm20fqbabcowiTUjL/9FHLK5ZQ0u4oYP+sYF67I CrX9Oydcxm7AIpp5gcfYOp X-Google-Smtp-Source: AGHT+IEJiZ7ou6E9Ow56ntpYQ8NsRTNR+WJ+uc6HP4HPdJyD9fOV4yH1D0+/1zf2TYa8UuiYOuvhFQ== X-Received: by 2002:a17:90b:3ec1:b0:33b:dff1:5f44 with SMTP id 98e67ed59e1d1-3404ac129demr113628a91.6.1761753585029; Wed, 29 Oct 2025 08:59:45 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:44 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:36 +0800 Subject: [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-10-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song Swap cluster cache reclaim requires releasing the lock, so some extra checks are needed after the reclaim. To prepare for checking swap cache using the swap table directly, consolidate the swap cluster reclaim and check the logic. Also, adjust it very slightly. By moving the cluster empty and usable check into the reclaim helper, it will avoid a redundant scan of the slots if the cluster is empty. 
Also, always scan the whole region during reclaim and don't skip slots covered by a reclaimed folio. Because the reclaim is lockless, new cache may land at any time, and for allocation we want all caches to be reclaimed to avoid fragmentation. Besides, if the scan offset is not aligned with the size of the reclaimed folio, we would be skipping some existing caches. There should be no observable behavior change, though this might slightly improve fragmentation or performance. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 47 +++++++++++++++++++++++------------------------ 1 file changed, 23 insertions(+), 24 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index d66141f1c452..e4c521528817 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cl= uster_info *cluster_info, return 0; } =20 -static bool cluster_reclaim_range(struct swap_info_struct *si, - struct swap_cluster_info *ci, - unsigned long start, unsigned long end) +static unsigned int cluster_reclaim_range(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long start, unsigned int order) { + unsigned int nr_pages =3D 1 << order; + unsigned long offset =3D start, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - unsigned long offset =3D start; int nr_reclaim; =20 spin_unlock(&ci->lock); do { switch (READ_ONCE(map[offset])) { case 0: - offset++; break; case SWAP_HAS_CACHE: nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); - if (nr_reclaim > 0) - offset +=3D nr_reclaim; - else + if (nr_reclaim < 0) goto out; break; default: goto out; } - } while (offset < end); + } while (++offset < end); out: spin_lock(&ci->lock); + + /* + * We just dropped ci->lock so cluster could be used by another + * order or got freed, check if it's still usable or empty.
+ */ + if (!cluster_is_usable(ci, order)) + return SWAP_ENTRY_INVALID; + if (cluster_is_empty(ci)) + return cluster_offset(si, ci); + /* * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. */ for (offset =3D start; offset < end; offset++) if (READ_ONCE(map[offset])) - return false; + return SWAP_ENTRY_INVALID; =20 - return true; + return start; } =20 static bool cluster_scan_range(struct swap_info_struct *si, @@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, unsigned long start =3D ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages =3D 1 << order; - bool need_reclaim, ret; + bool need_reclaim; =20 lockdep_assert_held(&ci->lock); =20 @@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct sw= ap_info_struct *si, if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim)) continue; if (need_reclaim) { - ret =3D cluster_reclaim_range(si, ci, offset, offset + nr_pages); - /* - * Reclaim drops ci->lock and cluster could be used - * by another order. Not checking flag as off-list - * cluster has no flag set, and change of list - * won't cause fragmentation. 
- */ - if (!cluster_is_usable(ci, order)) - goto out; - if (cluster_is_empty(ci)) - offset =3D start; + found =3D cluster_reclaim_range(si, ci, offset, order); /* Reclaim failed but cluster is usable, try next */ - if (!ret) + if (!found) continue; + offset =3D found; } if (!cluster_alloc_range(si, ci, offset, usage, order)) break; --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pj1-f53.google.com (mail-pj1-f53.google.com [209.85.216.53]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B432634FF69 for ; Wed, 29 Oct 2025 15:59:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.53 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753592; cv=none; b=HH77Iq2AgIHQB3y5PdJJo8e5guF0RW6Z70CsxYcFMk1CO3AINLCwTO7bD3/NtTMiqMqPNvyDZi7g3cm1sWdwTXNHsUn+FjVhD0AOYiIflm7yY8EucqKFhGjB432QqNDeAHp+Dc6nGWL1PC852bAJ/YBYpw18qyMbcEJKusmtNTY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753592; c=relaxed/simple; bh=3ptaXN+GAGoCho1TfU0eVUutT5FBhLQ1N3F5bH4ARsY=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=YSGRrUG49B+wGKUmPy+IJOsv23zn4/8HPEy2xvDaavNBiqySF3fSAw/O0zP0At5VW7esGMd9sngcH3mEDJ/zJ/8MVlHjNj3CYr4lImTJTgv0L6OcjwcU5/fo/9PwJcpIkxkE2jey2vKIb0BvsbcfnjaSqxjRkbMoQLHn5HayUEo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=fGWOPl/f; arc=none smtp.client-ip=209.85.216.53 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) 
header.d=gmail.com header.i=@gmail.com header.b="fGWOPl/f" Received: by mail-pj1-f53.google.com with SMTP id 98e67ed59e1d1-34029cd0cbdso84293a91.3 for ; Wed, 29 Oct 2025 08:59:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753590; x=1762358390; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=hEBe93+AqDqbmzUn4tECC01fE40N0PwOlnrL0O+c5gY=; b=fGWOPl/fFTEnj35SGpei98lgx5vkrhaylfXEqvKBflfbgchru3ENfNXjwwrC68qyDW Q7+I8JZ/NO3aaCq4BKuvKypgP9NDrzDp5PuwKlpEoJaqPMXQ5YaETe8KB0m32hgH55c2 dYSOn68GnK2w2+Fk1Q1skVn5qkfmo8fGWCmTIzQ29RwMgwzpR51t/n9Ij30NrgfFUZm1 KZH9wkrfLRtGwwMnB7aqzxKjNr4O7dwd1lqHF9Id9kkdkNZqHkXKP7/HG6ONiYRuG0MD 72l2q3hYtGqqOu2o0mGlVuTPuZUwz1oefFGwc9Rsi12Z507wHNM0ZnBNuHqisIPp8Sr8 ayEg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753590; x=1762358390; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=hEBe93+AqDqbmzUn4tECC01fE40N0PwOlnrL0O+c5gY=; b=UDwtTfg8w3dTMLDMNaiHWg4Z2kTUxG8QPKye4wj9xP8qGDOYw2ZBaLiw4GpE8upWur qANoQSklS3kvyRB6P+yf3IM3A+i73ZxnMvFTuvN00n1ohMmOBmFSKxQce8s8aZNeRDym DxeE258l2/l+rG5Yq1jLp+gpuNDK+NnHmm1tqpez61FAVr5RNHmshQc+cEDiQpQ17Vwj vR3ZeJmbjk4l0ePgtQ/IRk+aVhgw6W7rkpj0urxNdCyM8UFw+FGyd8IXssZIx8uuh3fh sfohCJVySnrCMZ9Hsrle9KBlS21xPcaur/dP8XyWg+kDa26I/juokTWBY8Q2HHpbHdNi NYBg== X-Forwarded-Encrypted: i=1; AJvYcCXWZk7csop4A4YqfnkfJ0r6bVm9wPgiKFCIkPtxfZbde7jUZUVptiS7Y7kYQaQ/m0rmkMx5DBjH+oSEUTQ=@vger.kernel.org X-Gm-Message-State: AOJu0Ywa3ByE/XdJ6xY/MibZ/7eOCgx14TwlvG4iVRZWjWmQU3tpjsn+ l6+SQRaOqVip9oCwcXmNyshcYQ8ATQFqkTC3E2X9//gopTJqnm+e4hQL X-Gm-Gg: ASbGncuOB1kBabzye7+oW40sMp501Xm63F4JLWnllDVcz5xYxa1Mxczd/SUfavF68ZT rwzbEr+0DTwmAQs1BjcFvzyRnd5rfrbtvbF5DMkiOWK5Qe/d+jaNYlKzQmj721BjGovurcfumfF 
SgFKPKO1Rn1HcA/0NLzBEsNP63RRMYGMs/WaVLZi4mGCZWiUf+U4y4oVduh/Dqnp+nv3xO/zRsL 97Q33IsYRWb0rbklj7/oarIzyt+Tsgeq1aGdsNX6c1c8LLo6B/yCYEU8n/27q7yQ6fWIg9qIruY vIyjjv3Cfu1B9zZ6ncoIjLs2xpaIOeSNwwFAnEdvtWFZDAOfnZCAnhF+JEZR3cUbF98CCf7kkNI TpE0z3CPHWRDTIXKqvMABLzizyziXygoqVTKGFOiZVPPHhIstEoMzvuJsux0ZNbBOZpLXoo5jZu COL01SSUyRZw== X-Google-Smtp-Source: AGHT+IE3wJlrEK8Y4HOI8VQFwkEg3x3geQZauMd9eO5Geg4rfGwAJlYNO9WbP0RB4Wfto3M74+vZpg== X-Received: by 2002:a17:90b:254b:b0:32b:9bec:158f with SMTP id 98e67ed59e1d1-3403a2f1625mr3442455a91.29.1761753589772; Wed, 29 Oct 2025 08:59:49 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:49 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:37 +0800 Subject: [PATCH 11/19] mm, swap: split locked entry duplicating into a standalone helper Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-11-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song No feature change, split the common logic into a stand alone helper to be reused later. 
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 62 +++++++++++++++++++++++++++++--------------------------= ---- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index e4c521528817..56054af12afd 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3646,26 +3646,14 @@ void si_swapinfo(struct sysinfo *val) * - swap-cache reference is requested but the entry is not used. -> ENOENT * - swap-mapped reference requested but needs continued swap count. -> EN= OMEM */ -static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) +static int swap_dup_entries(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage, int nr) { - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long offset; - unsigned char count; - unsigned char has_cache; - int err, i; - - si =3D swap_entry_to_info(entry); - if (WARN_ON_ONCE(!si)) { - pr_err("%s%08lx\n", Bad_file, entry.val); - return -EINVAL; - } - - offset =3D swp_offset(entry); - VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - ci =3D swap_cluster_lock(si, offset); + int i; + unsigned char count, has_cache; =20 - err =3D 0; for (i =3D 0; i < nr; i++) { count =3D si->swap_map[offset + i]; =20 @@ -3673,25 +3661,20 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) * Allocator never allocates bad slots, and readahead is guarded * by swap_entry_swapped. 
*/ - if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) { - err =3D -ENOENT; - goto unlock_out; - } + if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) + return -ENOENT; =20 has_cache =3D count & SWAP_HAS_CACHE; count &=3D ~SWAP_HAS_CACHE; =20 if (!count && !has_cache) { - err =3D -ENOENT; + return -ENOENT; } else if (usage =3D=3D SWAP_HAS_CACHE) { if (has_cache) - err =3D -EEXIST; + return -EEXIST; } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { - err =3D -EINVAL; + return -EINVAL; } - - if (err) - goto unlock_out; } =20 for (i =3D 0; i < nr; i++) { @@ -3710,14 +3693,31 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) * Don't need to rollback changes, because if * usage =3D=3D 1, there must be nr =3D=3D 1. */ - err =3D -ENOMEM; - goto unlock_out; + return -ENOMEM; } =20 WRITE_ONCE(si->swap_map[offset + i], count | has_cache); } =20 -unlock_out: + return 0; +} + +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) +{ + int err; + struct swap_info_struct *si; + struct swap_cluster_info *ci; + unsigned long offset =3D swp_offset(entry); + + si =3D swap_entry_to_info(entry); + if (WARN_ON_ONCE(!si)) { + pr_err("%s%08lx\n", Bad_file, entry.val); + return -EINVAL; + } + + VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); + ci =3D swap_cluster_lock(si, offset); + err =3D swap_dup_entries(si, ci, offset, usage, nr); swap_cluster_unlock(ci); return err; } --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pg1-f182.google.com (mail-pg1-f182.google.com [209.85.215.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B9AB9350A07 for ; Wed, 29 Oct 2025 15:59:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753597; cv=none; 
b=lCWn15hceYkERJ5q2MY5vVk2vP/ruoEO9Y8romyaCjuVP2vM6VQuChXnl3xq64pObj1wv3gJPyIcI/CahWRYLtY1evlhhTmXebEdUcRK9oVJvD09pQXW6aHRn63TLUScqhMqAggCp+9VI1MomQkHO94huNcOcUtf42jh6Ez2tXE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753597; c=relaxed/simple; bh=L+rdjO2vtp13dx9EhKBaRDdv3v/dIFE6uM6xYJiOIPk=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=FdEPpB3nfRCYTFvlDNXENb9Dmwk40jnQN+85mvQGxbCc5qN0mF5CjTR8MDzxpX9LapGr/sPvYubz1SB/o0JF+I8nfBmxGi3v/8OUCg7r7s9vKoHoKteqsYIW+deFw0GUh5q9VdELEt2+yu63c51YMstM/nrprrqcoc4jEiMLONc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=nUYMFliY; arc=none smtp.client-ip=209.85.215.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="nUYMFliY" Received: by mail-pg1-f182.google.com with SMTP id 41be03b00d2f7-b6ce696c18bso6814014a12.1 for ; Wed, 29 Oct 2025 08:59:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753595; x=1762358395; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=oro6FwqIkTmuaMWEGle1A3mIt+YqWKAuLSr+vB3zc4c=; b=nUYMFliYNDJ+rqw3NghmwK38AgY+zlgqDiAcb0BKBt1reSAiyP6yjT7TBhZ1cebbGj bqL3HFMLlsU7LnThDS3bV0zg9n9FDv+P6W1MGlAeo/XdFTOZ7nDFLhx0mSc72XIPNU2s 1LpYQ5ojA23Q5ZjtkS2nANijmS+f3Lm3nGvNpoupplpUnsrvUCCrve/CtCKQ7jp6+5v4 sjgb11BiM3I0xpe+kYW5MbgKMEgNz5X0nfFiTXl7u+GwcW5jBBA5FiE3y67mDzOC8qSM BuL8Zc3r70WhCnMPVwcYgm2M7ojMOnJ4Lm7lwlA34wK2lHV6OP0JwJj+/5qCi28KAjri 765w== 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753595; x=1762358395; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=oro6FwqIkTmuaMWEGle1A3mIt+YqWKAuLSr+vB3zc4c=; b=spuEC4CVrnvELgkXSru1kraT2uzCpe2j0XkNmjeNOBW1Q55pMZqOvwyWB4r67qMzKc znX0sbvp2gWvzipfURS5j1pkOQ/TILOOOda+taG/wfYPpf0b4Xz3lVoMRspLuws6LnHF Op7k+UFgYwEVCtuWwCbC2+A7qV6ntLwOgsnG7s2lDdBzDLTtXKCGB0rDMe5+pl1IPqkC 1JRWcqd1sZFDg9zgl8irccuhmwYJCt+TBmZ790ydVRAtF+OMaiHBFFekqBmxFWiNsqXo ZuGqJ4qBp1qjDhaMu7F/4MTrIZP5HCKVi5N9x3Xp1Iv4NzGVatW9nC4poN16VQddzZln u+SA== X-Forwarded-Encrypted: i=1; AJvYcCXoi3plZpijGua7otNtegVjJFjzpCBrMoS1wyVYuP1+9TRg94Wziq62ce5scMIKDttEzpgGjAKbJ+oWae0=@vger.kernel.org X-Gm-Message-State: AOJu0YzgQU3Ve2acMaJGbwe6OuNw9na9CyYPkFqUUocAW+eAy91srWst CARWvYB+XmCJOAN1n4MiNJqXAGWnXht82Qrf/4JXPKOrQRa3xV8P7Q9V X-Gm-Gg: ASbGncuLl01pS0pFD5HcuBXRdNn0vkQZfOY8FZm6+zFb2tLkkvclouTm1pSm8sZdLYQ /KdasHN+1POK2FQPd5c6Z/TeaFIrhP7O6/jaSnu+ukCVCtkBYNSdm/GGAMt3xlSALUBMp97CGju 942x/esM+bZiNG3X90MRhP2TlDsCk6CIY2Eu5JQ3RJhlT7hStIPxIgBEwg/N7+Z8p2Xw+4kqOyC CGRTgSuHOTkTIqTcLPaKQr6BwjlnfU4NIqCjFve5iNFweUzUVQGwh+yoTzoZ7sd+jB5TFl8AZfq FydkiD9ULwKjkbg8YIHwoQCq5ULauMkDV4ZXtfxoM+1JnLOZAu6nFah8LgpPaAqME3heOXPVvSl mHRBV8h1VLboL9e5f3DIYX8R+Rk8yaBPeo2sZo5Zfib/sNfoXYhiVqeY9HFMkTOQ3e03ibhKa3A QCpjLO7oIPhA== X-Google-Smtp-Source: AGHT+IHOMMDiT1d3KDkNSsyTngQflKh7MDlf/06YJIchSbedr8pH+Ab6kM95BleLFAmEkO+PZdRYrQ== X-Received: by 2002:a17:903:2301:b0:290:c5c8:941d with SMTP id d9443c01a7336-294dee9970fmr42231285ad.39.1761753594783; Wed, 29 Oct 2025 08:59:54 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:54 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:38 +0800 Subject: [PATCH 12/19] 
mm, swap: use swap cache as the swap in synchronize layer Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-12-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song The current swap-in synchronization mostly uses the swap_map's SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual work to swap in a folio. This has been causing many issues as it's just a poor implementation of a bit lock. Racing users have no idea what is pinning a slot, so they have to loop with a schedule_timeout_uninterruptible(1), which is ugly and causes long-tail latency or other performance issues. Besides, the abuse of SWAP_HAS_CACHE has been causing many other problems for synchronization and maintenance. This is the first step to remove this bit completely. This will also save one bit for the 8-bit swap counting field. We have just removed all swap-in paths that bypass the swap cache, and now both the swap cache and swap map are protected by the cluster lock. So now we can just resolve the swap synchronization with the swap cache layer directly using the cluster lock. Whoever inserts a folio in the swap cache first does the swap-in work. And because folios are locked during swap operations, other racing users will just wait on the folio lock. SWAP_HAS_CACHE will be removed in a later commit. For now, we still set it for some remaining users.
But now we do the bit setting and swap cache folio adding in the same critical section, after swap cache is ready. No one will have to spin on the SWAP_HAS_CACHE bit anymore. This both simplifies the logic and should improve the performance, eliminating issues like the one solved in commit 01626a1823024 ("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"), or the "skip_if_exists" from commit a65b0e7607ccb ("zswap: make shrinking memcg-aware"), which will be removed very soon. Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 6 --- mm/swap.h | 14 ++++++- mm/swap_state.c | 103 +++++++++++++++++++++++++++++------------------= ---- mm/swapfile.c | 39 ++++++++++++------- mm/vmscan.c | 1 - 5 files changed, 95 insertions(+), 68 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 936fa8f9e5f3..69025b473472 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -458,7 +458,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t en= try); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern int swap_duplicate_nr(swp_entry_t entry, int nr); -extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); @@ -518,11 +517,6 @@ static inline int swap_duplicate_nr(swp_entry_t swp, i= nt nr_pages) return 0; } =20 -static inline int swapcache_prepare(swp_entry_t swp, int nr) -{ - return 0; -} - static inline void swap_free_nr(swp_entry_t entry, int nr_pages) { } diff --git a/mm/swap.h b/mm/swap.h index e0f05babe13a..3cd99850bbaf 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -234,6 +234,14 @@ static inline bool folio_matches_swap_entry(const stru= ct folio *folio, return folio_entry.val =3D=3D round_down(entry.val, nr_pages); } =20 +/* Temporary internal helpers */ +void 
__swapcache_set_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry); +void __swapcache_clear_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr); + /* * All swap cache helpers below require the caller to ensure the swap entr= ies * used are valid and stablize the device by any of the following ways: @@ -247,7 +255,8 @@ static inline bool folio_matches_swap_entry(const struc= t folio *folio, */ struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); -void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **s= hadow); +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadow, bool alloc); void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, @@ -413,7 +422,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t e= ntry) return NULL; } =20 -static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t e= ntry, void **shadow) +static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t en= try, + void **shadow, bool alloc) { } =20 diff --git a/mm/swap_state.c b/mm/swap_state.c index aaf8d202434d..2d53e3b5e8e9 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -128,34 +128,66 @@ void *swap_cache_get_shadow(swp_entry_t entry) * @entry: The swap entry corresponding to the folio. * @gfp: gfp_mask for XArray node allocation. * @shadowp: If a shadow is found, return the shadow. + * @alloc: If it's the allocator that is trying to insert a folio. Allocat= or + * sets SWAP_HAS_CACHE to pin slots before insert so skip map upda= te. * * Context: Caller must ensure @entry is valid and protect the swap device * with reference count or locks. * The caller also needs to update the corresponding swap_map slots with * SWAP_HAS_CACHE bit to avoid race or conflict. 
*/ -void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **s= hadowp) +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadowp, bool alloc) { + int err; void *shadow =3D NULL; + struct swap_info_struct *si; unsigned long old_tb, new_tb; struct swap_cluster_info *ci; - unsigned int ci_start, ci_off, ci_end; + unsigned int ci_start, ci_off, ci_end, offset; unsigned long nr_pages =3D folio_nr_pages(folio); =20 VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); =20 + si =3D __swap_entry_to_info(entry); new_tb =3D folio_to_swp_tb(folio); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; ci_off =3D ci_start; - ci =3D swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry)); + offset =3D swp_offset(entry); + ci =3D swap_cluster_lock(si, swp_offset(entry)); + if (unlikely(!ci->table)) { + err =3D -ENOENT; + goto failed; + } do { - old_tb =3D __swap_table_xchg(ci, ci_off, new_tb); - WARN_ON_ONCE(swp_tb_is_folio(old_tb)); + old_tb =3D __swap_table_get(ci, ci_off); + if (unlikely(swp_tb_is_folio(old_tb))) { + err =3D -EEXIST; + goto failed; + } + if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset))= )) { + err =3D -ENOENT; + goto failed; + } if (swp_tb_is_shadow(old_tb)) shadow =3D swp_tb_to_shadow(old_tb); + offset++; + } while (++ci_off < ci_end); + + ci_off =3D ci_start; + offset =3D swp_offset(entry); + do { + /* + * Still need to pin the slots with SWAP_HAS_CACHE since + * swap allocator depends on that. 
+ */ + if (!alloc) + __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); + __swap_table_set(ci, ci_off, new_tb); + offset++; } while (++ci_off < ci_end); =20 folio_ref_add(folio, nr_pages); @@ -168,6 +200,11 @@ void swap_cache_add_folio(struct folio *folio, swp_ent= ry_t entry, void **shadowp =20 if (shadowp) *shadowp =3D shadow; + return 0; + +failed: + swap_cluster_unlock(ci); + return err; } =20 /** @@ -186,6 +223,7 @@ void swap_cache_add_folio(struct folio *folio, swp_entr= y_t entry, void **shadowp void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *fo= lio, swp_entry_t entry, void *shadow) { + struct swap_info_struct *si; unsigned long old_tb, new_tb; unsigned int ci_start, ci_off, ci_end; unsigned long nr_pages =3D folio_nr_pages(folio); @@ -195,6 +233,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *c= i, struct folio *folio, VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio); =20 + si =3D __swap_entry_to_info(entry); new_tb =3D shadow_swp_to_tb(shadow); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; @@ -210,6 +249,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *c= i, struct folio *folio, folio_clear_swapcache(folio); node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages); lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); + __swapcache_clear_cached(si, ci, entry, nr_pages); } =20 /** @@ -231,7 +271,6 @@ void swap_cache_del_folio(struct folio *folio) __swap_cache_del_folio(ci, folio, entry, NULL); swap_cluster_unlock(ci); =20 - put_swap_folio(folio, entry); folio_ref_sub(folio, folio_nr_pages(folio)); } =20 @@ -423,67 +462,37 @@ static struct folio *__swap_cache_prepare_and_add(swp= _entry_t entry, gfp_t gfp, bool charged, bool skip_if_exists) { - struct folio *swapcache; + struct folio *swapcache =3D NULL; void *shadow; int ret; =20 - /* - * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio 
- * into the swap cache. Loop with a schedule delay if raced with - * another process setting SWAP_HAS_CACHE. This hackish loop will - * be fixed very soon. - */ + __folio_set_locked(folio); + __folio_set_swapbacked(folio); for (;;) { - ret =3D swapcache_prepare(entry, folio_nr_pages(folio)); + ret =3D swap_cache_add_folio(folio, entry, &shadow, false); if (!ret) break; =20 /* - * The skip_if_exists is for protecting against a recursive - * call to this helper on the same entry waiting forever - * here because SWAP_HAS_CACHE is set but the folio is not - * in the swap cache yet. This can happen today if - * mem_cgroup_swapin_charge_folio() below triggers reclaim - * through zswap, which may call this helper again in the - * writeback path. - * - * Large order allocation also needs special handling on + * Large order allocation needs special handling on * race: if a smaller folio exists in cache, swapin needs * to fallback to order 0, and doing a swap cache lookup * might return a folio that is irrelevant to the faulting * entry because @entry is aligned down. Just return NULL. */ if (ret !=3D -EEXIST || skip_if_exists || folio_test_large(folio)) - return NULL; + goto failed; =20 - /* - * Check the swap cache again, we can only arrive - * here because swapcache_prepare returns -EEXIST. - */ swapcache =3D swap_cache_get_folio(entry); if (swapcache) - return swapcache; - - /* - * We might race against __swap_cache_del_folio(), and - * stumble across a swap_map entry whose SWAP_HAS_CACHE - * has not yet been cleared. Or race against another - * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE - * in swap_map, but not yet added its folio to swap cache. 
- */ - schedule_timeout_uninterruptible(1); + goto failed; } =20 - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) { - put_swap_folio(folio, entry); - folio_unlock(folio); - return NULL; + swap_cache_del_folio(folio); + goto failed; } =20 - swap_cache_add_folio(folio, entry, &shadow); memcg1_swapin(entry, folio_nr_pages(folio)); if (shadow) workingset_refault(folio, shadow); @@ -491,6 +500,10 @@ static struct folio *__swap_cache_prepare_and_add(swp_= entry_t entry, /* Caller will initiate read into locked folio */ folio_add_lru(folio); return folio; + +failed: + folio_unlock(folio); + return swapcache; } =20 /** diff --git a/mm/swapfile.c b/mm/swapfile.c index 56054af12afd..415db36d85d3 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1461,7 +1461,11 @@ int folio_alloc_swap(struct folio *folio) if (!entry.val) return -ENOMEM; =20 - swap_cache_add_folio(folio, entry, NULL); + /* + * Allocator has pinned the slots with SWAP_HAS_CACHE + * so it should never fail + */ + WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); =20 return 0; =20 @@ -1567,9 +1571,8 @@ static unsigned char swap_entry_put_locked(struct swa= p_info_struct *si, * do_swap_page() * ... swapoff+swapon * swap_cache_alloc_folio() - * swapcache_prepare() - * __swap_duplicate() - * // check swap_map + * swap_cache_add_folio() + * // check swap_map * // verify PTE not changed * * In __swap_duplicate(), the swap_map need to be checked before @@ -3748,17 +3751,25 @@ int swap_duplicate_nr(swp_entry_t entry, int nr) return err; } =20 -/* - * @entry: first swap entry from which we allocate nr swap cache. - * - * Called when allocating swap cache for existing swap entries, - * This can return error codes. Returns 0 at success. - * -EEXIST means there is a swap cache. - * Note: return code is different from swap_duplicate(). 
- */ -int swapcache_prepare(swp_entry_t entry, int nr) +/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */ +void __swapcache_set_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry) +{ + WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1)); +} + +/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock = */ +void __swapcache_clear_cached(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr) { - return __swap_duplicate(entry, SWAP_HAS_CACHE, nr); + if (swap_only_has_cache(si, swp_offset(entry), nr)) { + swap_entries_free(si, ci, entry, nr); + } else { + for (int i =3D 0; i < nr; i++, entry.val++) + swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + } } =20 /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 5e74a2807930..76b9c21a7fe2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -762,7 +762,6 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, __swap_cache_del_folio(ci, folio, swap, shadow); memcg1_swapout(folio, swap); swap_cluster_unlock_irq(ci); - put_swap_folio(folio, swap); } else { void (*free_folio)(struct folio *); =20 --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pj1-f47.google.com (mail-pj1-f47.google.com [209.85.216.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 748D3347FE9 for ; Wed, 29 Oct 2025 16:00:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.47 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753603; cv=none; b=DfUiJ5s6PwxncKxUmlJClCqgFZIduHZxUKuqOjwOQVn5rB9ibvjerVsBjAll/ojEG3cdhQcv+HUUpdNiM3xh33attfhXpFSBf2hdFED0a7zM8A4ck3lLtGUa/uH/7Cnf10UzvXnRGIUVnBzcoAIgUI9lCAXv3EqK9oBhn+p+77k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; 
t=1761753603; c=relaxed/simple; bh=ByTK2SoyQgaA/mDCQcIjTZ8xrs8DpFkCl7HauPrEv1g=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=bvB7ucWUwi4u80yhuBxjkwW/4bgwidtlkX+v0b+fY1E0aY7jAFmbwQRe3M2PPIQiTGm0pjtIz3my9On7hujzP/UeA6u4cp0Ug5u4T6aAYS8220EOA2fx3k7TxzVPo/pnVjboC2GonXppml/OvH+a2+tJUg3PCclO+yzydhejSlM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=geSZPLtb; arc=none smtp.client-ip=209.85.216.47 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="geSZPLtb" Received: by mail-pj1-f47.google.com with SMTP id 98e67ed59e1d1-3307de086d8so65875a91.2 for ; Wed, 29 Oct 2025 09:00:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753599; x=1762358399; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=HaJeOLc41bvRnU13WOF01XHE844GkJvbSlk0im22rDM=; b=geSZPLtbm/Eove1aWva6FamxThVtO0NUquAmNQAsOe0N32Z+GNbateWG6qZh9SnBiq dr6tCnw4rBfW4rJVJ6S7bb79n3JehAGXUAZKow7JJ4iT1S9ISLxDLi+Sc02CvOSw6HMe GtWV7JVNvtWbM6QPCSAMsNI/az9tEdVbKKd6Cnyd57mB6p98I8N5PrI4L3W1gAGktAdu 64/KqmiUp3jOZ6BP6ekGb5HVMU+nGpPuSBB0gsQhFAwKug2CxGOoUjW0Mdg4ZNHE8HuO oqu/0QfrZRvcreqRi3N97Nj1axahmsLi9y2upmX8j0ccWo6fIxpoKdVYw1P0gWZ2yS+0 dfWw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753599; x=1762358399; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc 
:subject:date:message-id:reply-to; bh=HaJeOLc41bvRnU13WOF01XHE844GkJvbSlk0im22rDM=; b=oiqPZrNa3NO2LqhSFX/lJJ2Ft8SRUkRRr3IuM4O8XtcxrvX/vtlkRqtbbCbxtrQv28 p7M825+gf2+9JzkGDIRS/yix9U2tYCcEEMWD2qAdkOgtKPilWH6Te3ikL7ua75XW4MHd ViUQ7AEgn6OSs0wbWfrd9w2u7hQUJhXg/0YmyVDaQDn1/PvyOQku4q033/ijBNQG9yrO 5PqZRpd6tyM5S9d7kA1MrsVeaf0O5GsxImT88jQl1Ju9A9ckOPVWHk+0OjWmDnE0nLeK 8GmfdoWObHuIXyi0A06SeyXn9skIgMvcUA7uG+jO0LSEW1gTBcteNCvGo/xfSI0qlP2l D6kQ== X-Forwarded-Encrypted: i=1; AJvYcCXqbOajzOLiJajdsNM5Lnuun7eExtvgrdVg/SBKgQT8lLuv3q9kFQxJgaM9s/ilmImzardaEpQaHvNoCkk=@vger.kernel.org X-Gm-Message-State: AOJu0YyPMlJVG+j5ip5/AFErje4aW0uRpSxEUQ9eoQ1LXN4SqIWt8+4v xNpkxAHVnwHwev8tJgEX7rqOHL1cyZ/FhQIlk23SD1HSaapAf3JYz4W5 X-Gm-Gg: ASbGncuUW2ocCKrB2C7q7sR/G4VPZkaBGqUYM+tv/WWR1ce3ctCEYveobL2Br/Fqhjx I6B2AuXkf2Lv8/0fGISEJt2x3UCI5nYHC8kFmzBDdYtmQ3ls9htMLFsphygZIynDzPNGn8JGrDf ORFByni1U0qDN9NBmxGbtdrFyiXTkb5H1wyRE4Y+oSWm7RPTYftsud9rRTL6x7Lp/lt2MnGmQ2l Mtl1I+d8SWEl+etslvREhmfrg08XUIAd1NwJ/XwK9AvN4W3k9KV2X9o8bcS/1CTuyoBn5fMzPz4 usd7FFJFpdRF7KUybB+tebFzgRfHEvKbBVMJ6H+dqsnrcvrzA3fS0DTqlIr/m1SLyjT88IbX3NP x8ZBajtBlGSM8YtxQAQTP0NBKGMwZPkRJx9IWwOwGJRTpLpbdP53dnUtCtKTbls9v2PtMRep0W1 YDdLd5QqJtyxYATYcvvw4+WnZkCeAqfbk= X-Google-Smtp-Source: AGHT+IFBM69X5pxHyD3MEtjs9Hok3q/yPZBDJUCYgD5ZLroaODN6kmz+XKztQfn2faRgu7hlU5vjoA== X-Received: by 2002:a17:90b:2e4b:b0:33b:bf8d:6172 with SMTP id 98e67ed59e1d1-3403a305995mr3636072a91.34.1761753599453; Wed, 29 Oct 2025 08:59:59 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 08:59:59 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:39 +0800 Subject: [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; 
charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20251029-swap-table-p2-v1-13-3d43f3b6ec32@tencent.com>
References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park, Hugh Dickins, Baolin Wang, "Huang, Ying", Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song
X-Mailer: b4 0.14.3

From: Kairui Song

Remove the "skip if exists" check from commit a65b0e7607ccb ("zswap:
make shrinking memcg-aware"). It was needed because there was a tiny
time window between setting the SWAP_HAS_CACHE bit and actually adding
the folio to the swap cache. If one user tried to add a folio to the
swap cache while another user had set SWAP_HAS_CACHE but was interrupted
before adding its folio to the swap cache, it could lead to a deadlock.

The bit is now set in the same critical section that adds the folio, so
the check is no longer needed. Remove it and clean up.

Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/swap.h       |  2 +-
 mm/swap_state.c | 27 ++++++++++-----------------
 mm/zswap.c      |  2 +-
 3 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 3cd99850bbaf..a3c5f2dca0d5 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -260,7 +260,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
 void swap_cache_del_folio(struct folio *folio);
 struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
 				     struct mempolicy *mpol, pgoff_t ilx,
-				     bool *alloced, bool skip_if_exists);
+				     bool *alloced);
 /* Below helpers require the caller to lock and pass in the swap cluster.
= */ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow); diff --git a/mm/swap_state.c b/mm/swap_state.c index 2d53e3b5e8e9..d2bcca92b6e0 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -447,8 +447,6 @@ void swap_update_readahead(struct folio *folio, struct = vm_area_struct *vma, * @folio: folio to be added. * @gfp: memory allocation flags for charge, can be 0 if @charged if true. * @charged: if the folio is already charged. - * @skip_if_exists: if the slot is in a cached state, return NULL. - * This is an old workaround that will be removed shortly. * * Update the swap_map and add folio as swap cache, typically before swapi= n. * All swap slots covered by the folio must have a non-zero swap count. @@ -459,8 +457,7 @@ void swap_update_readahead(struct folio *folio, struct = vm_area_struct *vma, */ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry, struct folio *folio, - gfp_t gfp, bool charged, - bool skip_if_exists) + gfp_t gfp, bool charged) { struct folio *swapcache =3D NULL; void *shadow; @@ -480,7 +477,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, * might return a folio that is irrelevant to the faulting * entry because @entry is aligned down. Just return NULL. */ - if (ret !=3D -EEXIST || skip_if_exists || folio_test_large(folio)) + if (ret !=3D -EEXIST || folio_test_large(folio)) goto failed; =20 swapcache =3D swap_cache_get_folio(entry); @@ -513,8 +510,6 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, * @mpol: NUMA memory allocation policy to be applied * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE * @new_page_allocated: sets true if allocation happened, false otherwise - * @skip_if_exists: if the slot is a partially cached state, return NULL. - * This is a workaround that would be removed shortly. 
* * Allocate a folio in the swap cache for one swap slot, typically before * doing IO (swap in or swap out). The swap slot indicated by @entry must @@ -526,8 +521,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, */ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx, - bool *new_page_allocated, - bool skip_if_exists) + bool *new_page_allocated) { struct swap_info_struct *si =3D __swap_entry_to_info(entry); struct folio *folio; @@ -548,8 +542,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry,= gfp_t gfp_mask, if (!folio) return NULL; /* Try add the new folio, returns existing folio or NULL on failure. */ - result =3D __swap_cache_prepare_and_add(entry, folio, gfp_mask, - false, skip_if_exists); + result =3D __swap_cache_prepare_and_add(entry, folio, gfp_mask, false); if (result =3D=3D folio) *new_page_allocated =3D true; else @@ -578,7 +571,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct fo= lio *folio) unsigned long nr_pages =3D folio_nr_pages(folio); =20 entry =3D swp_entry(swp_type(entry), round_down(offset, nr_pages)); - swapcache =3D __swap_cache_prepare_and_add(entry, folio, 0, true, false); + swapcache =3D __swap_cache_prepare_and_add(entry, folio, 0, true); if (swapcache =3D=3D folio) swap_read_folio(folio, NULL); return swapcache; @@ -606,7 +599,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, = gfp_t gfp_mask, =20 mpol =3D get_vma_policy(vma, addr, 0, &ilx); folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); mpol_cond_put(mpol); =20 if (page_allocated) @@ -725,7 +718,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, /* Ok, do the async read-ahead now */ folio =3D swap_cache_alloc_folio( swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); if (!folio) continue; if (page_allocated) { @@ -743,7 +736,7 @@ struct 
folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, skip: /* The page was likely read above, so no need for plugging here */ folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); if (unlikely(page_allocated)) swap_read_folio(folio, NULL); return folio; @@ -838,7 +831,7 @@ static struct folio *swap_vma_readahead(swp_entry_t tar= g_entry, gfp_t gfp_mask, pte_unmap(pte); pte =3D NULL; folio =3D swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx, - &page_allocated, false); + &page_allocated); if (!folio) continue; if (page_allocated) { @@ -858,7 +851,7 @@ static struct folio *swap_vma_readahead(swp_entry_t tar= g_entry, gfp_t gfp_mask, skip: /* The folio was likely read above, so no need for plugging here */ folio =3D swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx, - &page_allocated, false); + &page_allocated); if (unlikely(page_allocated)) swap_read_folio(folio, NULL); return folio; diff --git a/mm/zswap.c b/mm/zswap.c index a7a2443912f4..d8a33db9d3cc 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1015,7 +1015,7 @@ static int zswap_writeback_entry(struct zswap_entry *= entry, =20 mpol =3D get_task_policy(current); folio =3D swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol, - NO_INTERLEAVE_INDEX, &folio_was_allocated, true); + NO_INTERLEAVE_INDEX, &folio_was_allocated); put_swap_device(si); if (!folio) return -ENOMEM; --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pl1-f170.google.com (mail-pl1-f170.google.com [209.85.214.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 62B10350D47 for ; Wed, 29 Oct 2025 16:00:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753608; cv=none; 
b=WmpSL+T0KIjtffzCnfi3UPDTtDMx9T0H6xS9EGf+WVOsAfDMz+Bu32ACqAgzXujlgcRg2EA9lKrpypuMynR1Mh0GZdRh+rSkf+4f21H/QDynoB3Ro59fcIknQfcRBSl8G6UD8o1DclUhwlZ6X6fnmBdBlso+Xl8WF2KLVjVfE9g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753608; c=relaxed/simple; bh=tksYFpw+Fx21aPVs3ud94B2IERIJ9qKkhy9qUyWaDMU=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=TPyNIQr221H4gjS7OsIPMuAC39ATQsMrYoecLwGy9kzPidjAB72MyNXq0FRb9uRa/2GuHvOZeKXpnqpeh9Y6Q0aCcwg/IpLfC74aUWKNPhm5hA/vYPOSNa/vvYjT9jk7U8TJ4tE+NG9PkSgk8dtlgmwICM+IkgQxd2EY2MOgrgM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=LWL6X3yb; arc=none smtp.client-ip=209.85.214.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="LWL6X3yb" Received: by mail-pl1-f170.google.com with SMTP id d9443c01a7336-29292eca5dbso96892835ad.0 for ; Wed, 29 Oct 2025 09:00:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753605; x=1762358405; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=G0UdXGAMewCWDAEEMa7ZAr4h4jR5pcjvof6VKpTbof4=; b=LWL6X3ybsBDSFCwUoXdXeZQAfNK/Flyqf9hl5eV3dmJtJ0HJxHATRk1TsYQTx7/vOY qveLmTCnLA8GgNdn+C+bTrJsOBrgC/5ro8NXJmdHXFYL1nXk7894/ygTb+53eQ2K1fYG bVBkWHOLLfJraN4db5jIp+2XWRzBU+uVK8hec7Lty9sjP6APzGl4ZRNBxEx7p8gYQ4SD 1W88yAx5nmalH0JE9x89AnJyj7VNa+lzVknUZiSrPyAAaqjmdv7dTnznKrJb1WwVnyPw Mrlfv9IN49FtJc0ofA6eNXNiDaeBzxez0P1QxLw3TWY0Onm88tKfMPNXX9YReK8lHBF5 rWNA== 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753605; x=1762358405; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=G0UdXGAMewCWDAEEMa7ZAr4h4jR5pcjvof6VKpTbof4=; b=jw+EWtU3PYvIHJMxsf1QNisCanpkEORV+jx6oy3ijd4aVRmwV6iVBmM+EO2H2CcAyU uFpwcj5utzacDjXJ2HYaLj/EpW1VoRanL7MUeVNaO1WoqeST/KitiAPKyHpIeLincLdd nnMrU4qB3HqfIUyV3hManFM/SodlPnviFku9GnNCn5a0kTjhVlaUm+mQMquafEqtYkSU HhxT6UVv8a3OOyIVOJkCiV0XSNHuBBNg3s/re52s7Mo14Cn6RGXshYN5w3xDh+HVLsU8 zOXn3YWVE18K6lAsBK9rBYQ5sNBOV6gZe+K50Y2J6PvbuNsQ2aqXVgaat3OaA68TOhKo mY6g== X-Forwarded-Encrypted: i=1; AJvYcCXaOGhDAL+cjVXD4keISbGQkCnzt2ZN4Bszg9QT1VZHgEG66RR+0DpXKm86VRXbcPQlc089HA/cuzbK4HA=@vger.kernel.org X-Gm-Message-State: AOJu0Yys8Wp6QQb2aC79zlEw3HYpuMuvv1e72b+HirexBwKX7GmVXhUv TfOerftgM8FHjS1ibQl895B+vI3XmwDE5aVT5uJkkyh8At+VRzfuZrSLriARspWaMR8= X-Gm-Gg: ASbGnct/OjWe+dJ5ivvdVdNz1ngBMeSr/AndsG2YkH9EA6YfDETt2dUWPlTmaTy0yf0 yhizPySl0FJRBebeVhY9rsEL8EIChYYkk0QJh/XUOfQR8Kzttq2iQWMwnbjQEnYwrS1+0DiyW8w fzubMPtN+/bvgrLZRO4VoppLTiDsCGiCkBWlr60WIPYyM7htDG9REzeWzsgAFhSMsp+RCq2kFWg Qe7uOHLuGKUYo6sITv+7bvui/8fWhCAvoweoXiO9qQvnW90XKZuic3BqYFief4tgdUrMg2Rz3b/ XFiUuIW+g5W4a2X3/jOZXj+YPzDApHyzu2Yz1lh1jtKK2+S1BRWQoiactatftrzWaKrSPKHK9nG IZHoXBGG/jTYuVKJPyQtz0bu/8HhfsO1QR1txGJ2CPDbKbHYTd0oPVR3hFa8HmyI3mx2B7dxY6r xDJwsHeHXi1UKVrN3ITKEB X-Google-Smtp-Source: AGHT+IHuii9ioIbwdwvLXCUqFBOVCyosr5WYGJ1cI0MxYhGX4CKgA31FD6TOAKWXvkOu5RGv/fc9/A== X-Received: by 2002:a17:902:d4c3:b0:290:94ed:1841 with SMTP id d9443c01a7336-294deedcf83mr48288835ad.41.1761753604431; Wed, 29 Oct 2025 09:00:04 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.08.59.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 09:00:03 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:40 +0800 
Subject: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20251029-swap-table-p2-v1-14-3d43f3b6ec32@tencent.com>
References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park, Hugh Dickins, Baolin Wang, "Huang, Ying", Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song
X-Mailer: b4 0.14.3

From: Kairui Song

The current swap entry allocation/freeing workflow has never had a clear
definition. This makes it hard to debug or add new optimizations.

This commit introduces a proper definition of how swap entries are
allocated and freed. Now, most operations are folio based, so they will
never exceed one swap cluster, and we now have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with the newly added sanity checks. It also makes more
optimizations possible.

Swap entries will mostly be allocated and freed with a folio bound to
them. The folio lock will be useful for resolving many swap-related
races.

Now swap allocation (except hibernation) always starts with a folio in
the swap cache, and entries get duped/freed under the protection of the
folio lock:

- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.
  This allocates one or a set of contiguous swap slots for a folio and
  binds them to the folio by adding the folio to the swap cache. The
  swap slots' swap count starts at zero.

- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This increases the ref count of swap entries allocated to a folio.
  Newly allocated swap slots' count has to be increased by this helper
  as the folio gets unmapped (and swap entries get installed).

- folio_put_swap() - Decrease the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This decreases the ref count of swap entries allocated to a folio.
  Typically, swapin will decrease the swap count as the folio gets
  installed back and the swap entry gets uninstalled.

  This won't remove the folio from the swap cache or free the slot.
  Lazy freeing of swap cache is helpful for reducing IO. There is
  already a folio_free_swap() for immediate cache reclaim. This part
  could be further optimized later.

The above locking constraints could be further relaxed when the swap
table is fully implemented. Currently dup still needs the caller to lock
the swap entry container (e.g. PTL), or a concurrent zap may underflow
the swap count.

Some swap users need to interact with the swap count without involving a
folio (e.g. forking/zapping the page table or mapping truncate without
swapin). In such cases, the caller has to ensure there is no race
condition on whatever owns the swap count and use the helpers below:

- swap_put_entries_direct() - Decrease the swap count directly.
  Context: The caller must lock whatever is referencing the slots to
  avoid a race.
  Typically the page table zapping or shmem mapping truncate will need
  to free swap slots directly. If a slot is cached (has a folio bound),
  this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly.
  Context: The caller must lock whatever is referencing the entries to
  avoid a race, and the entries must already have a swap count > 1.
Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table. The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed. Hibernation subsystem is a bit different, so two special wrappers are here: - swap_alloc_hibernation_slot() - Allocate one entry from one device. - swap_free_hibernation_slot() - Free one entry allocated by the above helper. All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines. By separating the workflows, it will be possible to bind folio more tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary pin. This commit should not introduce any behavior change Signed-off-by: Kairui Song Suggested-by: Chris Li --- arch/s390/mm/pgtable.c | 2 +- include/linux/swap.h | 58 +++++++++---------- kernel/power/swap.c | 10 ++-- mm/madvise.c | 2 +- mm/memory.c | 15 +++-- mm/rmap.c | 7 ++- mm/shmem.c | 10 ++-- mm/swap.h | 37 +++++++++++++ mm/swapfile.c | 148 ++++++++++++++++++++++++++++++++++-----------= ---- 9 files changed, 192 insertions(+), 97 deletions(-) diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index 0fde20bbc50b..c51304a4418e 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -692,7 +692,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, s= wp_entry_t entry) =20 dec_mm_counter(mm, mm_counter(folio)); } - free_swap_and_cache(entry); + swap_put_entries_direct(entry, 1); } =20 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr, diff --git a/include/linux/swap.h b/include/linux/swap.h index 69025b473472..ac3caa4c6999 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void) } =20 extern void si_swapinfo(struct sysinfo *); -int folio_alloc_swap(struct folio *folio); -bool folio_free_swap(struct folio *folio); void put_swap_folio(struct 
folio *folio, swp_entry_t entry); -extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern int swap_duplicate_nr(swp_entry_t entry, int nr); -extern void swap_free_nr(swp_entry_t entry, int nr_pages); -extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); extern unsigned int count_swap_pages(int, int); @@ -472,6 +466,29 @@ struct backing_dev_info; extern struct swap_info_struct *get_swap_device(swp_entry_t entry); sector_t swap_folio_sector(struct folio *folio); =20 +/* + * If there is an existing swap slot reference (swap entry) and the caller + * guarantees that there is no race modification of it (e.g., PTL + * protecting the swap entry in page table; shmem's cmpxchg protects t + * he swap entry in shmem mapping), these two helpers below can be used + * to put/dup the entries directly. + * + * All entries must be allocated by folio_alloc_swap(). And they must have + * a swap count > 1. See comments of folio_*_swap helpers for more info. + */ +int swap_dup_entry_direct(swp_entry_t entry); +void swap_put_entries_direct(swp_entry_t entry, int nr); + +/* + * folio_free_swap tries to free the swap entries pinned by a swap cache + * folio, it has to be here to be called by other components. 
+ */ +bool folio_free_swap(struct folio *folio); + +/* Allocate / free (hibernation) exclusive entries */ +swp_entry_t swap_alloc_hibernation_slot(int type); +void swap_free_hibernation_slot(swp_entry_t entry); + static inline void put_swap_device(struct swap_info_struct *si) { percpu_ref_put(&si->users); @@ -499,10 +516,6 @@ static inline void put_swap_device(struct swap_info_st= ruct *si) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr)); =20 -static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr) -{ -} - static inline void free_swap_cache(struct folio *folio) { } @@ -512,12 +525,12 @@ static inline int add_swap_count_continuation(swp_ent= ry_t swp, gfp_t gfp_mask) return 0; } =20 -static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages) +static inline int swap_dup_entry_direct(swp_entry_t ent) { return 0; } =20 -static inline void swap_free_nr(swp_entry_t entry, int nr_pages) +static inline void swap_put_entries_direct(swp_entry_t ent, int nr) { } =20 @@ -541,11 +554,6 @@ static inline int swp_swapcount(swp_entry_t entry) return 0; } =20 -static inline int folio_alloc_swap(struct folio *folio) -{ - return -EINVAL; -} - static inline bool folio_free_swap(struct folio *folio) { return false; @@ -558,22 +566,6 @@ static inline int add_swap_extent(struct swap_info_str= uct *sis, return -EINVAL; } #endif /* CONFIG_SWAP */ - -static inline int swap_duplicate(swp_entry_t entry) -{ - return swap_duplicate_nr(entry, 1); -} - -static inline void free_swap_and_cache(swp_entry_t entry) -{ - free_swap_and_cache_nr(entry, 1); -} - -static inline void swap_free(swp_entry_t entry) -{ - swap_free_nr(entry, 1); -} - #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/kernel/power/swap.c b/kernel/power/swap.c index 0beff7eeaaba..546a0c701970 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -179,10 +179,10 @@ sector_t alloc_swapdev_block(int swap) { unsigned long offset; 
=20 - offset =3D swp_offset(get_swap_page_of_type(swap)); + offset =3D swp_offset(swap_alloc_hibernation_slot(swap)); if (offset) { if (swsusp_extents_insert(offset)) - swap_free(swp_entry(swap, offset)); + swap_free_hibernation_slot(swp_entry(swap, offset)); else return swapdev_block(swap, offset); } @@ -197,6 +197,7 @@ sector_t alloc_swapdev_block(int swap) =20 void free_all_swap_pages(int swap) { + unsigned long offset; struct rb_node *node; =20 while ((node =3D swsusp_extents.rb_node)) { @@ -204,8 +205,9 @@ void free_all_swap_pages(int swap) =20 ext =3D rb_entry(node, struct swsusp_extent, node); rb_erase(node, &swsusp_extents); - swap_free_nr(swp_entry(swap, ext->start), - ext->end - ext->start + 1); + + for (offset =3D ext->start; offset <=3D ext->end; offset++) + swap_free_hibernation_slot(swp_entry(swap, offset)); =20 kfree(ext); } diff --git a/mm/madvise.c b/mm/madvise.c index fb1c86e630b6..3cf2097d2085 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -697,7 +697,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned = long addr, max_nr =3D (end - addr) / PAGE_SIZE; nr =3D swap_pte_batch(pte, max_nr, ptent); nr_swap -=3D nr; - free_swap_and_cache_nr(entry, nr); + swap_put_entries_direct(entry, nr); clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { diff --git a/mm/memory.c b/mm/memory.c index 589d6fc3d424..27d91ae3648a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -933,7 +933,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm= _struct *src_mm, swp_entry_t entry =3D pte_to_swp_entry(orig_pte); =20 if (likely(!non_swap_entry(entry))) { - if (swap_duplicate(entry) < 0) + if (swap_dup_entry_direct(entry) < 0) return -EIO; =20 /* make sure dst_mm is on swapoff's mmlist.
*/ @@ -1746,7 +1746,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gath= er *tlb, =20 nr =3D swap_pte_batch(pte, max_nr, ptent); rss[MM_SWAPENTS] -=3D nr; - free_swap_and_cache_nr(entry, nr); + swap_put_entries_direct(entry, nr); } else if (is_migration_entry(entry)) { struct folio *folio =3D pfn_swap_entry_folio(entry); =20 @@ -4932,7 +4932,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* * Some architectures may have to restore extra metadata to the page * when reading from swap. This metadata may be indexed by swap entry - * so this must be called before swap_free(). + * so this must be called before folio_put_swap(). */ arch_swap_restore(folio_swap(entry, folio), folio); =20 @@ -4970,6 +4970,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (unlikely(folio !=3D swapcache)) { folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); + folio_put_swap(swapcache, NULL); } else if (!folio_test_anon(folio)) { /* * We currently only expect !anon folios that are fully @@ -4978,9 +4979,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) !=3D nr_pages, folio); VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); + folio_put_swap(folio, NULL); } else { + VM_WARN_ON_ONCE(nr_pages !=3D 1 && nr_pages !=3D folio_nr_pages(folio)); folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, - rmap_flags); + rmap_flags); + folio_put_swap(folio, nr_pages =3D=3D 1 ? page : NULL); } =20 VM_BUG_ON(!folio_test_anon(folio) || @@ -4994,7 +4998,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * swapcache. Do it after mapping so any raced page fault will * see the folio in swap cache and wait for us. 
*/ - swap_free_nr(entry, nr_pages); if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags)) folio_free_swap(folio); =20 @@ -5004,7 +5007,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * Hold the lock to avoid the swap entry to be reused * until we take the PT lock for the pte_same() check * (to avoid false positives from pte_same). For - * further safety release the lock after the swap_free + * further safety release the lock after the folio_put_swap * so that the swap count won't change under a * parallel locked swapcache. */ diff --git a/mm/rmap.c b/mm/rmap.c index 1954c538a991..844864831797 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -82,6 +82,7 @@ #include =20 #include "internal.h" +#include "swap.h" =20 static struct kmem_cache *anon_vma_cachep; static struct kmem_cache *anon_vma_chain_cachep; @@ -2146,7 +2147,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, goto discard; } =20 - if (swap_duplicate(entry) < 0) { + if (folio_dup_swap(folio, subpage) < 0) { set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } @@ -2157,7 +2158,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, * so we'll not check/care. */ if (arch_unmap_one(mm, vma, address, pteval) < 0) { - swap_free(entry); + folio_put_swap(folio, subpage); set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } @@ -2165,7 +2166,7 @@ static bool try_to_unmap_one(struct folio *folio, str= uct vm_area_struct *vma, /* See folio_try_share_anon_rmap(): clear PTE first. 
*/ if (anon_exclusive && folio_try_share_anon_rmap_pte(folio, subpage)) { - swap_free(entry); + folio_put_swap(folio, subpage); set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } diff --git a/mm/shmem.c b/mm/shmem.c index 46d54a1288fd..5e6cb763d945 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -982,7 +982,7 @@ static long shmem_free_swap(struct address_space *mappi= ng, old =3D xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0); if (old !=3D radswap) return 0; - free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order); + swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order); =20 return 1 << order; } @@ -1665,7 +1665,7 @@ int shmem_writeout(struct folio *folio, struct swap_i= ocb **plug, spin_unlock(&shmem_swaplist_lock); } =20 - swap_duplicate_nr(folio->swap, nr_pages); + folio_dup_swap(folio, NULL); shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap)); =20 BUG_ON(folio_mapped(folio)); @@ -1686,7 +1686,7 @@ int shmem_writeout(struct folio *folio, struct swap_i= ocb **plug, /* Swap entry might be erased by racing shmem_free_swap() */ if (!error) { shmem_recalc_inode(inode, 0, -nr_pages); - swap_free_nr(folio->swap, nr_pages); + folio_put_swap(folio, NULL); } =20 /* @@ -2172,6 +2172,7 @@ static void shmem_set_folio_swapin_error(struct inode= *inode, pgoff_t index, =20 nr_pages =3D folio_nr_pages(folio); folio_wait_writeback(folio); + folio_put_swap(folio, NULL); swap_cache_del_folio(folio); /* * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks @@ -2179,7 +2180,6 @@ static void shmem_set_folio_swapin_error(struct inode= *inode, pgoff_t index, * in shmem_evict_inode(). 
*/ shmem_recalc_inode(inode, -nr_pages, -nr_pages); - swap_free_nr(swap, nr_pages); } =20 static int shmem_split_large_entry(struct inode *inode, pgoff_t index, @@ -2401,9 +2401,9 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, if (sgp =3D=3D SGP_WRITE) folio_mark_accessed(folio); =20 + folio_put_swap(folio, NULL); swap_cache_del_folio(folio); folio_mark_dirty(folio); - swap_free_nr(swap, nr_pages); put_swap_device(si); =20 *foliop =3D folio; diff --git a/mm/swap.h b/mm/swap.h index a3c5f2dca0d5..74c61129d7b7 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -183,6 +183,28 @@ static inline void swap_cluster_unlock_irq(struct swap= _cluster_info *ci) spin_unlock_irq(&ci->lock); } =20 +/* + * Below are the core routines for doing swap for a folio. + * All helpers require the folio to be locked, and a locked folio + * in the swap cache pins the swap entries / slots allocated to the + * folio. Swap relies heavily on the swap cache and folio lock for + * synchronization. + * + * folio_alloc_swap(): the entry point for a folio to be swapped + * out. It allocates swap slots and pins the slots with swap cache. + * The slots start with a swap count of zero. + * + * folio_dup_swap(): increases the swap count of a folio, usually + * when it gets unmapped and a swap entry is installed to replace + * it (e.g., swap entry in page table). A swap slot with swap + * count =3D=3D 0 should only be increased by this helper. + * + * folio_put_swap(): does the opposite of folio_dup_swap().
+ */ +int folio_alloc_swap(struct folio *folio); +int folio_dup_swap(struct folio *folio, struct page *subpage); +void folio_put_swap(struct folio *folio, struct page *subpage); + /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; @@ -363,9 +385,24 @@ static inline struct swap_info_struct *__swap_entry_to= _info(swp_entry_t entry) return NULL; } =20 +static inline int folio_alloc_swap(struct folio *folio) +{ + return -EINVAL; +} + +static inline int folio_dup_swap(struct folio *folio, struct page *page) +{ + return -EINVAL; +} + +static inline void folio_put_swap(struct folio *folio, struct page *page) +{ +} + static inline void swap_read_folio(struct folio *folio, struct swap_iocb *= *plug) { } + static inline void swap_write_unplug(struct swap_iocb *sio) { } diff --git a/mm/swapfile.c b/mm/swapfile.c index 415db36d85d3..426b0b6d583f 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -58,6 +58,9 @@ static void swap_entries_free(struct swap_info_struct *si, swp_entry_t entry, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); +static bool swap_entries_put_map(struct swap_info_struct *si, + swp_entry_t entry, int nr); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -1467,6 +1470,12 @@ int folio_alloc_swap(struct folio *folio) */ WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); =20 + /* + * Allocator should always allocate aligned entries so folio based + * operations never cross more than one cluster. + */ + VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio); + return 0; =20 out_free: @@ -1474,6 +1483,62 @@ int folio_alloc_swap(struct folio *folio) return -ENOMEM; } =20 +/** + * folio_dup_swap() - Increase swap count of swap entries of a folio.
+ * @folio: folio with swap entries bound. + * @subpage: if not NULL, only increase the swap count of this subpage. + * + * Context: Caller must ensure the folio is locked and in the swap cache. + * The caller also has to ensure there is no racing call to + * swap_put_entries_direct before this helper returns, or the swap + * map may underflow (TODO: maybe we should allow or avoid underflow to + * make swap refcount lockless). + */ +int folio_dup_swap(struct folio *folio, struct page *subpage) +{ + int err =3D 0; + swp_entry_t entry =3D folio->swap; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); + + if (subpage) { + entry.val +=3D folio_page_idx(folio, subpage); + nr_pages =3D 1; + } + + while (!err && __swap_duplicate(entry, 1, nr_pages) =3D=3D -ENOMEM) + err =3D add_swap_count_continuation(entry, GFP_ATOMIC); + + return err; +} + +/** + * folio_put_swap() - Decrease swap count of swap entries of a folio. + * @folio: folio with swap entries bound, must be in swap cache and lock= ed. + * @subpage: if not NULL, only decrease the swap count of this subpage. + * + * This won't free the swap slots even if swap count drops to zero, they a= re + * still pinned by the swap cache. User may call folio_free_swap to free t= hem. + * Context: Caller must ensure the folio is locked and in the swap cache.
+ */ +void folio_put_swap(struct folio *folio, struct page *subpage) +{ + swp_entry_t entry =3D folio->swap; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); + + if (subpage) { + entry.val +=3D folio_page_idx(folio, subpage); + nr_pages =3D 1; + } + + swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages); +} + static struct swap_info_struct *_swap_info_get(swp_entry_t entry) { struct swap_info_struct *si; @@ -1714,28 +1779,6 @@ static void swap_entries_free(struct swap_info_struc= t *si, partial_free_cluster(si, ci); } =20 -/* - * Caller has made sure that the swap device corresponding to entry - * is still around or has not been recycled. - */ -void swap_free_nr(swp_entry_t entry, int nr_pages) -{ - int nr; - struct swap_info_struct *sis; - unsigned long offset =3D swp_offset(entry); - - sis =3D _swap_info_get(entry); - if (!sis) - return; - - while (nr_pages) { - nr =3D min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER= ); - swap_entries_put_map(sis, swp_entry(sis->type, offset), nr); - offset +=3D nr; - nr_pages -=3D nr; - } -} - /* * Called after dropping swapcache to decrease refcnt to swap entries. */ @@ -1924,16 +1967,19 @@ bool folio_free_swap(struct folio *folio) } =20 /** - * free_swap_and_cache_nr() - Release reference on range of swap entries a= nd - * reclaim their cache if no more references re= main. + * swap_put_entries_direct() - Release reference on range of swap entries = and + * reclaim their cache if no more references r= emain. * @entry: First entry of range. * @nr: Number of entries in range. * * For each swap entry in the contiguous range, release a reference. If an= y swap * entries become free, try to reclaim their underlying folios, if present= . The * offset range is defined by [entry.offset, entry.offset + nr). 
+ * + * Context: Caller must ensure there is no race condition on the reference + * owner. e.g., locking the PTL of a PTE containing the entry being releas= ed. */ -void free_swap_and_cache_nr(swp_entry_t entry, int nr) +void swap_put_entries_direct(swp_entry_t entry, int nr) { const unsigned long start_offset =3D swp_offset(entry); const unsigned long end_offset =3D start_offset + nr; @@ -1942,10 +1988,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int n= r) unsigned long offset; =20 si =3D get_swap_device(entry); - if (!si) + if (WARN_ON_ONCE(!si)) return; - - if (WARN_ON(end_offset > si->max)) + if (WARN_ON_ONCE(end_offset > si->max)) goto out; =20 /* @@ -1989,8 +2034,8 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) } =20 #ifdef CONFIG_HIBERNATION - -swp_entry_t get_swap_page_of_type(int type) +/* Allocate a slot for hibernation */ +swp_entry_t swap_alloc_hibernation_slot(int type) { struct swap_info_struct *si =3D swap_type_to_info(type); unsigned long offset; @@ -2020,6 +2065,27 @@ swp_entry_t get_swap_page_of_type(int type) return entry; } =20 +/* Free a slot allocated by swap_alloc_hibernation_slot */ +void swap_free_hibernation_slot(swp_entry_t entry) +{ + struct swap_info_struct *si; + struct swap_cluster_info *ci; + pgoff_t offset =3D swp_offset(entry); + + si =3D get_swap_device(entry); + if (WARN_ON(!si)) + return; + + ci =3D swap_cluster_lock(si, offset); + swap_entry_put_locked(si, ci, entry, 1); + WARN_ON(swap_entry_swapped(si, offset)); + swap_cluster_unlock(ci); + + /* In theory readahead might add it to the swap cache by accident */ + __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); + put_swap_device(si); +} + /* * Find the swap type that corresponds to given device (if any). * @@ -2181,7 +2247,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_= t *pmd, /* * Some architectures may have to restore extra metadata to the page * when reading from swap. 
This metadata may be indexed by swap entry - * so this must be called before swap_free(). + * so this must be called before folio_put_swap(). */ arch_swap_restore(folio_swap(entry, folio), folio); =20 @@ -2222,7 +2288,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_= t *pmd, new_pte =3D pte_mkuffd_wp(new_pte); setpte: set_pte_at(vma->vm_mm, addr, pte, new_pte); - swap_free(entry); + folio_put_swap(folio, page); out: if (pte) pte_unmap_unlock(pte, ptl); @@ -3725,28 +3791,22 @@ static int __swap_duplicate(swp_entry_t entry, unsi= gned char usage, int nr) return err; } =20 -/** - * swap_duplicate_nr() - Increase reference count of nr contiguous swap en= tries - * by 1. - * +/* + * swap_dup_entry_direct() - Increase reference count of a swap entry by o= ne. * @entry: first swap entry from which we want to increase the refcount. - * @nr: Number of entries in range. * * Returns 0 for success, or -ENOMEM if a swap_count_continuation is requi= red * but could not be atomically allocated. Returns 0, just as if it succee= ded, * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), wh= ich * might occur if a page table entry has got corrupted. * - * Note that we are currently not handling the case where nr > 1 and we ne= ed to - * add swap count continuation. This is OK, because no such user exists - = shmem - * is the only user that can pass nr > 1, and it never re-duplicates any s= wap - * entry it owns. + * Context: Caller must ensure there is no race condition on the reference + * owner. e.g., locking the PTL of a PTE containing the entry being increa= sed. 
*/ -int swap_duplicate_nr(swp_entry_t entry, int nr) +int swap_dup_entry_direct(swp_entry_t entry) { int err =3D 0; - - while (!err && __swap_duplicate(entry, 1, nr) =3D=3D -ENOMEM) + while (!err && __swap_duplicate(entry, 1, 1) =3D=3D -ENOMEM) err =3D add_swap_count_continuation(entry, GFP_ATOMIC); return err; } --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pj1-f53.google.com (mail-pj1-f53.google.com [209.85.216.53]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 84664342C9D for ; Wed, 29 Oct 2025 16:00:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.53 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753612; cv=none; b=r9VHK/+kkfviTiLvUyZVLoblJWkYdVS3s/ZCb1+c9TV8YCdxlhmVkQcS70zdr1EBUKk7EflmaRZXNbhBt9Ww/wFqmT30gm8CTE+dfWCjT5bgdmy1fCBzCqIdF1GJ7YlyDw9kvUx3x8qtBEY9hROxP5S92fdkXwBPNc8IUuKe5/M= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753612; c=relaxed/simple; bh=Jiym8GsRWceIiEut1BkiXE6jjLZmz8McJhg10hT1edw=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=GLNOY/Wjncc5N3aKbY9btUtBLKHXJ9qs4kCUj+S04MnlWG/fGypQRCxMF7Nta9hCXGo6ugxhUVjyat0eGMzyB7S7pAZ4cf9HeL1SCvTOL89jHKQ6s31UwXpeARxjm6dq2Zj6KcWVxN7yKmaRUCME+smRAwCBE3rKGjlbQNXgp/Q= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=I8tlLP/V; arc=none smtp.client-ip=209.85.216.53 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com 
header.i=@gmail.com header.b="I8tlLP/V" Received: by mail-pj1-f53.google.com with SMTP id 98e67ed59e1d1-33d463e79ddso43500a91.0 for ; Wed, 29 Oct 2025 09:00:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753610; x=1762358410; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=WOSfsWO395+hyZyDR55RbtfmCUyV+LYQE7dhSnv1egg=; b=I8tlLP/VbZKyfat7efKbZZGzCarw+df9dYDe8b6A6VbuEnIHWv6DSKhrCNazQ3PUK9 aFlYnijbD+E2dcQkd+w92RMdv4sv93UinfjJCRKhkAUCZTyYV01oCZAg/+asH+ihnj/l RTz7dY6h9pauNUFR5vu/o8ujpAULES1akPJW1Zh68J7q2VpHwbHDJbYzTz+QbukfI9c/ Zwr+0m2R0pQknMCdaocB4pGYyXItB5cXm2QZxca6D20mrFMcdTRUhM4MQ3VMr7i/9CrT 8nGCrraTykuwLLfVG5+lYEeiTNdi9NZrQMm+aR2X3w95qFfkBnS1G41dguiYUt9OmHM5 ho0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753610; x=1762358410; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=WOSfsWO395+hyZyDR55RbtfmCUyV+LYQE7dhSnv1egg=; b=hOm2/JwD33eRCmT/MWCOu47S6t3Fc2RCFOjPlfdKEebosabmHivn14E2YYyKV5HQIw fjQvlllGfH0j1A8FeBxARDUFQED2Eyw304wrIK/no3KU1JXFQbEUhS+dzhjW8t2j5pMY WlcvPCCqbdaHXfBxX2yAuLIoHXAexV5qCDxWyuwVGy8yNAKHXAGs0scWPSL7jyazCHsI lrb0qzozNuqwdtIJ+Flo0svtjHQhgblA5sq3RmrHaCNPivul94jWtzLFS8Skw1Afs2nm Y1/dqzCAX7CeNI+0eSRNtFh8KiH8S6BG9MsY4fLCQzzx99IkNrZeJFHRCwPig+hY8ckP CO2A== X-Forwarded-Encrypted: i=1; AJvYcCWtBBDJPSnX/WIWtU7u4jBCQbdDguGyr77aQxM+ckqyRJHxoDOt52yn7Fxmj55XxD/8sfCBghDnvyxvQbY=@vger.kernel.org X-Gm-Message-State: AOJu0YwYUQdnqDIHy8HhvLCKO8hSMdq3nOVBUmI05YQGI9AARxakg5Z9 M/fCxan/PfYCju5gAGyQ+zP/zZWPXpNYv9zKn1eUnaymMF3W+5Yr6Re8 X-Gm-Gg: ASbGncsb4QEr7ak6LdBg3/jEkJ/F4/uh2CnzkMOLi3c3UNUjBeh5fNYAhCXz1ee5OTE N8tJ8pTlN/Sa0S6y7pRk7hMb7LiEG2H6Vr25gEm6FQ5jOFAvlLy1JGxZGA8jh2KlMLsVlB4BDlv 
M1wggD7V2B8jjs/RQmos9gAWmJVTottNIkB6pudtdpxxi0gOmw56TWVOpj4zoEaI+mc1pUSyWKH BrY6DKLjSw4XLJE9IW6nxLP7nK37PWF5sZf5AmoGDX537a+C1e2Vto7bIihJOZ9jV4gJZ5u7MiI HcCNjGIxXhBd3D1ivybIA07f8O1H/Dp8Zx7r+igZ8+yvBvxOavFE7lbaR0AhoDudkuBsClL15IM baRvlpWdVjA9uBfBcvUURRJ/xQI4pWBWh07mwNBL8p8wgfejMCzyoalTdd6VZnUvdzTzueAi57s hyr29D37RTsRnrzI6dMeqa X-Google-Smtp-Source: AGHT+IHdMKk0xrUMkpxGL9OB/4Oy3R/jyTgK3C9xNhiIjy3DnteEJ3lW0AdeKL6BPVPFkxO3dw9Q8Q== X-Received: by 2002:a17:90b:2687:b0:335:2eef:4ca8 with SMTP id 98e67ed59e1d1-3403a303662mr4108145a91.33.1761753609476; Wed, 29 Oct 2025 09:00:09 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.09.00.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 09:00:08 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:41 +0800 Subject: [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-15-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion. This pinning usage here can be dropped by adding the folio to swap cache directly on allocation. 
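For illustration only, the idea of claiming slots by inserting the folio directly can be modelled in userspace roughly as below. This is a toy sketch, not kernel code: every toy_* name, the 8-slot cluster, and the boolean standing in for the cluster lock are assumptions made up for this example. It only shows a free aligned range being found and the folio being installed inside one critical section, with the swap count left at zero and no separate pin byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CLUSTER_SLOTS 8

/* Hypothetical stand-ins for illustration; none of these are kernel types. */
struct toy_folio {
	unsigned int nr_pages;	/* assumed to be a power of two */
	long first_slot;	/* -1 until slots are assigned */
};

struct toy_cluster {
	bool locked;				    /* stands in for the cluster lock */
	unsigned char count[TOY_CLUSTER_SLOTS];	    /* "swap map": per-slot swap count */
	struct toy_folio *table[TOY_CLUSTER_SLOTS]; /* "swap table": cached folio per slot */
};

static void toy_cluster_init(struct toy_cluster *ci)
{
	ci->locked = false;
	for (size_t i = 0; i < TOY_CLUSTER_SLOTS; i++) {
		ci->count[i] = 0;
		ci->table[i] = NULL;
	}
}

static void toy_cluster_lock(struct toy_cluster *ci)
{
	assert(!ci->locked);
	ci->locked = true;
}

static void toy_cluster_unlock(struct toy_cluster *ci)
{
	assert(ci->locked);
	ci->locked = false;
}

/* A slot is free when it has no swap count and no folio in the table. */
static bool toy_range_is_free(const struct toy_cluster *ci,
			      unsigned int start, unsigned int nr)
{
	for (unsigned int i = start; i < start + nr; i++)
		if (ci->count[i] || ci->table[i])
			return false;
	return true;
}

/*
 * Scan for a free, naturally aligned range and install the folio into
 * the table before the lock is dropped. The table entry itself marks
 * the slots as in use; the swap count stays at zero.
 * Returns the first slot index, or -1 if nothing fits.
 */
static long toy_folio_alloc_swap(struct toy_cluster *ci, struct toy_folio *folio)
{
	unsigned int nr = folio->nr_pages;
	long found = -1;

	if (nr == 0 || nr > TOY_CLUSTER_SLOTS)
		return -1;

	toy_cluster_lock(ci);
	for (unsigned int start = 0; start + nr <= TOY_CLUSTER_SLOTS; start += nr) {
		if (!toy_range_is_free(ci, start, nr))
			continue;
		for (unsigned int i = start; i < start + nr; i++)
			ci->table[i] = folio;
		folio->first_slot = (long)start;
		found = (long)start;
		break;
	}
	toy_cluster_unlock(ci);
	return found;
}
```

In this model a caller would initialise a cluster with toy_cluster_init() and pass each folio to toy_folio_alloc_swap(); there is no point at which a slot is reserved but has no folio attached, which is the window the change aims to close.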
All swap allocations are folio-based now (except for hibernation), so the swap allocator can always take the folio as the parameter. And now that both the swap cache (swap table) and the swap map are protected by the cluster lock, scanning the map and inserting the folio can be done in the same critical section. This eliminates the time window in which a slot is pinned by SWAP_HAS_CACHE but has no cache, and avoids touching the lock multiple times. This is both a cleanup and an optimization. Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 5 -- mm/swap.h | 8 +-- mm/swap_state.c | 56 +++++++++++------- mm/swapfile.c | 161 +++++++++++++++++++++--------------------------= ---- 4 files changed, 105 insertions(+), 125 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ac3caa4c6999..4b4b81fbc6a3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void) } =20 extern void si_swapinfo(struct sysinfo *); -void put_swap_folio(struct folio *folio, swp_entry_t entry); extern int add_swap_count_continuation(swp_entry_t, gfp_t); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); @@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t= ent, int nr) { } =20 -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp) -{ -} - static inline int __swap_count(swp_entry_t entry) { return 0; diff --git a/mm/swap.h b/mm/swap.h index 74c61129d7b7..03694ffa662f 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct= *si, */ struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, - void **shadow, bool alloc); void swap_cache_del_folio(struct folio *folio); struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol,
pgoff_t ilx, bool *alloced); /* Below helpers require the caller to lock and pass in the swap cluster. = */ +void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry); void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow); void __swap_cache_replace_folio(struct swap_cluster_info *ci, @@ -459,8 +459,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t e= ntry) return NULL; } =20 -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t en= try, - void **shadow, bool alloc) +static inline void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry) { } =20 diff --git a/mm/swap_state.c b/mm/swap_state.c index d2bcca92b6e0..85d9f99c384f 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -122,6 +122,34 @@ void *swap_cache_get_shadow(swp_entry_t entry) return NULL; } =20 +void __swap_cache_add_folio(struct swap_cluster_info *ci, + struct folio *folio, swp_entry_t entry) +{ + unsigned long new_tb; + unsigned int ci_start, ci_off, ci_end; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); + + new_tb =3D folio_to_swp_tb(folio); + ci_start =3D swp_cluster_offset(entry); + ci_off =3D ci_start; + ci_end =3D ci_start + nr_pages; + do { + VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off))); + __swap_table_set(ci, ci_off, new_tb); + } while (++ci_off < ci_end); + + folio_ref_add(folio, nr_pages); + folio_set_swapcache(folio); + folio->swap =3D entry; + + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); +} + /** * swap_cache_add_folio - Add a folio into the swap cache. * @folio: The folio to be added.
@@ -136,23 +164,18 @@ void *swap_cache_get_shadow(swp_entry_t entry) * The caller also needs to update the corresponding swap_map slots with * SWAP_HAS_CACHE bit to avoid race or conflict. */ -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, - void **shadowp, bool alloc) +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, + void **shadowp) { int err; void *shadow =3D NULL; + unsigned long old_tb; struct swap_info_struct *si; - unsigned long old_tb, new_tb; struct swap_cluster_info *ci; unsigned int ci_start, ci_off, ci_end, offset; unsigned long nr_pages =3D folio_nr_pages(folio); =20 - VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); - VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); - VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); - si =3D __swap_entry_to_info(entry); - new_tb =3D folio_to_swp_tb(folio); ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; ci_off =3D ci_start; @@ -168,7 +191,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry= _t entry, err =3D -EEXIST; goto failed; } - if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset))= )) { + if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) { err =3D -ENOENT; goto failed; } @@ -184,20 +207,11 @@ int swap_cache_add_folio(struct folio *folio, swp_ent= ry_t entry, * Still need to pin the slots with SWAP_HAS_CACHE since * swap allocator depends on that. 
*/ - if (!alloc) - __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); - __swap_table_set(ci, ci_off, new_tb); + __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); offset++; } while (++ci_off < ci_end); - - folio_ref_add(folio, nr_pages); - folio_set_swapcache(folio); - folio->swap =3D entry; + __swap_cache_add_folio(ci, folio, entry); swap_cluster_unlock(ci); - - node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); - lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); - if (shadowp) *shadowp =3D shadow; return 0; @@ -466,7 +480,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_e= ntry_t entry, __folio_set_locked(folio); __folio_set_swapbacked(folio); for (;;) { - ret =3D swap_cache_add_folio(folio, entry, &shadow, false); + ret =3D swap_cache_add_folio(folio, entry, &shadow); if (!ret) break; =20 diff --git a/mm/swapfile.c b/mm/swapfile.c index 426b0b6d583f..8d98f28907bc 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -875,28 +875,53 @@ static void swap_cluster_assert_table_empty(struct sw= ap_cluster_info *ci, } } =20 -static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_c= luster_info *ci, - unsigned int start, unsigned char usage, - unsigned int order) +static bool cluster_alloc_range(struct swap_info_struct *si, + struct swap_cluster_info *ci, + struct folio *folio, + unsigned int offset) { - unsigned int nr_pages =3D 1 << order; + unsigned long nr_pages; + unsigned int order; =20 lockdep_assert_held(&ci->lock); =20 if (!(si->flags & SWP_WRITEOK)) return false; =20 + /* + * All mm swap allocation starts with a folio (folio_alloc_swap), + * it's also the only allocation path for large order allocation. + * Such swap slots start with count =3D=3D 0 and will be increased + * upon folio unmap. + * + * Else, it's an exclusive order 0 allocation for hibernation. + * The slot starts with count =3D=3D 1 and never increases.
+ */ + if (likely(folio)) { + order =3D folio_order(folio); + nr_pages =3D 1 << order; + /* + * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries. + * This is the legacy allocation behavior, will drop it very soon. + */ + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); + __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset)); + } else { + order =3D 0; + nr_pages =3D 1; + WARN_ON_ONCE(si->swap_map[offset]); + si->swap_map[offset] =3D 1; + swap_cluster_assert_table_empty(ci, offset, 1); + } + /* * The first allocation in a cluster makes the * cluster exclusive to this order */ if (cluster_is_empty(ci)) ci->order =3D order; - - memset(si->swap_map + start, usage, nr_pages); - swap_cluster_assert_table_empty(ci, start, nr_pages); - swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; + swap_range_alloc(si, nr_pages); =20 return true; } @@ -904,13 +929,12 @@ static bool cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster /* Try use a new cluster for current CPU and allocate from it. */ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned int order, - unsigned char usage) + struct folio *folio, unsigned long offset) { unsigned int next =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; unsigned long start =3D ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); + unsigned int order =3D likely(folio) ? 
folio_order(folio) : 0; unsigned int nr_pages =3D 1 << order; bool need_reclaim; =20 @@ -930,7 +954,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, continue; offset =3D found; } - if (!cluster_alloc_range(si, ci, offset, usage, order)) + if (!cluster_alloc_range(si, ci, folio, offset)) break; found =3D offset; offset +=3D nr_pages; @@ -952,8 +976,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, =20 static unsigned int alloc_swap_scan_list(struct swap_info_struct *si, struct list_head *list, - unsigned int order, - unsigned char usage, + struct folio *folio, bool scan_all) { unsigned int found =3D SWAP_ENTRY_INVALID; @@ -965,7 +988,7 @@ static unsigned int alloc_swap_scan_list(struct swap_in= fo_struct *si, if (!ci) break; offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, order, usage); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset); if (found) break; } while (scan_all); @@ -1026,10 +1049,11 @@ static void swap_reclaim_work(struct work_struct *w= ork) * Try to allocate swap entries with specified order and try set a new * cluster for current CPU too. */ -static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,= int order, - unsigned char usage) +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, + struct folio *folio) { struct swap_cluster_info *ci; + unsigned int order =3D likely(folio) ? 
folio_order(folio) : 0; unsigned int offset =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; =20 /* @@ -1051,8 +1075,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, - order, usage); + found =3D alloc_swap_scan_cluster(si, ci, folio, offset); } else { swap_cluster_unlock(ci); } @@ -1066,22 +1089,19 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, int o * to spread out the writes. */ if (si->flags & SWP_PAGE_DISCARD) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, order, usage, - false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); if (found) goto done; } =20 if (order < PMD_ORDER) { - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], - order, usage, true); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, = true); if (found) goto done; } =20 if (!(si->flags & SWP_PAGE_DISCARD)) { - found =3D alloc_swap_scan_list(si, &si->free_clusters, order, usage, - false); + found =3D alloc_swap_scan_list(si, &si->free_clusters, folio, false); if (found) goto done; } @@ -1097,8 +1117,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o * failure is not critical. Scanning one cluster still * keeps the list rotated and reclaimed (for HAS_CACHE). */ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], order, - usage, false); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[order], folio, fal= se); if (found) goto done; } @@ -1112,13 +1131,11 @@ static unsigned long cluster_alloc_swap_entry(struc= t swap_info_struct *si, int o * Clusters here have at least one usable slots and can't fail order 0 * allocation, but reclaim may drop si->lock and race with another user. 
*/ - found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], - 0, usage, true); + found =3D alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true); if (found) goto done; =20 - found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], - 0, usage, true); + found =3D alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true= ); if (found) goto done; } @@ -1309,12 +1326,12 @@ static bool get_swap_device_info(struct swap_info_s= truct *si) * Fast path try to get swap entries with specified order from current * CPU's swap entry pool (a cluster). */ -static bool swap_alloc_fast(swp_entry_t *entry, - int order) +static bool swap_alloc_fast(struct folio *folio) { + unsigned int order =3D folio_order(folio); struct swap_cluster_info *ci; struct swap_info_struct *si; - unsigned int offset, found =3D SWAP_ENTRY_INVALID; + unsigned int offset; =20 /* * Once allocated, swap_info_struct will never be completely freed, @@ -1329,22 +1346,18 @@ static bool swap_alloc_fast(swp_entry_t *entry, if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - found =3D alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE); - if (found) - *entry =3D swp_entry(si->type, found); + alloc_swap_scan_cluster(si, ci, folio, offset); } else { swap_cluster_unlock(ci); } =20 put_swap_device(si); - return !!found; + return folio_test_swapcache(folio); } =20 /* Rotate the device and switch to a new cluster */ -static bool swap_alloc_slow(swp_entry_t *entry, - int order) +static void swap_alloc_slow(struct folio *folio) { - unsigned long offset; struct swap_info_struct *si, *next; =20 spin_lock(&swap_avail_lock); @@ -1354,14 +1367,12 @@ static bool swap_alloc_slow(swp_entry_t *entry, plist_requeue(&si->avail_list, &swap_avail_head); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - offset =3D cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE); + cluster_alloc_swap_entry(si, folio); put_swap_device(si); - if (offset) { - *entry 
=3D swp_entry(si->type, offset); - return true; - } - if (order) - return false; + if (folio_test_swapcache(folio)) + return; + if (folio_test_large(folio)) + return; } =20 spin_lock(&swap_avail_lock); @@ -1423,7 +1434,6 @@ int folio_alloc_swap(struct folio *folio) { unsigned int order =3D folio_order(folio); unsigned int size =3D 1 << order; - swp_entry_t entry =3D {}; =20 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio); @@ -1448,39 +1458,23 @@ int folio_alloc_swap(struct folio *folio) =20 again: local_lock(&percpu_swap_cluster.lock); - if (!swap_alloc_fast(&entry, order)) - swap_alloc_slow(&entry, order); + if (!swap_alloc_fast(folio)) + swap_alloc_slow(folio); local_unlock(&percpu_swap_cluster.lock); =20 - if (unlikely(!order && !entry.val)) { + if (!order && unlikely(!folio_test_swapcache(folio))) { if (swap_sync_discard()) goto again; } =20 /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */ - if (mem_cgroup_try_charge_swap(folio, entry)) - goto out_free; + if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap))) + swap_cache_del_folio(folio); =20 - if (!entry.val) + if (unlikely(!folio_test_swapcache(folio))) return -ENOMEM; =20 - /* - * Allocator has pinned the slots with SWAP_HAS_CACHE - * so it should never fail - */ - WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true)); - - /* - * Allocator should always allocate aligned entries so folio based - * operations never crossed more than one cluster. - */ - VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio); - return 0; - -out_free: - put_swap_folio(folio, entry); - return -ENOMEM; } =20 /** @@ -1779,29 +1773,6 @@ static void swap_entries_free(struct swap_info_struc= t *si, partial_free_cluster(si, ci); } =20 -/* - * Called after dropping swapcache to decrease refcnt to swap entries. 
- */ -void put_swap_folio(struct folio *folio, swp_entry_t entry) -{ - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long offset =3D swp_offset(entry); - int size =3D 1 << swap_entry_order(folio_order(folio)); - - si =3D _swap_info_get(entry); - if (!si) - return; - - ci =3D swap_cluster_lock(si, offset); - if (swap_only_has_cache(si, offset, size)) - swap_entries_free(si, ci, entry, size); - else - for (int i =3D 0; i < size; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); - swap_cluster_unlock(ci); -} - int __swap_count(swp_entry_t entry) { struct swap_info_struct *si =3D __swap_entry_to_info(entry); @@ -2052,7 +2023,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type) * with swap table allocation. */ local_lock(&percpu_swap_cluster.lock); - offset =3D cluster_alloc_swap_entry(si, 0, 1); + offset =3D cluster_alloc_swap_entry(si, NULL); local_unlock(&percpu_swap_cluster.lock); if (offset) { entry =3D swp_entry(si->type, offset); --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pf1-f180.google.com (mail-pf1-f180.google.com [209.85.210.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 739502ECE9B for ; Wed, 29 Oct 2025 16:00:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.180 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753617; cv=none; b=BqDsThyxH4b2dOIMPd2gDN+nWgmeT8sNJQOU7ev21HY7OjFTcJHmw4QSApuwlMwtDbJY1u5gJWQMDhdVPsi9aI3Myc9O3VTBNfm0GPDdSF+YR1dhlIMOOSDnBZ8GxZzVfLKNrzpiBeqxIkqAqjss0wfpyPWgt2F7Lx0jutkF5Xk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753617; c=relaxed/simple; bh=8Cy65yUBycdypGydFeuJoL8OWFxHeEQ49FZio9knumo=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; 
b=XJw4WJ3ei0s24L6C/kDKNHrYewHGEaIMMN2lacsMeZe3ASPFK4dtxjzlC+lLEtC1eGiLhHab3VV9HGY6SjT9EPhjhHVYAJskIjXeErxf5mCMK8dzO1Gys6J8vscnQy0WMya/XE1QcRXVDMGMRahpeLoCe4bqrkAfam27rkos6U8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=atMskfNA; arc=none smtp.client-ip=209.85.210.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="atMskfNA" Received: by mail-pf1-f180.google.com with SMTP id d2e1a72fcca58-781997d195aso53739b3a.3 for ; Wed, 29 Oct 2025 09:00:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753615; x=1762358415; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=8nOl+SU4PPYP3fCDuFL6r8khIpsm6yNyf/3hVmtV4SM=; b=atMskfNA3Z5y/lgIXhXR8lg/WEkwPK8h5gh/C6zbqNrB0SVsyYLJoBsrMbDnijKf/p qzARzXDkkG2JC+dcMjlbYxuwTSlKPHhALsUZpCgQozJk1sJWyQ7OAuWlbtTO8It7AigH XFzWx+WbRoym3PlL+IENq6ghKtg3ynleZUrHbP0GWq8Ca+Vk5yT2TwipKldpFHIUpcXr f1LylYhThPvSuALI3GlG58l23PPS8SHwepZGzROyDaVc5g/DEcnp5gb0fxgpput5+hDh k00ULpiYpEHhhZKuEUVyTnAAXHIIVRbixI7Zx159W0tXKGos7r21qNqITSOb0LYSmNr8 wrbQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753615; x=1762358415; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=8nOl+SU4PPYP3fCDuFL6r8khIpsm6yNyf/3hVmtV4SM=; b=ckezyRmlvIahYGUwj7q0hOMz9ElBU+xM/TWbbLadvwh8X3Pj95VtJXQF7z2GhHw4RG 
n3exTkEEs8YCZXMRuvO3eZZsDXvfm9eiecbXFbiaENuvizOpGvMoujcz/nR9IXOMk/HY +5vaVZyUayDze2g0svLX/03UG5N3UoVumS9LwHJFsMKDLfok/PlTsBYUDSv7NbARo7OS do1DYY04IZ6HPQLgNQQRrOoZ90w8n1GcUJqOfx3vlW65/Wp1BIdLIMFi9N+N1GOEwi3B 3zA/RVA8tUX8X1Xd115Zg6kgOPYXbn46FH826LaZz+ZKJx3qUdvdg3lIk/Oug5WKy8ud tCkA== X-Forwarded-Encrypted: i=1; AJvYcCWBf+kXIpinH2kkbvm2hBoAZ8rVmBNU4ke7TNvuSMzmkpGLNAIri+vrYuZ9AoGOzs+8jlY3Yj0Hhwqrc+E=@vger.kernel.org X-Gm-Message-State: AOJu0Yzyx4Q0gKoKsepS/BZeop0uq5bjo2pdpLFsTQFML6DfHHhB73+G k0E5oovVUi5RHNJ+mSSfqQuKHeShPryNwvtcV6nIeQ53d8ZzgwOpRh2k X-Gm-Gg: ASbGncvDf+p+n1BVUoB3+3lwWakCNcjNLzjHFJBe+IWMkG+cLJ8AfwlLFLeoPwaJ12p kOJrEi8yVCKCQvUu5IOFepXKeXQ1jTORDtEhs0mJBV5/P5n8zisk++G4+Berxnt6D5InZz0lRvQ yUoAxfHI5de4vw/pYb4her744lutzn2uMKx9XoAUd1PNaI9gb6rk5hw6PiporF+hcdlLbkLQgMO kBFNU06TBXN0xOe+Q4xrUUHv5IQBwza8PxangPRevHvQws0P0C0jHngnkaK/6vPbCvM0qzAiJEU mSS0BYA0RW8DdkmLJLuSLdnnwhkm2rnvqAQDreSAy070AkuD1pqTr8owgm6mxj9l/p0vdQ9yYO4 LS5R2DiWj4rgfrOQ78Z3/srRvomOeX3tggdkqr0kBT98YnjLBwaAdUN/bW2Oebr9j73Wif5qPva 3D6l7RQ/MLG5YMsmgzT7Wr X-Google-Smtp-Source: AGHT+IEdDsTYIcn1GU0ePW/OQ5eneMt6fHxiUdQrsAB+BibqdhzsnSDNaoYabLh0JhIuhC4B204MNA== X-Received: by 2002:a17:90b:1dcc:b0:32e:4924:690f with SMTP id 98e67ed59e1d1-3403a143773mr3962982a91.6.1761753614344; Wed, 29 Oct 2025 09:00:14 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.09.00.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 09:00:13 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:42 +0800 Subject: [PATCH 16/19] mm, swap: check swap table directly for checking cache Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-16-3d43f3b6ec32@tencent.com> References: 
<20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song Instead of looking at the swap map, check swap table directly to tell if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE. Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swap.h | 11 ++++++++--- mm/swap_state.c | 16 ++++++++++++++++ mm/swapfile.c | 55 +++++++++++++++++++++++++++++-----------------------= --- mm/userfaultfd.c | 10 +++------- 4 files changed, 56 insertions(+), 36 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index 03694ffa662f..73f07bcea5f0 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *= si, * swap entries in the page table, similar to locking swap cache folio. * - See the comment of get_swap_device() for more complex usage. */ +bool swap_cache_check_folio(swp_entry_t entry); struct folio *swap_cache_get_folio(swp_entry_t entry); void *swap_cache_get_shadow(swp_entry_t entry); void swap_cache_del_folio(struct folio *folio); @@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry,= int max_nr, =20 static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) { - struct swap_info_struct *si =3D __swap_entry_to_info(entry); - pgoff_t offset =3D swp_offset(entry); int i; =20 /* @@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry= , int max_nr) * be in conflict with the folio in swap cache. 
*/ for (i =3D 0; i < max_nr; i++) { - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE)) + if (swap_cache_check_folio(entry)) return i; + entry.val++; } =20 return i; @@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio, return 0; } =20 +static inline bool swap_cache_check_folio(swp_entry_t entry) +{ + return false; +} + static inline struct folio *swap_cache_get_folio(swp_entry_t entry) { return NULL; diff --git a/mm/swap_state.c b/mm/swap_state.c index 85d9f99c384f..41d4fa056203 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry) return NULL; } =20 +/** + * swap_cache_check_folio - Check if a swap slot has cache. + * @entry: swap entry indicating the slot. + * + * Context: Caller must ensure @entry is valid and protect the swap + * device with reference count or locks. + */ +bool swap_cache_check_folio(swp_entry_t entry) +{ + unsigned long swp_tb; + + swp_tb =3D swap_table_get(__swap_entry_to_cluster(entry), + swp_cluster_offset(entry)); + return swp_tb_is_folio(swp_tb); +} + /** * swap_cache_get_shadow - Looks up a shadow in the swap cache. * @entry: swap entry used for the lookup. 
diff --git a/mm/swapfile.c b/mm/swapfile.c index 8d98f28907bc..3b7df5768d7f 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -788,23 +788,18 @@ static unsigned int cluster_reclaim_range(struct swap= _info_struct *si, unsigned int nr_pages =3D 1 << order; unsigned long offset =3D start, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - int nr_reclaim; + unsigned long swp_tb; =20 spin_unlock(&ci->lock); do { - switch (READ_ONCE(map[offset])) { - case 0: + if (swap_count(READ_ONCE(map[offset]))) break; - case SWAP_HAS_CACHE: - nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); - if (nr_reclaim < 0) - goto out; - break; - default: - goto out; + swp_tb =3D swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swp_tb_is_folio(swp_tb)) { + if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0) + break; } } while (++offset < end); -out: spin_lock(&ci->lock); =20 /* @@ -820,37 +815,41 @@ static unsigned int cluster_reclaim_range(struct swap= _info_struct *si, * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. 
*/ - for (offset =3D start; offset < end; offset++) - if (READ_ONCE(map[offset])) + for (offset =3D start; offset < end; offset++) { + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb)) return SWAP_ENTRY_INVALID; + } =20 return start; } =20 static bool cluster_scan_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages, + unsigned long offset, unsigned int nr_pages, bool *need_reclaim) { - unsigned long offset, end =3D start + nr_pages; + unsigned long end =3D offset + nr_pages; unsigned char *map =3D si->swap_map; + unsigned long swp_tb; =20 if (cluster_is_empty(ci)) return true; =20 - for (offset =3D start; offset < end; offset++) { - switch (READ_ONCE(map[offset])) { - case 0: - continue; - case SWAP_HAS_CACHE: + do { + if (swap_count(map[offset])) + return false; + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swp_tb_is_folio(swp_tb)) { + WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE)); if (!vm_swap_full()) return false; *need_reclaim =3D true; - continue; - default: - return false; + } else { + /* A entry with no count and no cache must be null */ + VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); } - } + } while (++offset < end); =20 return true; } @@ -1013,7 +1012,8 @@ static void swap_reclaim_full_clusters(struct swap_in= fo_struct *si, bool force) to_scan--; =20 while (offset < end) { - if (READ_ONCE(map[offset]) =3D=3D SWAP_HAS_CACHE) { + if (!swap_count(READ_ONCE(map[offset])) && + swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); @@ -1957,6 +1957,7 @@ void swap_put_entries_direct(swp_entry_t entry, int n= r) struct swap_info_struct *si; bool any_only_cache =3D false; unsigned long offset; + unsigned long swp_tb; =20 si =3D get_swap_device(entry); if (WARN_ON_ONCE(!si)) @@ -1981,7 +1982,9 @@ void 
swap_put_entries_direct(swp_entry_t entry, int n= r) */ for (offset =3D start_offset; offset < end_offset; offset +=3D nr) { nr =3D 1; - if (READ_ONCE(si->swap_map[offset]) =3D=3D SWAP_HAS_CACHE) { + swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, offset), + offset % SWAPFILE_CLUSTER); + if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_= tb)) { /* * Folios are always naturally aligned in swap so * advance forward to the next boundary. Zero means no diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 00122f42718c..5411fd340ac3 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1184,17 +1184,13 @@ static int move_swap_pte(struct mm_struct *mm, stru= ct vm_area_struct *dst_vma, * Check if the swap entry is cached after acquiring the src_pte * lock. Otherwise, we might miss a newly loaded swap cache folio. * - * Check swap_map directly to minimize overhead, READ_ONCE is sufficient. * We are trying to catch newly added swap cache, the only possible case= is * when a folio is swapped in and out again staying in swap cache, using= the * same entry before the PTE check above. The PTL is acquired and releas= ed - * twice, each time after updating the swap_map's flag. So holding - * the PTL here ensures we see the updated value. False positive is poss= ible, - * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the - * cache, or during the tiny synchronization window between swap cache a= nd - * swap_map, but it will be gone very quickly, worst result is retry jit= ters. + * twice, each time after updating the swap table. So holding + * the PTL here ensures we see the updated value. 
*/ - if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) { + if (swap_cache_check_folio(entry)) { double_pt_unlock(dst_ptl, src_ptl); return -EAGAIN; } --=20 2.51.1 From nobody Sun Feb 8 17:03:32 2026 Received: from mail-pl1-f176.google.com (mail-pl1-f176.google.com [209.85.214.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8505934AB06 for ; Wed, 29 Oct 2025 16:00:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.176 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753622; cv=none; b=Npf4U3/DPypyLvFNOw4DLCxhmBtnZLvpqZtw0knb0OfG4SqXfLBb7FrvgUABV3ixWCOhHjO+H/pWoVL21wtdPSwLEoT7DK11BwLDorjUV0rISITAo6D3ldg+cr79qxz28lHjnTDzR9PeZ1uFsbhdo9+p6Iez+/fDGQ05/MIn+VY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761753622; c=relaxed/simple; bh=SBcMbXsf3YpRjcnepEnmkt7v+J8flPyweFFgb9xT/9o=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=OuhZfEGih7zr4hN8kihhBhz3ppzkW4trh+4tr5wqGuSs6UoEEvi3P+FzkBdWgKRFaxFssLUQIpjQb5f8+fryuO3osmN22HwoMhvCfT90NGUJ2tTbykFskn8+uJ1GUd/ppZvuknNNewD558AJyYUqgT0vN8WPWdNPLlP3lT+M044= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=Mj4qSBkD; arc=none smtp.client-ip=209.85.214.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Mj4qSBkD" Received: by mail-pl1-f176.google.com with SMTP id d9443c01a7336-290cd62acc3so82793115ad.2 for ; Wed, 29 Oct 2025 09:00:20 
-0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761753620; x=1762358420; darn=vger.kernel.org; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:from:to:cc:subject:date:message-id :reply-to; bh=8zoVrXsn7rvX5B7PiDhrEJ9ULAwxtbalHy9W90Ojrs4=; b=Mj4qSBkDRWkbeK8pV7uVy050xalFC7lfoxzqinf2vxb+UU3ReJME9sEqwV2E0Rw9Ad V8lVNk3hvtYNqQOJqQ7GCrSTFy/3W/csWAfd4P8S5BEyVUayJPfXUeo+m/Fipc+7lcYp 9W0ops1PSBpLs0SMPGCdJUKx5zUh8Odi4sOSA66UWr8aADbIaIvvXikS6PJnK9hrRXLI eHrkOMdz7WLoeZcPyaxh1elfGtrKsZQ7UsEHDG+Wy8+3ZR6G9f1cTvvXi144sS+yE7WZ MD2tT+WJUm2vqGTUWH4qDrIwuPQkr6AXQoXtTrQPjfhxAqq4mBoi3XAFqGoA1b3Sm0HT Jkdg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761753620; x=1762358420; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:subject:date:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=8zoVrXsn7rvX5B7PiDhrEJ9ULAwxtbalHy9W90Ojrs4=; b=TB/zFpltkZUue9xL55xTiQVSlA/sK1NIgrzgB8IgnDgMWFSHCEjLDGsHgzCHV1IXr6 Xh3q2bZI1s2JvnqAADefB7RAPr0Fo7nqURiDxsExCgBmCTxJuFzUioZCZ/MQHFOL6kP1 tzvCEW6ww1kkhm4DowyEfBcpOSXza7RVT76+YOi2Bzv3TDp1aOWDwEuv2MPFZ5ZIIL0p GmbOP3xcK4isETUX0e8EJ6OcKS85ZBiwvHZDP4a/6Ope+VqTlR4Iys56L4kzVROivQQf VeFeSPK8WXhH+ESa7XR83nE+JxZgAXcuIoDB4nD+4p5mYgM9JzALby8aKLyRbWFDTBRO J13g== X-Forwarded-Encrypted: i=1; AJvYcCVYmRC8o/Nwy95qqe+svrkbLqYo/yLkxgxfgD+MaMrpxkoZ+W7XK96t6UrX6THgxE2ZaEicxL1XHWUg6wA=@vger.kernel.org X-Gm-Message-State: AOJu0Yxyu0WsBppZLy3sGX2f0897DmAwzXugNdGQBA4s3VALhSv6vxsX PeGYfNfr2aDLCSxxErCaZ3z10fWMjGi+Dy4eHuMRaXcHNwwwoybgnk+v X-Gm-Gg: ASbGncttRr4IA9YND6mfk4T8k/lNrDP7zb2HMwBcIkvElYQSrXraKlseBwjb1PN2M+h 5bL4DAwhO/1cIPBf2W42yNX0NBP30ucue1szTP3aRcyrbfHJGCsgmIiOqkMcMbs5okx1Eikers3 X9PjgdyFfWRsPBTdZIB2ycnymzMptHtOm3JT7TmLvMVjW7xfNq0khtxNZKn2Iq/ONb4kQuPXkqX 0BhorvDrHsvxS4kkfYnLJGh+2MBcKozidfJHfI0eOhR1NtIbKBSfvNYilw1COIR1kQUw/Dw/OEn 
XKMuilFLvz4/+ZBqSq7Am/hQiNYNriZnxvpiwMNtBWgV5UUmvXX6BZEcGe4UAv/qXwVSvOI34cl YPOvFOHl0SisGXqCBA+0POdGeuAYi37GvtAx9q0HKcwBnHw19PjSNSZjfmJPj2nKWQr0dhLEcyU 6buheN53fXj1Bq1d7DPIWgoqM0JuIE/9c= X-Google-Smtp-Source: AGHT+IHuvrF0Tu5VI3bO4fs+LvM+8CgRisU1xPC+GDqoRvkRBpcU82/x1EVGEJmBaKM2vuBkXY2ETw== X-Received: by 2002:a17:902:ea0c:b0:294:9919:b29f with SMTP id d9443c01a7336-294def471f3mr45839805ad.58.1761753619498; Wed, 29 Oct 2025 09:00:19 -0700 (PDT) Received: from [127.0.0.1] ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-33fed7e95aasm16087366a91.8.2025.10.29.09.00.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Oct 2025 09:00:18 -0700 (PDT) From: Kairui Song Date: Wed, 29 Oct 2025 23:58:43 +0800 Subject: [PATCH 17/19] mm, swap: clean up and improve swap entries freeing Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20251029-swap-table-p2-v1-17-3d43f3b6ec32@tencent.com> References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , Baoquan He , Barry Song , Chris Li , Nhat Pham , Johannes Weiner , Yosry Ahmed , David Hildenbrand , Youngjun Park , Hugh Dickins , Baolin Wang , "Huang, Ying" , Kemeng Shi , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , linux-kernel@vger.kernel.org, Kairui Song X-Mailer: b4 0.14.3 From: Kairui Song There are a few problems with the current freeing of swap entries. When freeing a set of swap entries directly (swap_put_entries_direct, typically from zapping the page table), it scans the whole swap region multiple times. First, it scans the whole region to check if it can be batch freed and if there is any cached folio. Then do a batch free only if the whole region's swap count equals 1. 
If any entry is cached, even just one, the whole region has to be walked again to clean up the cache. If any entry is not in a state consistent with the other entries, it falls back to order 0 freeing. For example, if only one of them is cached, the batch free falls back. The current batch freeing workflow also relies on the swap map's SWAP_HAS_CACHE bit for both the contiguity check and the batch freeing itself, which isn't compatible with the swap table design. Tidy this up by introducing a new cluster-scoped helper for all swap entry freeing. It batch frees all contiguous entries and simply starts a new batch when an inconsistent entry is found. This may improve the batch size when the clusters are fragmented. It should also be more robust, with more sanity checks, and it makes clear that a slot pinned by the swap cache will be cleared upon cache reclaim. The cache reclaim scan is now also limited to each cluster. If a cluster has any clean swap cache left after putting the swap count, reclaim only that cluster instead of the whole region. And since a folio's entries are always in the same cluster, putting swap entries from a folio can also use the new helper directly. This should be both an optimization and a cleanup, and the new helper is adapted to the swap table.
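The batching rule described above can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions, not the kernel implementation: `struct slot`, `free_batch()` and `put_range()` are hypothetical names made up for this example, there is no locking, and a plain boolean stands in for the swap table's folio entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of a swap slot; not a kernel structure. */
struct slot {
	unsigned int count;	/* swap count (number of references) */
	bool cached;		/* true if a folio is cached for this slot */
};

static int nr_batches;	/* number of batch frees issued so far */
static int nr_freed;	/* total number of slots released by batch frees */

/* Stand-in for a batch free: release slots [start, start + nr). */
static void free_batch(struct slot *slots, int start, int nr)
{
	for (int i = start; i < start + nr; i++)
		slots[i].count = 0;
	nr_batches++;
	nr_freed += nr;
}

/*
 * Drop one reference from every slot in [start, start + nr).
 *
 * Slots with count == 1 and no cache are collected into a pending batch.
 * Any other slot only has its count decreased, and it interrupts the
 * pending batch, which is then freed in one go.
 *
 * Returns true if some slot dropped to count 0 while still cached,
 * meaning a later cache reclaim pass could free it.
 */
static bool put_range(struct slot *slots, int start, int nr)
{
	int batch_start = -1;	/* -1 means no pending batch */
	bool need_reclaim = false;

	for (int i = start; i < start + nr; i++) {
		assert(slots[i].count >= 1);	/* the count must not underflow */

		if (slots[i].count == 1 && !slots[i].cached) {
			if (batch_start < 0)
				batch_start = i;
			continue;
		}
		if (slots[i].count == 1)
			need_reclaim = true;	/* count hits 0, cache pins the slot */

		slots[i].count--;
		if (batch_start >= 0) {
			free_batch(slots, batch_start, i - batch_start);
			batch_start = -1;
		}
	}
	if (batch_start >= 0)
		free_batch(slots, batch_start, start + nr - batch_start);

	return need_reclaim;
}
```

In this model, a six-slot range with counts 1, 1, 2, 1, 1, 1 where only the fifth slot is cached is handled in a single pass: three batches are issued covering four slots in total, the third slot keeps a count of 1, and the fifth slot is left at count 0 but still cached, so the function reports that reclaim is worthwhile.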
Signed-off-by: Kairui Song Suggested-by: Chris Li --- mm/swapfile.c | 238 +++++++++++++++++++++++-------------------------------= ---- 1 file changed, 96 insertions(+), 142 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 3b7df5768d7f..12a1ab6f7b32 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -55,12 +55,14 @@ static bool swap_count_continued(struct swap_info_struc= t *, pgoff_t, static void free_swap_count_continuations(struct swap_info_struct *); static void swap_entries_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages); + unsigned long start, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); -static bool swap_entries_put_map(struct swap_info_struct *si, - swp_entry_t entry, int nr); +static void swap_put_entry_locked(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -197,25 +199,6 @@ static bool swap_only_has_cache(struct swap_info_struc= t *si, return true; } =20 -static bool swap_is_last_map(struct swap_info_struct *si, - unsigned long offset, int nr_pages, bool *has_cache) -{ - unsigned char *map =3D si->swap_map + offset; - unsigned char *map_end =3D map + nr_pages; - unsigned char count =3D *map; - - if (swap_count(count) !=3D 1) - return false; - - while (++map < map_end) { - if (*map !=3D count) - return false; - } - - *has_cache =3D !!(count & SWAP_HAS_CACHE); - return true; -} - /* * returns number of pages in the folio that backs the swap entry. If posi= tive, * the folio was reclaimed. If negative, the folio was not reclaimed. 
If 0= , no @@ -1420,6 +1403,76 @@ static bool swap_sync_discard(void) return false; } =20 +/** + * swap_put_entries_cluster - Decrease the swap count of a set of slots. + * @si: The swap device. + * @start: start offset of slots. + * @nr: number of slots. + * @reclaim_cache: if true, also reclaim the swap cache. + * + * This helper decreases the swap count of a set of slots and tries to + * batch free them. Also reclaims the swap cache if @reclaim_cache is true. + * Context: The caller must ensure that all slots belong to the same + * cluster and their swap count doesn't go underflow. + */ +static void swap_put_entries_cluster(struct swap_info_struct *si, + unsigned long start, int nr, + bool reclaim_cache) +{ + unsigned long offset =3D start, end =3D start + nr; + unsigned long batch_start =3D SWAP_ENTRY_INVALID; + struct swap_cluster_info *ci; + bool need_reclaim =3D false; + unsigned int nr_reclaimed; + unsigned long swp_tb; + unsigned int count; + + ci =3D swap_cluster_lock(si, offset); + do { + swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); + count =3D si->swap_map[offset]; + VM_WARN_ON(swap_count(count) < 1 || count =3D=3D SWAP_MAP_BAD); + if (swap_count(count) =3D=3D 1) { + /* count =3D=3D 1 and non-cached slots will be batch freed. */ + if (!swp_tb_is_folio(swp_tb)) { + if (!batch_start) + batch_start =3D offset; + continue; + } + /* count will be 0 after put, slot can be reclaimed */ + VM_WARN_ON(!(count & SWAP_HAS_CACHE)); + need_reclaim =3D true; + } + /* + * A count !=3D 1 or cached slot can't be freed. Put its swap + * count and then free the interrupted pending batch. Cached + * slots will be freed when folio is removed from swap cache + * (__swap_cache_del_folio). 
+ */ + swap_put_entry_locked(si, ci, offset, 1); + if (batch_start) { + swap_entries_free(si, ci, batch_start, offset - batch_start); + batch_start =3D SWAP_ENTRY_INVALID; + } + } while (++offset < end); + + if (batch_start) + swap_entries_free(si, ci, batch_start, offset - batch_start); + swap_cluster_unlock(ci); + + if (!need_reclaim || !reclaim_cache) + return; + + offset =3D start; + do { + nr_reclaimed =3D __try_to_reclaim_swap(si, offset, + TTRS_UNMAPPED | TTRS_FULL); + offset++; + if (nr_reclaimed) + offset =3D round_up(offset, abs(nr_reclaimed)); + } while (offset < end); +} + /** * folio_alloc_swap - allocate swap space for a folio * @folio: folio we want to move to swap @@ -1521,6 +1574,7 @@ void folio_put_swap(struct folio *folio, struct page = *subpage) { swp_entry_t entry =3D folio->swap; unsigned long nr_pages =3D folio_nr_pages(folio); + struct swap_info_struct *si =3D __swap_entry_to_info(entry); =20 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); @@ -1530,7 +1584,7 @@ void folio_put_swap(struct folio *folio, struct page = *subpage) nr_pages =3D 1; } =20 - swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages); + swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false); } =20 static struct swap_info_struct *_swap_info_get(swp_entry_t entry) @@ -1567,12 +1621,11 @@ static struct swap_info_struct *_swap_info_get(swp_= entry_t entry) return NULL; } =20 -static unsigned char swap_entry_put_locked(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, - unsigned char usage) +static void swap_put_entry_locked(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned char usage) { - unsigned long offset =3D swp_offset(entry); unsigned char count; unsigned char has_cache; =20 @@ -1598,9 +1651,7 @@ static unsigned char swap_entry_put_locked(struct swa= p_info_struct *si, if (usage) WRITE_ONCE(si->swap_map[offset], 
usage); else - swap_entries_free(si, ci, entry, 1); - - return usage; + swap_entries_free(si, ci, offset, 1); } =20 /* @@ -1668,70 +1719,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t= entry) return NULL; } =20 -static bool swap_entries_put_map(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - unsigned long offset =3D swp_offset(entry); - struct swap_cluster_info *ci; - bool has_cache =3D false; - unsigned char count; - int i; - - if (nr <=3D 1) - goto fallback; - count =3D swap_count(data_race(si->swap_map[offset])); - if (count !=3D 1) - goto fallback; - - ci =3D swap_cluster_lock(si, offset); - if (!swap_is_last_map(si, offset, nr, &has_cache)) { - goto locked_fallback; - } - if (!has_cache) - swap_entries_free(si, ci, entry, nr); - else - for (i =3D 0; i < nr; i++) - WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); - swap_cluster_unlock(ci); - - return has_cache; - -fallback: - ci =3D swap_cluster_lock(si, offset); -locked_fallback: - for (i =3D 0; i < nr; i++, entry.val++) { - count =3D swap_entry_put_locked(si, ci, entry, 1); - if (count =3D=3D SWAP_HAS_CACHE) - has_cache =3D true; - } - swap_cluster_unlock(ci); - return has_cache; -} - -/* - * Only functions with "_nr" suffix are able to free entries spanning - * cross multi clusters, so ensure the range is within a single cluster - * when freeing entries with functions without "_nr" suffix. - */ -static bool swap_entries_put_map_nr(struct swap_info_struct *si, - swp_entry_t entry, int nr) -{ - int cluster_nr, cluster_rest; - unsigned long offset =3D swp_offset(entry); - bool has_cache =3D false; - - cluster_rest =3D SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER; - while (nr) { - cluster_nr =3D min(nr, cluster_rest); - has_cache |=3D swap_entries_put_map(si, entry, cluster_nr); - cluster_rest =3D SWAPFILE_CLUSTER; - nr -=3D cluster_nr; - entry.val +=3D cluster_nr; - } - - return has_cache; -} - /* * Check if it's the last ref of swap entry in the freeing path. 
*/ @@ -1746,9 +1733,9 @@ static inline bool __maybe_unused swap_is_last_ref(un= signed char count) */ static void swap_entries_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages) + unsigned long offset, unsigned int nr_pages) { - unsigned long offset =3D swp_offset(entry); + swp_entry_t entry =3D swp_entry(si->type, offset); unsigned char *map =3D si->swap_map + offset; unsigned char *map_end =3D map + nr_pages; =20 @@ -1954,10 +1941,8 @@ void swap_put_entries_direct(swp_entry_t entry, int = nr) { const unsigned long start_offset =3D swp_offset(entry); const unsigned long end_offset =3D start_offset + nr; + unsigned long offset, cluster_end; struct swap_info_struct *si; - bool any_only_cache =3D false; - unsigned long offset; - unsigned long swp_tb; =20 si =3D get_swap_device(entry); if (WARN_ON_ONCE(!si)) @@ -1965,44 +1950,13 @@ void swap_put_entries_direct(swp_entry_t entry, int= nr) if (WARN_ON_ONCE(end_offset > si->max)) goto out; =20 - /* - * First free all entries in the range. - */ - any_only_cache =3D swap_entries_put_map_nr(si, entry, nr); - - /* - * Short-circuit the below loop if none of the entries had their - * reference drop to zero. - */ - if (!any_only_cache) - goto out; - - /* - * Now go back over the range trying to reclaim the swap cache. - */ - for (offset =3D start_offset; offset < end_offset; offset +=3D nr) { - nr =3D 1; - swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, offset), - offset % SWAPFILE_CLUSTER); - if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_= tb)) { - /* - * Folios are always naturally aligned in swap so - * advance forward to the next boundary. Zero means no - * folio was found for the swap entry, so advance by 1 - * in this case. Negative value means folio was found - * but could not be reclaimed. Here we can still advance - * to the next boundary. 
- */ - nr =3D __try_to_reclaim_swap(si, offset, - TTRS_UNMAPPED | TTRS_FULL); - if (nr =3D=3D 0) - nr =3D 1; - else if (nr < 0) - nr =3D -nr; - nr =3D ALIGN(offset + 1, nr) - offset; - } - } - + /* Put entries and reclaim cache in each cluster */ + offset =3D start_offset; + do { + cluster_end =3D min(round_up(offset + 1, SWAPFILE_CLUSTER), end_offset); + swap_put_entries_cluster(si, offset, cluster_end - offset, true); + offset =3D cluster_end; + } while (offset < end_offset); out: put_swap_device(si); } @@ -2051,7 +2005,7 @@ void swap_free_hibernation_slot(swp_entry_t entry) return; =20 ci =3D swap_cluster_lock(si, offset); - swap_entry_put_locked(si, ci, entry, 1); + swap_put_entry_locked(si, ci, offset, 1); WARN_ON(swap_entry_swapped(si, offset)); swap_cluster_unlock(ci); =20 @@ -3799,10 +3753,10 @@ void __swapcache_clear_cached(struct swap_info_stru= ct *si, swp_entry_t entry, unsigned int nr) { if (swap_only_has_cache(si, swp_offset(entry), nr)) { - swap_entries_free(si, ci, entry, nr); + swap_entries_free(si, ci, swp_offset(entry), nr); } else { for (int i =3D 0; i < nr; i++, entry.val++) - swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); + swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE); } } =20 @@ -3923,7 +3877,7 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) * into, carry if so, or else fail until a new continuation page is alloca= ted; * when the original swap_map count is decremented from 0 with continuatio= n, * borrow from the continuation and report whether it still holds more. - * Called while __swap_duplicate() or caller of swap_entry_put_locked() + * Called while __swap_duplicate() or caller of swap_put_entry_locked() * holds cluster lock. 
 */
 static bool swap_count_continued(struct swap_info_struct *si,
--=20
2.51.1

From nobody Sun Feb 8 17:03:32 2026
From: Kairui Song
Date: Wed, 29 Oct 2025 23:58:44 +0800
Subject: [PATCH 18/19] mm, swap: drop the SWAP_HAS_CACHE flag
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20251029-swap-table-p2-v1-18-3d43f3b6ec32@tencent.com>
References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park, Hugh Dickins, Baolin Wang, "Huang, Ying", Kemeng Shi, Lorenzo Stoakes, "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song

From: Kairui Song

The swap cache is now managed by the swap table, and all swap cache users check the swap table directly to determine the swap cache state. SWAP_HAS_CACHE is now just a temporary pin before the first increase of a slot's swap count from 0 to 1 (swap_dup_entries), or before the final free of slots pinned by a folio in the swap cache (put_swap_folio). Drop these two usages.

For the first dup, SWAP_HAS_CACHE pinning was hard to kill because it used to have multiple meanings, more than just "a slot is cached".
We have simplified that and defined that the first dup is always done with the folio locked in the swap cache (folio_dup_swap), so it can just check the swap cache (swap table) directly.

As for freeing, just let the swap cache free all swap entries of a folio that have a swap count of zero directly upon folio removal. The freeing path has also been cleaned up to take the swap cache state in the swap table into account: a slot with swap cache will not be freed until its cache is gone. The removal of a folio and the freeing of its slots are now done in the same critical section, which should improve performance and gets rid of the SWAP_HAS_CACHE pin.

After these two changes, SWAP_HAS_CACHE no longer has any users. Remove all related logic and helpers. swap_map is now only used for tracking the count, so all swap_map users can just read it directly, without the swap_count helper, which was previously used to filter out the SWAP_HAS_CACHE bit.

The idea of dropping SWAP_HAS_CACHE and using the swap table directly came from Chris's idea of merging all the metadata usage of all swaps into one place.
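The freeing rule described above can be illustrated with a small userspace sketch. This is a toy model under stated assumptions, not the kernel code in this patch: every toy_* name is hypothetical, `count` stands in for swap_map[offset] (now holding only the swap count), and `cached` stands in for swp_tb_is_folio() on the slot's swap table entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy userspace model, NOT kernel code. All toy_* names are hypothetical.
 * count  models swap_map[offset], which holds only the swap count.
 * cached models swp_tb_is_folio() on the slot's swap table entry.
 * in_use models whether the slot is still allocated.
 */
struct toy_slot {
	unsigned char count;
	bool cached;
	bool in_use;
};

/* Models the put path: drop one count, free only if no folio caches the slot. */
static void toy_put_entry(struct toy_slot *slot)
{
	assert(slot->in_use && slot->count > 0);
	slot->count--;
	if (!slot->count && !slot->cached)
		slot->in_use = false;
}

/*
 * Models folio removal from the swap cache: clear the cache state and free
 * every slot whose count is already zero, in the same pass.
 */
static void toy_cache_del_folio(struct toy_slot *slots, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		slots[i].cached = false;
		if (!slots[i].count)
			slots[i].in_use = false;
	}
}
```

In this model a slot whose count drops to zero while it is still cached stays allocated, and is released only when the folio leaves the cache. No separate flag bit is needed to express that pin.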
Suggested-by: Chris Li Signed-off-by: Kairui Song --- include/linux/swap.h | 1 - mm/swap.h | 13 ++-- mm/swap_state.c | 28 +++++---- mm/swapfile.c | 163 ++++++++++++++++-------------------------------= ---- 4 files changed, 71 insertions(+), 134 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 4b4b81fbc6a3..dcb1760e36c3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -224,7 +224,6 @@ enum { #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX =20 /* Bit flag in swap_map */ -#define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ #define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count = */ =20 /* Special value in first swap_map */ diff --git a/mm/swap.h b/mm/swap.h index 73f07bcea5f0..331424366487 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -205,6 +205,11 @@ int folio_alloc_swap(struct folio *folio); int folio_dup_swap(struct folio *folio, struct page *subpage); void folio_put_swap(struct folio *folio, struct page *subpage); =20 +/* For internal use */ +extern void swap_entries_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, unsigned int nr_pages); + /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; @@ -256,14 +261,6 @@ static inline bool folio_matches_swap_entry(const stru= ct folio *folio, return folio_entry.val =3D=3D round_down(entry.val, nr_pages); } =20 -/* Temporary internal helpers */ -void __swapcache_set_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry); -void __swapcache_clear_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr); - /* * All swap cache helpers below require the caller to ensure the swap entr= ies * used are valid and stablize the device by any of the following ways: diff --git a/mm/swap_state.c b/mm/swap_state.c index 41d4fa056203..2bf72d58f6ee 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -215,17 +215,6 @@ static 
int swap_cache_add_folio(struct folio *folio, s= wp_entry_t entry, shadow =3D swp_tb_to_shadow(old_tb); offset++; } while (++ci_off < ci_end); - - ci_off =3D ci_start; - offset =3D swp_offset(entry); - do { - /* - * Still need to pin the slots with SWAP_HAS_CACHE since - * swap allocator depends on that. - */ - __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset)); - offset++; - } while (++ci_off < ci_end); __swap_cache_add_folio(ci, folio, entry); swap_cluster_unlock(ci); if (shadowp) @@ -256,6 +245,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *c= i, struct folio *folio, struct swap_info_struct *si; unsigned long old_tb, new_tb; unsigned int ci_start, ci_off, ci_end; + bool folio_swapped =3D false, need_free =3D false; unsigned long nr_pages =3D folio_nr_pages(folio); =20 VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) !=3D ci); @@ -273,13 +263,27 @@ void __swap_cache_del_folio(struct swap_cluster_info = *ci, struct folio *folio, old_tb =3D __swap_table_xchg(ci, ci_off, new_tb); WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) !=3D folio); + if (__swap_count(swp_entry(si->type, + swp_offset(entry) + ci_off - ci_start))) + folio_swapped =3D true; + else + need_free =3D true; } while (++ci_off < ci_end); =20 folio->swap.val =3D 0; folio_clear_swapcache(folio); node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages); lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); - __swapcache_clear_cached(si, ci, entry, nr_pages); + + if (!folio_swapped) { + swap_entries_free(si, ci, swp_offset(entry), nr_pages); + } else if (need_free) { + do { + if (!__swap_count(entry)) + swap_entries_free(si, ci, swp_offset(entry), 1); + entry.val++; + } while (--nr_pages); + } } =20 /** diff --git a/mm/swapfile.c b/mm/swapfile.c index 12a1ab6f7b32..49916fdb8b70 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -48,21 +48,18 @@ #include #include "swap_table.h" #include "internal.h" +#include "swap_table.h" #include "swap.h" =20 static bool 
swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); -static void swap_entries_free(struct swap_info_struct *si, - struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr= ); static void swap_put_entry_locked(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned char usage); + unsigned long offset); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, @@ -149,11 +146,6 @@ static struct swap_info_struct *swap_entry_to_info(swp= _entry_t entry) return swap_type_to_info(swp_type(entry)); } =20 -static inline unsigned char swap_count(unsigned char ent) -{ - return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ -} - /* * Use the second highest bit of inuse_pages counter as the indicator * if one swap device is on the available plist, so the atomic can @@ -185,15 +177,20 @@ static long swap_usage_in_pages(struct swap_info_stru= ct *si) #define TTRS_FULL 0x4 =20 static bool swap_only_has_cache(struct swap_info_struct *si, - unsigned long offset, int nr_pages) + struct swap_cluster_info *ci, + unsigned long offset, int nr_pages) { + unsigned int ci_off =3D offset % SWAPFILE_CLUSTER; unsigned char *map =3D si->swap_map + offset; unsigned char *map_end =3D map + nr_pages; + unsigned long swp_tb; =20 do { - VM_BUG_ON(!(*map & SWAP_HAS_CACHE)); - if (*map !=3D SWAP_HAS_CACHE) + swp_tb =3D __swap_table_get(ci, ci_off); + VM_WARN_ON_ONCE(!swp_tb_is_folio(swp_tb)); + if (*map) return false; + ++ci_off; } while (++map < map_end); =20 return true; @@ -254,7 +251,7 @@ static int __try_to_reclaim_swap(struct swap_info_struc= t *si, * reference or 
pending writeback, and can't be allocated to others. */ ci =3D swap_cluster_lock(si, offset); - need_reclaim =3D swap_only_has_cache(si, offset, nr_pages); + need_reclaim =3D swap_only_has_cache(si, ci, offset, nr_pages); swap_cluster_unlock(ci); if (!need_reclaim) goto out_unlock; @@ -775,7 +772,7 @@ static unsigned int cluster_reclaim_range(struct swap_i= nfo_struct *si, =20 spin_unlock(&ci->lock); do { - if (swap_count(READ_ONCE(map[offset]))) + if (READ_ONCE(map[offset])) break; swp_tb =3D swap_table_get(ci, offset % SWAPFILE_CLUSTER); if (swp_tb_is_folio(swp_tb)) { @@ -800,7 +797,7 @@ static unsigned int cluster_reclaim_range(struct swap_i= nfo_struct *si, */ for (offset =3D start; offset < end; offset++) { swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); - if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb)) + if (map[offset] || !swp_tb_is_null(swp_tb)) return SWAP_ENTRY_INVALID; } =20 @@ -820,11 +817,10 @@ static bool cluster_scan_range(struct swap_info_struc= t *si, return true; =20 do { - if (swap_count(map[offset])) + if (map[offset]) return false; swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); if (swp_tb_is_folio(swp_tb)) { - WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE)); if (!vm_swap_full()) return false; *need_reclaim =3D true; @@ -882,11 +878,6 @@ static bool cluster_alloc_range(struct swap_info_struc= t *si, if (likely(folio)) { order =3D folio_order(folio); nr_pages =3D 1 << order; - /* - * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries. - * This is the legacy allocation behavior, will drop it very soon. 
- */ - memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset)); } else { order =3D 0; @@ -995,8 +986,8 @@ static void swap_reclaim_full_clusters(struct swap_info= _struct *si, bool force) to_scan--; =20 while (offset < end) { - if (!swap_count(READ_ONCE(map[offset])) && - swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { + if (!READ_ONCE(map[offset]) && + swp_tb_is_folio(swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); @@ -1431,8 +1422,8 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, do { swp_tb =3D __swap_table_get(ci, offset % SWAPFILE_CLUSTER); count =3D si->swap_map[offset]; - VM_WARN_ON(swap_count(count) < 1 || count =3D=3D SWAP_MAP_BAD); - if (swap_count(count) =3D=3D 1) { + VM_WARN_ON(count < 1 || count =3D=3D SWAP_MAP_BAD); + if (count =3D=3D 1) { /* count =3D=3D 1 and non-cached slots will be batch freed. */ if (!swp_tb_is_folio(swp_tb)) { if (!batch_start) @@ -1440,7 +1431,6 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, continue; } /* count will be 0 after put, slot can be reclaimed */ - VM_WARN_ON(!(count & SWAP_HAS_CACHE)); need_reclaim =3D true; } /* @@ -1449,7 +1439,7 @@ static void swap_put_entries_cluster(struct swap_info= _struct *si, * slots will be freed when folio is removed from swap cache * (__swap_cache_del_folio). 
*/ - swap_put_entry_locked(si, ci, offset, 1); + swap_put_entry_locked(si, ci, offset); if (batch_start) { swap_entries_free(si, ci, batch_start, offset - batch_start); batch_start =3D SWAP_ENTRY_INVALID; @@ -1602,13 +1592,8 @@ static struct swap_info_struct *_swap_info_get(swp_e= ntry_t entry) offset =3D swp_offset(entry); if (offset >=3D si->max) goto bad_offset; - if (data_race(!si->swap_map[swp_offset(entry)])) - goto bad_free; return si; =20 -bad_free: - pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val); - goto out; bad_offset: pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); goto out; @@ -1623,21 +1608,12 @@ static struct swap_info_struct *_swap_info_get(swp_= entry_t entry) =20 static void swap_put_entry_locked(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long offset, - unsigned char usage) + unsigned long offset) { unsigned char count; - unsigned char has_cache; =20 count =3D si->swap_map[offset]; - - has_cache =3D count & SWAP_HAS_CACHE; - count &=3D ~SWAP_HAS_CACHE; - - if (usage =3D=3D SWAP_HAS_CACHE) { - VM_BUG_ON(!has_cache); - has_cache =3D 0; - } else if ((count & ~COUNT_CONTINUED) <=3D SWAP_MAP_MAX) { + if ((count & ~COUNT_CONTINUED) <=3D SWAP_MAP_MAX) { if (count =3D=3D COUNT_CONTINUED) { if (swap_count_continued(si, offset, count)) count =3D SWAP_MAP_MAX | COUNT_CONTINUED; @@ -1647,10 +1623,8 @@ static void swap_put_entry_locked(struct swap_info_s= truct *si, count--; } =20 - usage =3D count | has_cache; - if (usage) - WRITE_ONCE(si->swap_map[offset], usage); - else + WRITE_ONCE(si->swap_map[offset], count); + if (!count && !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLU= STER))) swap_entries_free(si, ci, offset, 1); } =20 @@ -1719,21 +1693,13 @@ struct swap_info_struct *get_swap_device(swp_entry_= t entry) return NULL; } =20 -/* - * Check if it's the last ref of swap entry in the freeing path. 
- */ -static inline bool __maybe_unused swap_is_last_ref(unsigned char count) -{ - return (count =3D=3D SWAP_HAS_CACHE) || (count =3D=3D 1); -} - /* * Drop the last ref of swap entries, caller have to ensure all entries * belong to the same cgroup and cluster. */ -static void swap_entries_free(struct swap_info_struct *si, - struct swap_cluster_info *ci, - unsigned long offset, unsigned int nr_pages) +void swap_entries_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, unsigned int nr_pages) { swp_entry_t entry =3D swp_entry(si->type, offset); unsigned char *map =3D si->swap_map + offset; @@ -1746,7 +1712,7 @@ static void swap_entries_free(struct swap_info_struct= *si, =20 ci->count -=3D nr_pages; do { - VM_BUG_ON(!swap_is_last_ref(*map)); + VM_WARN_ON(*map > 1); *map =3D 0; } while (++map < map_end); =20 @@ -1765,7 +1731,7 @@ int __swap_count(swp_entry_t entry) struct swap_info_struct *si =3D __swap_entry_to_info(entry); pgoff_t offset =3D swp_offset(entry); =20 - return swap_count(si->swap_map[offset]); + return si->swap_map[offset]; } =20 /** @@ -1779,7 +1745,7 @@ bool swap_entry_swapped(struct swap_info_struct *si, = unsigned long offset) int count; =20 ci =3D swap_cluster_lock(si, offset); - count =3D swap_count(si->swap_map[offset]); + count =3D si->swap_map[offset]; swap_cluster_unlock(ci); =20 return count && count !=3D SWAP_MAP_BAD; @@ -1806,7 +1772,7 @@ int swp_swapcount(swp_entry_t entry) =20 ci =3D swap_cluster_lock(si, offset); =20 - count =3D swap_count(si->swap_map[offset]); + count =3D si->swap_map[offset]; if (!(count & COUNT_CONTINUED)) goto out; =20 @@ -1844,12 +1810,12 @@ static bool swap_page_trans_huge_swapped(struct swa= p_info_struct *si, =20 ci =3D swap_cluster_lock(si, offset); if (nr_pages =3D=3D 1) { - if (swap_count(map[roffset])) + if (map[roffset]) ret =3D true; goto unlock_out; } for (i =3D 0; i < nr_pages; i++) { - if (swap_count(map[offset + i])) { + if (map[offset + i]) { ret =3D true; break; 
} @@ -2005,7 +1971,7 @@ void swap_free_hibernation_slot(swp_entry_t entry) return; =20 ci =3D swap_cluster_lock(si, offset); - swap_put_entry_locked(si, ci, offset, 1); + swap_put_entry_locked(si, ci, offset); WARN_ON(swap_entry_swapped(si, offset)); swap_cluster_unlock(ci); =20 @@ -2412,6 +2378,7 @@ static unsigned int find_next_to_unuse(struct swap_in= fo_struct *si, unsigned int prev) { unsigned int i; + unsigned long swp_tb; unsigned char count; =20 /* @@ -2422,7 +2389,11 @@ static unsigned int find_next_to_unuse(struct swap_i= nfo_struct *si, */ for (i =3D prev + 1; i < si->max; i++) { count =3D READ_ONCE(si->swap_map[i]); - if (count && swap_count(count) !=3D SWAP_MAP_BAD) + swp_tb =3D swap_table_get(__swap_offset_to_cluster(si, i), + i % SWAPFILE_CLUSTER); + if (count =3D=3D SWAP_MAP_BAD) + continue; + if (count || swp_tb_is_folio(swp_tb)) break; if ((i % LATENCY_LIMIT) =3D=3D 0) cond_resched(); @@ -3649,39 +3620,26 @@ static int swap_dup_entries(struct swap_info_struct= *si, unsigned char usage, int nr) { int i; - unsigned char count, has_cache; + unsigned char count; =20 for (i =3D 0; i < nr; i++) { count =3D si->swap_map[offset + i]; - /* * Allocator never allocates bad slots, and readahead is guarded * by swap_entry_swapped. */ - if (WARN_ON(swap_count(count) =3D=3D SWAP_MAP_BAD)) - return -ENOENT; - - has_cache =3D count & SWAP_HAS_CACHE; - count &=3D ~SWAP_HAS_CACHE; - - if (!count && !has_cache) { - return -ENOENT; - } else if (usage =3D=3D SWAP_HAS_CACHE) { - if (has_cache) - return -EEXIST; - } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { - return -EINVAL; - } + VM_WARN_ON(count =3D=3D SWAP_MAP_BAD); + /* + * Swap count duplication is guranteed by either locked swap cache + * folio (folio_dup_swap) or external lock (swap_dup_entry_direct). 
+ */ + VM_WARN_ON(!count && + !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))); } =20 for (i =3D 0; i < nr; i++) { count =3D si->swap_map[offset + i]; - has_cache =3D count & SWAP_HAS_CACHE; - count &=3D ~SWAP_HAS_CACHE; - - if (usage =3D=3D SWAP_HAS_CACHE) - has_cache =3D SWAP_HAS_CACHE; - else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) count +=3D usage; else if (swap_count_continued(si, offset + i, count)) count =3D COUNT_CONTINUED; @@ -3693,7 +3651,7 @@ static int swap_dup_entries(struct swap_info_struct *= si, return -ENOMEM; } =20 - WRITE_ONCE(si->swap_map[offset + i], count | has_cache); + WRITE_ONCE(si->swap_map[offset + i], count); } =20 return 0; @@ -3739,27 +3697,6 @@ int swap_dup_entry_direct(swp_entry_t entry) return err; } =20 -/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */ -void __swapcache_set_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry) -{ - WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1)); -} - -/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock = */ -void __swapcache_clear_cached(struct swap_info_struct *si, - struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr) -{ - if (swap_only_has_cache(si, swp_offset(entry), nr)) { - swap_entries_free(si, ci, swp_offset(entry), nr); - } else { - for (int i =3D 0; i < nr; i++, entry.val++) - swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE); - } -} - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entr= y's @@ -3805,7 +3742,7 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) =20 ci =3D swap_cluster_lock(si, offset); =20 - count =3D swap_count(si->swap_map[offset]); + count =3D si->swap_map[offset]; =20 if ((count & ~COUNT_CONTINUED) !=3D SWAP_MAP_MAX) { /* --=20 2.51.1 From nobody 
Sun Feb 8 17:03:32 2026
From: Kairui Song
Date: Wed, 29 Oct 2025 23:58:45 +0800
Subject: [PATCH 19/19] mm, swap: remove no longer needed _swap_info_get
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Message-Id: <20251029-swap-table-p2-v1-19-3d43f3b6ec32@tencent.com>
References: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
In-Reply-To: <20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
 Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
 Hugh Dickins, Baolin Wang, "Huang, Ying", Kemeng Shi, Lorenzo Stoakes,
 "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song
X-Mailer: b4 0.14.3

From: Kairui Song

There are now only two users of _swap_info_get after consolidating these
callers: folio_try_reclaim_swap and swp_swapcount.

folio_free_swap already holds the folio lock, and the folio is in the
swap cache, so _swap_info_get is redundant there.

swp_swapcount can simply use get_swap_device instead. It only wants to
check the swap count, so either helper works; the difference is that
get_swap_device also takes a reference on the device, which is actually
a bit safer. The only current user is smap walking, and the performance
change here is tiny.

After these changes, _swap_info_get is no longer used, so we can safely
remove it.

Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/swapfile.c | 39 ++++++---------------------------------
 1 file changed, 6 insertions(+), 33 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 49916fdb8b70..150916f4640c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1577,35 +1577,6 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
 	swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
 }
 
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	unsigned long offset;
-
-	if (!entry.val)
-		goto out;
-	si = swap_entry_to_info(entry);
-	if (!si)
-		goto bad_nofile;
-	if (data_race(!(si->flags & SWP_USED)))
-		goto bad_device;
-	offset = swp_offset(entry);
-	if (offset >= si->max)
-		goto bad_offset;
-	return si;
-
-bad_offset:
-	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
-	goto out;
-bad_device:
-	pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val);
-	goto out;
-bad_nofile:
-	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
-out:
-	return NULL;
-}
-
 static void swap_put_entry_locked(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci,
 				  unsigned long offset)
@@ -1764,7 +1735,7 @@ int swp_swapcount(swp_entry_t entry)
 	pgoff_t offset;
 	unsigned char *map;
 
-	si = _swap_info_get(entry);
+	si = get_swap_device(entry);
 	if (!si)
 		return 0;
 
@@ -1794,6 +1765,7 @@ int swp_swapcount(swp_entry_t entry)
 	} while (tmp_count & COUNT_CONTINUED);
 out:
 	swap_cluster_unlock(ci);
+	put_swap_device(si);
 	return count;
 }
 
@@ -1828,11 +1800,12 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 static bool folio_swapped(struct folio *folio)
 {
 	swp_entry_t entry = folio->swap;
-	struct swap_info_struct *si = _swap_info_get(entry);
+	struct swap_info_struct *si;
 
-	if (!si)
-		return false;
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 
+	si = __swap_entry_to_info(entry);
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_entry_swapped(si, swp_offset(entry));
 
-- 
2.51.1
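
As an aside for readers following the swp_swapcount hunk above: the change
swaps a plain lookup for a get/put pair, so the device is pinned while its
swap count is read. Below is a minimal userspace sketch of that shape, not
kernel code. Everything prefixed mock_ is hypothetical and invented for
illustration; the real get_swap_device and put_swap_device have their own
implementation in mm/swapfile.c, which this does not attempt to reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a swap device; not the kernel struct. */
struct mock_swap_device {
	bool live;		/* models "device is still in use" */
	unsigned long max;	/* number of valid offsets */
	int refs;		/* models the device reference count */
	unsigned char *map;	/* per-offset usage count, like swap_map */
};

/* Take a reference if the device is live and the offset is valid. */
static struct mock_swap_device *mock_get_device(struct mock_swap_device *dev,
						unsigned long offset)
{
	if (!dev || !dev->live || offset >= dev->max)
		return NULL;
	dev->refs++;
	return dev;
}

/* Drop a reference taken by mock_get_device(). */
static void mock_put_device(struct mock_swap_device *dev)
{
	assert(dev->refs > 0);
	dev->refs--;
}

/*
 * Read one usage count while holding a reference, mirroring the
 * get ... put shape that swp_swapcount has after this patch.
 */
static int mock_swapcount(struct mock_swap_device *dev, unsigned long offset)
{
	struct mock_swap_device *si = mock_get_device(dev, offset);
	int count;

	if (!si)
		return 0;	/* same "nothing to report" result as the patch */
	count = si->map[offset];
	mock_put_device(si);	/* every successful get needs a matching put */
	return count;
}
```

With a live device whose map holds 3 at offset 1, mock_swapcount(&dev, 1)
returns 3 and leaves refs back at 0. An out-of-range offset, or a device
marked not live, returns 0 without touching the reference count.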