From nobody Fri Oct 3 23:02:52 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
    Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
    Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
    Zi Yan, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
Date: Sat, 23 Aug 2025 03:20:15 +0800
Message-ID: <20250822192023.13477-2-ryncsn@gmail.com>
In-Reply-To: <20250822192023.13477-1-ryncsn@gmail.com>
References: <20250822192023.13477-1-ryncsn@gmail.com>

From: Kairui Song

Always use swap_cache_get_folio() for swap cache folio lookup. The reason
it is not used everywhere today is that it also updates the readahead
info, and some callsites want to avoid that.

So decouple the readahead update from the swap cache lookup and move it
into a standalone helper; callers that need it can invoke the readahead
helper explicitly. Then convert all swap cache lookups to use
swap_cache_get_folio().

After this commit, only three special cases still access the swap cache
space directly: huge memory splitting, migration, and shmem replacement,
because they need to lock the Xarray. Following commits will wrap their
swap cache accesses with dedicated helpers too.
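To illustrate the new convention, the swapin fault path after this change
does roughly the following (simplified from the do_swap_page() hunk below,
error handling omitted):

	folio = swap_cache_get_folio(entry);
	if (folio) {
		/* Readahead accounting is now an explicit, separate step */
		swap_update_readahead(folio, vma, vmf->address);
		page = folio_file_page(folio, swp_offset(entry));
	}

Callers that do not want to touch the readahead statistics simply skip the
swap_update_readahead() call.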
Signed-off-by: Kairui Song Acked-by: Chris Li Acked-by: Nhat Pham Reviewed-by: Baolin Wang Reviewed-by: Barry Song --- mm/memory.c | 6 ++- mm/mincore.c | 3 +- mm/shmem.c | 4 +- mm/swap.h | 13 +++++-- mm/swap_state.c | 99 +++++++++++++++++++++++------------------------- mm/swapfile.c | 11 +++--- mm/userfaultfd.c | 5 +-- 7 files changed, 72 insertions(+), 69 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index d9de6c056179..10ef528a5f44 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4660,9 +4660,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (unlikely(!si)) goto out; =20 - folio =3D swap_cache_get_folio(entry, vma, vmf->address); - if (folio) + folio =3D swap_cache_get_folio(entry); + if (folio) { + swap_update_readahead(folio, vma, vmf->address); page =3D folio_file_page(folio, swp_offset(entry)); + } swapcache =3D folio; =20 if (!folio) { diff --git a/mm/mincore.c b/mm/mincore.c index 2f3e1816a30d..8ec4719370e1 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -76,8 +76,7 @@ static unsigned char mincore_swap(swp_entry_t entry, bool= shmem) if (!si) return 0; } - folio =3D filemap_get_entry(swap_address_space(entry), - swap_cache_index(entry)); + folio =3D swap_cache_get_folio(entry); if (shmem) put_swap_device(si); /* The swap cache space contains either folio, shadow or NULL */ diff --git a/mm/shmem.c b/mm/shmem.c index 13cc51df3893..e9d0d2784cd5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2354,7 +2354,7 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, } =20 /* Look it up and read it in.. */ - folio =3D swap_cache_get_folio(swap, NULL, 0); + folio =3D swap_cache_get_folio(swap); if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { /* Direct swapin skipping swap cache & readahead */ @@ -2379,6 +2379,8 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, count_vm_event(PGMAJFAULT); count_memcg_event_mm(fault_mm, PGMAJFAULT); } + } else { + swap_update_readahead(folio, NULL, 0); } =20 if (order > folio_order(folio)) { diff --git a/mm/swap.h b/mm/swap.h index 1ae44d4193b1..efb6d7ff9f30 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -62,8 +62,7 @@ void delete_from_swap_cache(struct folio *folio); void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end); void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int n= r); -struct folio *swap_cache_get_folio(swp_entry_t entry, - struct vm_area_struct *vma, unsigned long addr); +struct folio *swap_cache_get_folio(swp_entry_t entry); struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, struct vm_area_struct *vma, unsigned long addr, struct swap_iocb **plug); @@ -74,6 +73,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, g= fp_t flag, struct mempolicy *mpol, pgoff_t ilx); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf); +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, + unsigned long addr); =20 static inline unsigned int folio_swap_flags(struct folio *folio) { @@ -159,6 +160,11 @@ static inline struct folio *swapin_readahead(swp_entry= _t swp, gfp_t gfp_mask, return NULL; } =20 +static inline void swap_update_readahead(struct folio *folio, + struct vm_area_struct *vma, unsigned long addr) +{ +} + static inline int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug) { @@ -169,8 +175,7 @@ static inline void swapcache_clear(struct swap_info_str= uct *si, swp_entry_t entr { } =20 -static inline struct folio *swap_cache_get_folio(swp_entry_t entry, - 
struct vm_area_struct *vma, unsigned long addr) +static inline struct folio *swap_cache_get_folio(swp_entry_t entry) { return NULL; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 99513b74b5d8..ff9eb761a103 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -69,6 +69,21 @@ void show_swap_cache_info(void) printk("Total swap =3D %lukB\n", K(total_swap_pages)); } =20 +/* + * Lookup a swap entry in the swap cache. A found folio will be returned + * unlocked and with its refcount incremented. + * + * Caller must lock the swap device or hold a reference to keep it valid. + */ +struct folio *swap_cache_get_folio(swp_entry_t entry) +{ + struct folio *folio =3D filemap_get_folio(swap_address_space(entry), + swap_cache_index(entry)); + if (!IS_ERR(folio)) + return folio; + return NULL; +} + void *get_shadow_from_swap_cache(swp_entry_t entry) { struct address_space *address_space =3D swap_address_space(entry); @@ -273,54 +288,40 @@ static inline bool swap_use_vma_readahead(void) } =20 /* - * Lookup a swap entry in the swap cache. A found folio will be returned - * unlocked and with its refcount incremented - we rely on the kernel - * lock getting page table operations atomic even if we drop the folio - * lock before returning. - * - * Caller must lock the swap device or hold a reference to keep it valid. + * Update the readahead statistics of a vma or globally. */ -struct folio *swap_cache_get_folio(swp_entry_t entry, - struct vm_area_struct *vma, unsigned long addr) +void swap_update_readahead(struct folio *folio, + struct vm_area_struct *vma, + unsigned long addr) { - struct folio *folio; - - folio =3D filemap_get_folio(swap_address_space(entry), swap_cache_index(e= ntry)); - if (!IS_ERR(folio)) { - bool vma_ra =3D swap_use_vma_readahead(); - bool readahead; + bool readahead, vma_ra =3D swap_use_vma_readahead(); =20 - /* - * At the moment, we don't support PG_readahead for anon THP - * so let's bail out rather than confusing the readahead stat. - */ - if (unlikely(folio_test_large(folio))) - return folio; - - readahead =3D folio_test_clear_readahead(folio); - if (vma && vma_ra) { - unsigned long ra_val; - int win, hits; - - ra_val =3D GET_SWAP_RA_VAL(vma); - win =3D SWAP_RA_WIN(ra_val); - hits =3D SWAP_RA_HITS(ra_val); - if (readahead) - hits =3D min_t(int, hits + 1, SWAP_RA_HITS_MAX); - atomic_long_set(&vma->swap_readahead_info, - SWAP_RA_VAL(addr, win, hits)); - } - - if (readahead) { - count_vm_event(SWAP_RA_HIT); - if (!vma || !vma_ra) - atomic_inc(&swapin_readahead_hits); - } - } else { - folio =3D NULL; + /* + * At the moment, we don't support PG_readahead for anon THP + * so let's bail out rather than confusing the readahead stat. + */ + if (unlikely(folio_test_large(folio))) + return; + + readahead =3D folio_test_clear_readahead(folio); + if (vma && vma_ra) { + unsigned long ra_val; + int win, hits; + + ra_val =3D GET_SWAP_RA_VAL(vma); + win =3D SWAP_RA_WIN(ra_val); + hits =3D SWAP_RA_HITS(ra_val); + if (readahead) + hits =3D min_t(int, hits + 1, SWAP_RA_HITS_MAX); + atomic_long_set(&vma->swap_readahead_info, + SWAP_RA_VAL(addr, win, hits)); } =20 - return folio; + if (readahead) { + count_vm_event(SWAP_RA_HIT); + if (!vma || !vma_ra) + atomic_inc(&swapin_readahead_hits); + } } =20 struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, @@ -336,14 +337,10 @@ struct folio *__read_swap_cache_async(swp_entry_t ent= ry, gfp_t gfp_mask, *new_page_allocated =3D false; for (;;) { int err; - /* - * First check the swap cache. 
Since this is normally - * called after swap_cache_get_folio() failed, re-calling - * that would confuse statistics. - */ - folio =3D filemap_get_folio(swap_address_space(entry), - swap_cache_index(entry)); - if (!IS_ERR(folio)) + + /* Check the swap cache in case the folio is already there */ + folio =3D swap_cache_get_folio(entry); + if (folio) goto got_folio; =20 /* diff --git a/mm/swapfile.c b/mm/swapfile.c index a7ffabbe65ef..4b8ab2cb49ca 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -213,15 +213,14 @@ static int __try_to_reclaim_swap(struct swap_info_str= uct *si, unsigned long offset, unsigned long flags) { swp_entry_t entry =3D swp_entry(si->type, offset); - struct address_space *address_space =3D swap_address_space(entry); struct swap_cluster_info *ci; struct folio *folio; int ret, nr_pages; bool need_reclaim; =20 again: - folio =3D filemap_get_folio(address_space, swap_cache_index(entry)); - if (IS_ERR(folio)) + folio =3D swap_cache_get_folio(entry); + if (!folio) return 0; =20 nr_pages =3D folio_nr_pages(folio); @@ -2131,7 +2130,7 @@ static int unuse_pte_range(struct vm_area_struct *vma= , pmd_t *pmd, pte_unmap(pte); pte =3D NULL; =20 - folio =3D swap_cache_get_folio(entry, vma, addr); + folio =3D swap_cache_get_folio(entry); if (!folio) { struct vm_fault vmf =3D { .vma =3D vma, @@ -2357,8 +2356,8 @@ static int try_to_unuse(unsigned int type) (i =3D find_next_to_unuse(si, i)) !=3D 0) { =20 entry =3D swp_entry(type, i); - folio =3D filemap_get_folio(swap_address_space(entry), swap_cache_index(= entry)); - if (IS_ERR(folio)) + folio =3D swap_cache_get_folio(entry); + if (!folio) continue; =20 /* diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 50aaa8dcd24c..af61b95c89e4 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1489,9 +1489,8 @@ static long move_pages_ptes(struct mm_struct *mm, pmd= _t *dst_pmd, pmd_t *src_pmd * separately to allow proper handling. 
 	 */
 	if (!src_folio)
-		folio = filemap_get_folio(swap_address_space(entry),
-				swap_cache_index(entry));
-	if (!IS_ERR_OR_NULL(folio)) {
+		folio = swap_cache_get_folio(entry);
+	if (folio) {
 		if (folio_test_large(folio)) {
 			ret = -EBUSY;
 			folio_put(folio);
-- 
2.51.0

From nobody Fri Oct 3 23:02:52 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
    Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
    Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
    Zi Yan, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
Date: Sat, 23 Aug 2025 03:20:16 +0800
Message-ID: <20250822192023.13477-3-ryncsn@gmail.com>
In-Reply-To: <20250822192023.13477-1-ryncsn@gmail.com>
References: <20250822192023.13477-1-ryncsn@gmail.com>

From: Kairui Song

Swap cache lookup is lockless; it only increases the reference count of
the returned folio. That is not enough to ensure a folio is stable in the
swap cache, so the folio could be removed from the swap cache at any time.
The caller always has to lock and check the folio before use.

Document this as a comment, and introduce a helper for swap cache folio
verification with proper sanity checks.

Also sanitize all current users to follow this convention, and use the new
helper where possible for easier debugging. Some existing callers won't
cause any major problem right now, only trivial issues like an incorrect
readahead statistic (swapin) or a wasted loop (swapoff), but it is better
to always follow this convention to make things robust.
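The resulting usage pattern can be sketched as follows (a simplified
sketch based on the unuse_pte_range() and do_swap_page() hunks below; the
real callers retry or continue instead of bailing out):

	folio = swap_cache_get_folio(entry);
	if (!folio)
		return;
	folio_lock(folio);
	if (!folio_contains_swap(folio, entry)) {
		/*
		 * The lockless lookup only pinned the folio; it may have
		 * been freed or reused for another entry since then, so
		 * re-check it under the folio lock before using it.
		 */
		folio_unlock(folio);
		folio_put(folio);
		return;
	}
	/* The folio is locked and still backs @entry, safe to use. */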
Signed-off-by: Kairui Song --- mm/memory.c | 28 +++++++++++++--------------- mm/shmem.c | 4 ++-- mm/swap.h | 28 ++++++++++++++++++++++++++++ mm/swap_state.c | 13 +++++++++---- mm/swapfile.c | 10 ++++++++-- 5 files changed, 60 insertions(+), 23 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 10ef528a5f44..9ca8e1873c6e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4661,12 +4661,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out; =20 folio =3D swap_cache_get_folio(entry); - if (folio) { - swap_update_readahead(folio, vma, vmf->address); - page =3D folio_file_page(folio, swp_offset(entry)); - } swapcache =3D folio; - if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) =3D=3D 1) { @@ -4735,20 +4730,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) ret =3D VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); - page =3D folio_file_page(folio, swp_offset(entry)); - } else if (PageHWPoison(page)) { - /* - * hwpoisoned dirty swapcache pages are kept for killing - * owner processes (which may be unknown at hwpoison time) - */ - ret =3D VM_FAULT_HWPOISON; - goto out_release; } =20 ret |=3D folio_lock_or_retry(folio, vmf); if (ret & VM_FAULT_RETRY) goto out_release; =20 + page =3D folio_file_page(folio, swp_offset(entry)); if (swapcache) { /* * Make sure folio_free_swap() or swapoff did not release the @@ -4757,10 +4745,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * swapcache, we need to check that the page's swap has not * changed. */ - if (unlikely(!folio_test_swapcache(folio) || - page_swap_entry(page).val !=3D entry.val)) + if (!folio_contains_swap(folio, entry)) goto out_page; =20 + if (PageHWPoison(page)) { + /* + * hwpoisoned dirty swapcache pages are kept for killing + * owner processes (which may be unknown at hwpoison time) + */ + ret =3D VM_FAULT_HWPOISON; + goto out_page; + } + + swap_update_readahead(folio, vma, vmf->address); + /* * KSM sometimes has to copy on read faults, for example, if * folio->index of non-ksm folios would be nonlinear inside the diff --git a/mm/shmem.c b/mm/shmem.c index e9d0d2784cd5..b4d39f2a1e0a 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, count_vm_event(PGMAJFAULT); count_memcg_event_mm(fault_mm, PGMAJFAULT); } - } else { - swap_update_readahead(folio, NULL, 0); } =20 if (order > folio_order(folio)) { @@ -2431,6 +2429,8 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, error =3D -EIO; goto failed; } + if (!skip_swapcache) + swap_update_readahead(folio, NULL, 0); folio_wait_writeback(folio); nr_pages =3D folio_nr_pages(folio); =20 diff --git a/mm/swap.h b/mm/swap.h index efb6d7ff9f30..bb2adbfd64a9 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -52,6 +52,29 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry) return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK; } =20 +/** + * folio_contains_swap - Does this folio contain this swap entry? + * @folio: The folio. + * @entry: The swap entry to check against. + * + * Swap version of folio_contains() + * + * Context: The caller should have the folio locked to ensure + * nothing will move it out of the swap cache. + * Return: true or false. 
+ */ +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t en= try) +{ + pgoff_t offset =3D swp_offset(entry); + + VM_WARN_ON_ONCE(!folio_test_locked(folio)); + if (unlikely(!folio_test_swapcache(folio))) + return false; + if (unlikely(swp_type(entry) !=3D swp_type(folio->swap))) + return false; + return offset - swp_offset(folio->swap) < folio_nr_pages(folio); +} + void show_swap_cache_info(void); void *get_shadow_from_swap_cache(swp_entry_t entry); int add_to_swap_cache(struct folio *folio, swp_entry_t entry, @@ -144,6 +167,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t ent= ry) return 0; } =20 +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t en= try) +{ + return false; +} + static inline void show_swap_cache_info(void) { } diff --git a/mm/swap_state.c b/mm/swap_state.c index ff9eb761a103..be0d96494dc1 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -70,10 +70,12 @@ void show_swap_cache_info(void) } =20 /* - * Lookup a swap entry in the swap cache. A found folio will be returned - * unlocked and with its refcount incremented. + * swap_cache_get_folio - Lookup a swap entry in the swap cache. * - * Caller must lock the swap device or hold a reference to keep it valid. + * A found folio will be returned unlocked and with its refcount increased. + * + * Context: Caller must ensure @entry is valid and pin the swap device, al= so + * check the returned folio after locking it (e.g. folio_swap_contains). */ struct folio *swap_cache_get_folio(swp_entry_t entry) { @@ -338,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entr= y, gfp_t gfp_mask, for (;;) { int err; =20 - /* Check the swap cache in case the folio is already there */ + /* + * Check the swap cache first, if a cached folio is found, + * return it unlocked. The caller will lock and check it. + */ folio =3D swap_cache_get_folio(entry); if (folio) goto got_folio; diff --git a/mm/swapfile.c b/mm/swapfile.c index 4b8ab2cb49ca..12f2580ebe8d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_str= uct *si, * Offset could point to the middle of a large folio, or folio * may no longer point to the expected offset before it's locked. 
 	 */
-	entry = folio->swap;
-	if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
+	if (!folio_contains_swap(folio, entry)) {
 		folio_unlock(folio);
 		folio_put(folio);
 		goto again;
 	}
+	entry = folio->swap;
 	offset = swp_offset(entry);
 
 	need_reclaim = ((flags & TTRS_ANYWAY) ||
@@ -2150,6 +2150,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		}
 
 		folio_lock(folio);
+		if (!folio_contains_swap(folio, entry)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			continue;
+		}
+
 		folio_wait_writeback(folio);
 		ret = unuse_pte(vma, pmd, addr, entry, folio);
 		if (ret < 0) {
-- 
2.51.0

From nobody Fri Oct 3 23:02:52 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
    Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
    Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
    Zi Yan, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers
Date: Sat, 23 Aug 2025 03:20:17 +0800
Message-ID: <20250822192023.13477-4-ryncsn@gmail.com>
In-Reply-To: <20250822192023.13477-1-ryncsn@gmail.com>
References: <20250822192023.13477-1-ryncsn@gmail.com>

From: Kairui Song

No feature change. Move the cluster-related definitions and helpers to
mm/swap.h, tidy them up, and add a "swap_" prefix to the cluster
lock/unlock helpers so they can be used outside of the swap files.

Signed-off-by: Kairui Song
Acked-by: Chris Li
Acked-by: David Hildenbrand
Reviewed-by: Barry Song
---
 include/linux/swap.h | 34 ---------------
 mm/swap.h            | 63 ++++++++++++++++++++++++++++
 mm/swapfile.c        | 99 ++++++++++++++------------------------------
 3 files changed, 93 insertions(+), 103 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c2da85cb7fe7..20efd9a34034 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -235,40 +235,6 @@ enum {
 /* Special value in each swap_map continuation */
 #define SWAP_CONT_MAX	0x7f	/* Max count */
 
-/*
- * We use this to track usage of a cluster. A cluster is a block of swap disk
- * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
- * free clusters are organized into a list. We fetch an entry from the list to
- * get a free cluster.
- *
- * The flags field determines if a cluster is free. This is
- * protected by cluster lock.
- */ -struct swap_cluster_info { - spinlock_t lock; /* - * Protect swap_cluster_info fields - * other than list, and swap_info_struct->swap_map - * elements corresponding to the swap cluster. - */ - u16 count; - u8 flags; - u8 order; - struct list_head list; -}; - -/* All on-list cluster must have a non-zero flag. */ -enum swap_cluster_flags { - CLUSTER_FLAG_NONE =3D 0, /* For temporary off-list cluster */ - CLUSTER_FLAG_FREE, - CLUSTER_FLAG_NONFULL, - CLUSTER_FLAG_FRAG, - /* Clusters with flags above are allocatable */ - CLUSTER_FLAG_USABLE =3D CLUSTER_FLAG_FRAG, - CLUSTER_FLAG_FULL, - CLUSTER_FLAG_DISCARD, - CLUSTER_FLAG_MAX, -}; - /* * The first page in the swap file is the swap header, which is always mar= ked * bad to prevent it from being allocated as an entry. This also prevents = the diff --git a/mm/swap.h b/mm/swap.h index bb2adbfd64a9..223b40f2d37e 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -7,10 +7,73 @@ struct swap_iocb; =20 extern int page_cluster; =20 +#ifdef CONFIG_THP_SWAP +#define SWAPFILE_CLUSTER HPAGE_PMD_NR +#define swap_entry_order(order) (order) +#else +#define SWAPFILE_CLUSTER 256 +#define swap_entry_order(order) 0 +#endif + +/* + * We use this to track usage of a cluster. A cluster is a block of swap d= isk + * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All + * free clusters are organized into a list. We fetch an entry from the lis= t to + * get a free cluster. + * + * The flags field determines if a cluster is free. This is + * protected by cluster lock. + */ +struct swap_cluster_info { + spinlock_t lock; /* + * Protect swap_cluster_info fields + * other than list, and swap_info_struct->swap_map + * elements corresponding to the swap cluster. + */ + u16 count; + u8 flags; + u8 order; + struct list_head list; +}; + +/* All on-list cluster must have a non-zero flag. 
*/ +enum swap_cluster_flags { + CLUSTER_FLAG_NONE =3D 0, /* For temporary off-list cluster */ + CLUSTER_FLAG_FREE, + CLUSTER_FLAG_NONFULL, + CLUSTER_FLAG_FRAG, + /* Clusters with flags above are allocatable */ + CLUSTER_FLAG_USABLE =3D CLUSTER_FLAG_FRAG, + CLUSTER_FLAG_FULL, + CLUSTER_FLAG_DISCARD, + CLUSTER_FLAG_MAX, +}; + #ifdef CONFIG_SWAP #include /* for swp_offset */ #include /* for bio_end_io_t */ =20 +static inline struct swap_cluster_info *swp_offset_cluster( + struct swap_info_struct *si, pgoff_t offset) +{ + return &si->cluster_info[offset / SWAPFILE_CLUSTER]; +} + +static inline struct swap_cluster_info *swap_cluster_lock( + struct swap_info_struct *si, + unsigned long offset) +{ + struct swap_cluster_info *ci =3D swp_offset_cluster(si, offset); + + spin_lock(&ci->lock); + return ci; +} + +static inline void swap_cluster_unlock(struct swap_cluster_info *ci) +{ + spin_unlock(&ci->lock); +} + /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; diff --git a/mm/swapfile.c b/mm/swapfile.c index 12f2580ebe8d..618cf4333a3d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -58,9 +58,6 @@ static void swap_entries_free(struct swap_info_struct *si, static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static bool folio_swapcache_freeable(struct folio *folio); -static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, - unsigned long offset); -static inline void unlock_cluster(struct swap_cluster_info *ci); =20 static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; @@ -259,9 +256,9 @@ static int __try_to_reclaim_swap(struct swap_info_struc= t *si, * swap_map is HAS_CACHE only, which means the slots have no page table * reference or pending writeback, and can't be allocated to others. 
*/ - ci =3D lock_cluster(si, offset); + ci =3D swap_cluster_lock(si, offset); need_reclaim =3D swap_only_has_cache(si, offset, nr_pages); - unlock_cluster(ci); + swap_cluster_unlock(ci); if (!need_reclaim) goto out_unlock; =20 @@ -386,20 +383,7 @@ static void discard_swap_cluster(struct swap_info_stru= ct *si, } } =20 -#ifdef CONFIG_THP_SWAP -#define SWAPFILE_CLUSTER HPAGE_PMD_NR - -#define swap_entry_order(order) (order) -#else -#define SWAPFILE_CLUSTER 256 - -/* - * Define swap_entry_order() as constant to let compiler to optimize - * out some code if !CONFIG_THP_SWAP - */ -#define swap_entry_order(order) 0 -#endif -#define LATENCY_LIMIT 256 +#define LATENCY_LIMIT 256 =20 static inline bool cluster_is_empty(struct swap_cluster_info *info) { @@ -426,34 +410,12 @@ static inline unsigned int cluster_index(struct swap_= info_struct *si, return ci - si->cluster_info; } =20 -static inline struct swap_cluster_info *offset_to_cluster(struct swap_info= _struct *si, - unsigned long offset) -{ - return &si->cluster_info[offset / SWAPFILE_CLUSTER]; -} - static inline unsigned int cluster_offset(struct swap_info_struct *si, struct swap_cluster_info *ci) { return cluster_index(si, ci) * SWAPFILE_CLUSTER; } =20 -static inline struct swap_cluster_info *lock_cluster(struct swap_info_stru= ct *si, - unsigned long offset) -{ - struct swap_cluster_info *ci; - - ci =3D offset_to_cluster(si, offset); - spin_lock(&ci->lock); - - return ci; -} - -static inline void unlock_cluster(struct swap_cluster_info *ci) -{ - spin_unlock(&ci->lock); -} - static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, enum swap_cluster_flags new_flags) @@ -809,7 +771,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, } out: relocate_cluster(si, ci); - unlock_cluster(ci); + swap_cluster_unlock(ci); if (si->flags & SWP_SOLIDSTATE) { this_cpu_write(percpu_swap_cluster.offset[order], next); this_cpu_write(percpu_swap_cluster.si[order], si); @@ -876,7 +838,7 @@ static void swap_reclaim_full_clusters(struct swap_info= _struct *si, bool force) if (ci->flags =3D=3D CLUSTER_FLAG_NONE) relocate_cluster(si, ci); =20 - unlock_cluster(ci); + swap_cluster_unlock(ci); if (to_scan <=3D 0) break; } @@ -915,7 +877,7 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o if (offset =3D=3D SWAP_ENTRY_INVALID) goto new_cluster; =20 - ci =3D lock_cluster(si, offset); + ci =3D swap_cluster_lock(si, offset); /* Cluster could have been used by another order */ if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) @@ -923,7 +885,7 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o found =3D alloc_swap_scan_cluster(si, ci, offset, order, usage); } else { - unlock_cluster(ci); + swap_cluster_unlock(ci); } if (found) goto done; @@ -1204,7 +1166,7 @@ static bool swap_alloc_fast(swp_entry_t *entry, if (!si || !offset || !get_swap_device_info(si)) return false; =20 - ci =3D lock_cluster(si, offset); + ci =3D swap_cluster_lock(si, offset); if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); @@ -1212,7 +1174,7 @@ static bool swap_alloc_fast(swp_entry_t *entry, if (found) *entry =3D swp_entry(si->type, found); } else { - unlock_cluster(ci); + swap_cluster_unlock(ci); } =20 put_swap_device(si); @@ -1480,14 +1442,14 @@ static void swap_entries_put_cache(struct swap_info= _struct *si, unsigned long offset =3D swp_offset(entry); struct swap_cluster_info *ci; =20 - ci =3D 
lock_cluster(si, offset); - if (swap_only_has_cache(si, offset, nr)) + ci =3D swap_cluster_lock(si, offset); + if (swap_only_has_cache(si, offset, nr)) { swap_entries_free(si, ci, entry, nr); - else { + } else { for (int i =3D 0; i < nr; i++, entry.val++) swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE); } - unlock_cluster(ci); + swap_cluster_unlock(ci); } =20 static bool swap_entries_put_map(struct swap_info_struct *si, @@ -1505,7 +1467,7 @@ static bool swap_entries_put_map(struct swap_info_str= uct *si, if (count !=3D 1 && count !=3D SWAP_MAP_SHMEM) goto fallback; =20 - ci =3D lock_cluster(si, offset); + ci =3D swap_cluster_lock(si, offset); if (!swap_is_last_map(si, offset, nr, &has_cache)) { goto locked_fallback; } @@ -1514,21 +1476,20 @@ static bool swap_entries_put_map(struct swap_info_s= truct *si, else for (i =3D 0; i < nr; i++) WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); - unlock_cluster(ci); + swap_cluster_unlock(ci); =20 return has_cache; =20 fallback: - ci =3D lock_cluster(si, offset); + ci =3D swap_cluster_lock(si, offset); locked_fallback: for (i =3D 0; i < nr; i++, entry.val++) { count =3D swap_entry_put_locked(si, ci, entry, 1); if (count =3D=3D SWAP_HAS_CACHE) has_cache =3D true; } - unlock_cluster(ci); + swap_cluster_unlock(ci); return has_cache; - } =20 /* @@ -1578,7 +1539,7 @@ static void swap_entries_free(struct swap_info_struct= *si, unsigned char *map_end =3D map + nr_pages; =20 /* It should never free entries across different clusters */ - VM_BUG_ON(ci !=3D offset_to_cluster(si, offset + nr_pages - 1)); + VM_BUG_ON(ci !=3D swp_offset_cluster(si, offset + nr_pages - 1)); VM_BUG_ON(cluster_is_empty(ci)); VM_BUG_ON(ci->count < nr_pages); =20 @@ -1653,9 +1614,9 @@ bool swap_entry_swapped(struct swap_info_struct *si, = swp_entry_t entry) struct swap_cluster_info *ci; int count; =20 - ci =3D lock_cluster(si, offset); + ci =3D swap_cluster_lock(si, offset); count =3D swap_count(si->swap_map[offset]); - unlock_cluster(ci); + swap_cluster_unlock(ci); return !!count; } =20 @@ -1678,7 +1639,7 @@ int swp_swapcount(swp_entry_t entry) =20 offset =3D swp_offset(entry); =20 - ci =3D lock_cluster(si, offset); + ci =3D swap_cluster_lock(si, offset); =20 count =3D swap_count(si->swap_map[offset]); if (!(count & COUNT_CONTINUED)) @@ -1701,7 +1662,7 @@ int swp_swapcount(swp_entry_t entry) n *=3D (SWAP_CONT_MAX + 1); } while (tmp_count & COUNT_CONTINUED); out: - unlock_cluster(ci); + swap_cluster_unlock(ci); return count; } =20 @@ -1716,7 +1677,7 @@ static bool swap_page_trans_huge_swapped(struct swap_= info_struct *si, int i; bool ret =3D false; =20 - ci =3D lock_cluster(si, offset); + ci =3D swap_cluster_lock(si, offset); if (nr_pages =3D=3D 1) { if (swap_count(map[roffset])) ret =3D true; @@ -1729,7 +1690,7 @@ static bool swap_page_trans_huge_swapped(struct swap_= info_struct *si, } } unlock_out: - unlock_cluster(ci); + swap_cluster_unlock(ci); return ret; } =20 @@ -2662,8 +2623,8 @@ static void wait_for_allocation(struct swap_info_stru= ct *si) BUG_ON(si->flags & SWP_WRITEOK); =20 for (offset =3D 0; offset < end; offset +=3D SWAPFILE_CLUSTER) { - ci =3D lock_cluster(si, offset); - unlock_cluster(ci); + ci =3D swap_cluster_lock(si, offset); + swap_cluster_unlock(ci); } } =20 @@ -3579,7 +3540,7 @@ static int __swap_duplicate(swp_entry_t entry, unsign= ed char usage, int nr) offset =3D swp_offset(entry); VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); VM_WARN_ON(usage =3D=3D 1 && nr > 1); - ci =3D lock_cluster(si, offset); + ci =3D swap_cluster_lock(si, 
offset);
 
 	err = 0;
 	for (i = 0; i < nr; i++) {
@@ -3634,7 +3595,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	}
 
 unlock_out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	return err;
 }
 
@@ -3733,7 +3694,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 
 	offset = swp_offset(entry);
 
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 
 	count = swap_count(si->swap_map[offset]);
 
@@ -3793,7 +3754,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 out_unlock_cont:
 	spin_unlock(&si->cont_lock);
 out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	put_swap_device(si);
 outer:
 	if (page)
-- 
2.51.0

From nobody Fri Oct 3 23:02:52 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
    Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
    Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
    Zi Yan, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
Date: Sat, 23 Aug 2025 03:20:18 +0800
Message-ID: <20250822192023.13477-5-ryncsn@gmail.com>
In-Reply-To: <20250822192023.13477-1-ryncsn@gmail.com>
References: <20250822192023.13477-1-ryncsn@gmail.com>

From: Kairui Song

swp_swap_info() is the most commonly used helper for retrieving swap info.
It has an internal check that may lead to a NULL return value, but almost
none of its callers check the return value, which makes the internal check
pointless. In fact, most of these callers already ensure the entry is
valid and never expect NULL.

Tidy this up and shorten the name. If the caller can make sure the swap
entry/type is valid and the device is pinned, use the newly introduced
swp_info()/swp_type_info() instead. They have more debug sanity checks and
lower overhead as they are inlined. Callers that may expect a NULL value
should use swp_get_info()/swp_type_get_info() instead.

No feature change. The rearranged code should have no effect, or it would
already have been hitting NULL-dereference bugs. Some new sanity checks
are added to the debug build to catch potential misuse. The new helpers
will also be used by the swap cache when working with locked swap cache
folios, as a locked swap cache folio ensures the entries are valid and
stable.
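Roughly, the intended split between the two helper families looks like
this (a simplified sketch based on the hunks below; swp_get_info() stays
internal to mm/swapfile.c, and the variable names here are illustrative):

	/* Caller guarantees a valid entry and a pinned device: never NULL */
	struct swap_info_struct *si = swp_info(folio->swap);

	/* Entry not yet validated (e.g. racing with swapoff): may be NULL */
	struct swap_info_struct *maybe_si = swp_get_info(entry);

	if (WARN_ON_ONCE(!maybe_si))
		return -EINVAL;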
Signed-off-by: Kairui Song Acked-by: Chris Li Reviewed-by: Barry Song --- include/linux/swap.h | 6 ------ mm/page_io.c | 12 ++++++------ mm/swap.h | 33 ++++++++++++++++++++++++++++++--- mm/swap_state.c | 4 ++-- mm/swapfile.c | 35 ++++++++++++++++++----------------- 5 files changed, 56 insertions(+), 34 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 20efd9a34034..cb59c13fef42 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -469,7 +469,6 @@ extern sector_t swapdev_block(int, pgoff_t); extern int __swap_count(swp_entry_t entry); extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t en= try); extern int swp_swapcount(swp_entry_t entry); -struct swap_info_struct *swp_swap_info(swp_entry_t entry); struct backing_dev_info; extern int init_swap_address_space(unsigned int type, unsigned long nr_pag= es); extern void exit_swap_address_space(unsigned int type); @@ -482,11 +481,6 @@ static inline void put_swap_device(struct swap_info_st= ruct *si) } =20 #else /* CONFIG_SWAP */ -static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry) -{ - return NULL; -} - static inline struct swap_info_struct *get_swap_device(swp_entry_t entry) { return NULL; diff --git a/mm/page_io.c b/mm/page_io.c index a2056a5ecb13..bc164677d70b 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -204,7 +204,7 @@ static bool is_folio_zero_filled(struct folio *folio) static void swap_zeromap_folio_set(struct folio *folio) { struct obj_cgroup *objcg =3D get_obj_cgroup_from_folio(folio); - struct swap_info_struct *sis =3D swp_swap_info(folio->swap); + struct swap_info_struct *sis =3D swp_info(folio->swap); int nr_pages =3D folio_nr_pages(folio); swp_entry_t entry; unsigned int i; @@ -223,7 +223,7 @@ static void swap_zeromap_folio_set(struct folio *folio) =20 static void swap_zeromap_folio_clear(struct folio *folio) { - struct swap_info_struct *sis =3D swp_swap_info(folio->swap); + struct swap_info_struct *sis =3D swp_info(folio->swap); swp_entry_t entry; unsigned int i; =20 @@ -374,7 +374,7 @@ static void sio_write_complete(struct kiocb *iocb, long= ret) static void swap_writepage_fs(struct folio *folio, struct swap_iocb **swap= _plug) { struct swap_iocb *sio =3D swap_plug ? 
*swap_plug : NULL; - struct swap_info_struct *sis =3D swp_swap_info(folio->swap); + struct swap_info_struct *sis =3D swp_info(folio->swap); struct file *swap_file =3D sis->swap_file; loff_t pos =3D swap_dev_pos(folio->swap); =20 @@ -446,7 +446,7 @@ static void swap_writepage_bdev_async(struct folio *fol= io, =20 void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug) { - struct swap_info_struct *sis =3D swp_swap_info(folio->swap); + struct swap_info_struct *sis =3D swp_info(folio->swap); =20 VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); /* @@ -537,7 +537,7 @@ static bool swap_read_folio_zeromap(struct folio *folio) =20 static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plu= g) { - struct swap_info_struct *sis =3D swp_swap_info(folio->swap); + struct swap_info_struct *sis =3D swp_info(folio->swap); struct swap_iocb *sio =3D NULL; loff_t pos =3D swap_dev_pos(folio->swap); =20 @@ -608,7 +608,7 @@ static void swap_read_folio_bdev_async(struct folio *fo= lio, =20 void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { - struct swap_info_struct *sis =3D swp_swap_info(folio->swap); + struct swap_info_struct *sis =3D swp_info(folio->swap); bool synchronous =3D sis->flags & SWP_SYNCHRONOUS_IO; bool workingset =3D folio_test_workingset(folio); unsigned long pflags; diff --git a/mm/swap.h b/mm/swap.h index 223b40f2d37e..7b3efaa51624 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -15,6 +15,8 @@ extern int page_cluster; #define swap_entry_order(order) 0 #endif =20 +extern struct swap_info_struct *swap_info[]; + /* * We use this to track usage of a cluster. A cluster is a block of swap d= isk * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All @@ -53,9 +55,28 @@ enum swap_cluster_flags { #include /* for swp_offset */ #include /* for bio_end_io_t */ =20 +/* + * Callers of all swp_* helpers here must ensure the entry is valid, and + * pin the swap device by reference or in other ways. 
+ */ +static inline struct swap_info_struct *swp_type_info(int type) +{ + struct swap_info_struct *si; + + si =3D READ_ONCE(swap_info[type]); /* rcu_dereference() */ + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */ + return si; +} + +static inline struct swap_info_struct *swp_info(swp_entry_t entry) +{ + return swp_type_info(swp_type(entry)); +} + static inline struct swap_cluster_info *swp_offset_cluster( struct swap_info_struct *si, pgoff_t offset) { + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */ return &si->cluster_info[offset / SWAPFILE_CLUSTER]; } =20 @@ -65,6 +86,7 @@ static inline struct swap_cluster_info *swap_cluster_lock( { struct swap_cluster_info *ci =3D swp_offset_cluster(si, offset); =20 + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */ spin_lock(&ci->lock); return ci; } @@ -164,7 +186,7 @@ void swap_update_readahead(struct folio *folio, struct = vm_area_struct *vma, =20 static inline unsigned int folio_swap_flags(struct folio *folio) { - return swp_swap_info(folio->swap)->flags; + return swp_info(folio->swap)->flags; } =20 /* @@ -175,7 +197,7 @@ static inline unsigned int folio_swap_flags(struct foli= o *folio) static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap) { - struct swap_info_struct *sis =3D swp_swap_info(entry); + struct swap_info_struct *sis =3D swp_info(entry); unsigned long start =3D swp_offset(entry); unsigned long end =3D start + max_nr; bool first_bit; @@ -194,7 +216,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry,= int max_nr, =20 static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) { - struct swap_info_struct *si =3D swp_swap_info(entry); + struct swap_info_struct *si =3D swp_info(entry); pgoff_t offset =3D swp_offset(entry); int i; =20 @@ -213,6 +235,11 @@ static inline int non_swapcache_batch(swp_entry_t entr= y, int max_nr) =20 #else /* CONFIG_SWAP */ struct swap_iocb; +static inline struct swap_info_struct *swp_info(swp_entry_t entry) +{ + return NULL; +} + static inline void swap_read_folio(struct folio *folio, struct swap_iocb *= *plug) { } diff --git a/mm/swap_state.c b/mm/swap_state.c index be0d96494dc1..721ff1a5e73a 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -330,7 +330,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry= , gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated, bool skip_if_exists) { - struct swap_info_struct *si =3D swp_swap_info(entry); + struct swap_info_struct *si =3D swp_info(entry); struct folio *folio; struct folio *new_folio =3D NULL; struct folio *result =3D NULL; @@ -554,7 +554,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t gfp_mask, unsigned long offset =3D entry_offset; unsigned long start_offset, end_offset; unsigned long mask; - struct swap_info_struct *si =3D swp_swap_info(entry); + struct swap_info_struct *si =3D swp_info(entry); struct blk_plug plug; struct swap_iocb *splug =3D NULL; bool page_allocated; diff --git a/mm/swapfile.c b/mm/swapfile.c index 618cf4333a3d..85606fbebf0f 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head); static struct plist_head *swap_avail_heads; static DEFINE_SPINLOCK(swap_avail_lock); =20 -static struct swap_info_struct *swap_info[MAX_SWAPFILES]; +struct swap_info_struct *swap_info[MAX_SWAPFILES]; =20 static DEFINE_MUTEX(swapon_mutex); =20 @@ -124,14 +124,20 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, per= cpu_swap_cluster) 
=3D { .lock =3D INIT_LOCAL_LOCK(), }; =20 -static struct swap_info_struct *swap_type_to_swap_info(int type) +/* May return NULL on invalid type, caller must check for NULL return */ +static struct swap_info_struct *swp_type_get_info(int type) { if (type >=3D MAX_SWAPFILES) return NULL; - return READ_ONCE(swap_info[type]); /* rcu_dereference() */ } =20 +/* May return NULL on invalid entry, caller must check for NULL return */ +static struct swap_info_struct *swp_get_info(swp_entry_t entry) +{ + return swp_type_get_info(swp_type(entry)); +} + static inline unsigned char swap_count(unsigned char ent) { return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ @@ -343,7 +349,7 @@ offset_to_swap_extent(struct swap_info_struct *sis, uns= igned long offset) =20 sector_t swap_folio_sector(struct folio *folio) { - struct swap_info_struct *sis =3D swp_swap_info(folio->swap); + struct swap_info_struct *sis =3D swp_info(folio->swap); struct swap_extent *se; sector_t sector; pgoff_t offset; @@ -1301,7 +1307,7 @@ static struct swap_info_struct *_swap_info_get(swp_en= try_t entry) =20 if (!entry.val) goto out; - si =3D swp_swap_info(entry); + si =3D swp_get_info(entry); if (!si) goto bad_nofile; if (data_race(!(si->flags & SWP_USED))) @@ -1416,7 +1422,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t = entry) =20 if (!entry.val) goto out; - si =3D swp_swap_info(entry); + si =3D swp_get_info(entry); if (!si) goto bad_nofile; if (!get_swap_device_info(si)) @@ -1597,7 +1603,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t = entry) =20 int __swap_count(swp_entry_t entry) { - struct swap_info_struct *si =3D swp_swap_info(entry); + struct swap_info_struct *si =3D swp_info(entry); pgoff_t offset =3D swp_offset(entry); =20 return swap_count(si->swap_map[offset]); @@ -1828,7 +1834,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) =20 swp_entry_t get_swap_page_of_type(int type) { - struct swap_info_struct *si =3D swap_type_to_swap_info(type); + struct swap_info_struct *si =3D swp_type_get_info(type); unsigned long offset; swp_entry_t entry =3D {0}; =20 @@ -1909,7 +1915,7 @@ int find_first_swap(dev_t *device) */ sector_t swapdev_block(int type, pgoff_t offset) { - struct swap_info_struct *si =3D swap_type_to_swap_info(type); + struct swap_info_struct *si =3D swp_type_get_info(type); struct swap_extent *se; =20 if (!si || !(si->flags & SWP_WRITEOK)) @@ -2837,7 +2843,7 @@ static void *swap_start(struct seq_file *swap, loff_t= *pos) if (!l) return SEQ_START_TOKEN; =20 - for (type =3D 0; (si =3D swap_type_to_swap_info(type)); type++) { + for (type =3D 0; (si =3D swp_type_get_info(type)); type++) { if (!(si->flags & SWP_USED) || !si->swap_map) continue; if (!--l) @@ -2858,7 +2864,7 @@ static void *swap_next(struct seq_file *swap, void *v= , loff_t *pos) type =3D si->type + 1; =20 ++(*pos); - for (; (si =3D swap_type_to_swap_info(type)); type++) { + for (; (si =3D swp_type_get_info(type)); type++) { if (!(si->flags & SWP_USED) || !si->swap_map) continue; return si; @@ -3531,7 +3537,7 @@ static int __swap_duplicate(swp_entry_t entry, unsign= ed char usage, int nr) unsigned char has_cache; int err, i; =20 - si =3D swp_swap_info(entry); + si =3D swp_get_info(entry); if (WARN_ON_ONCE(!si)) { pr_err("%s%08lx\n", Bad_file, entry.val); return -EINVAL; @@ -3646,11 +3652,6 @@ void swapcache_clear(struct swap_info_struct *si, sw= p_entry_t entry, int nr) swap_entries_put_cache(si, entry, nr); } =20 -struct swap_info_struct *swp_swap_info(swp_entry_t entry) -{ - return 
swap_type_to_swap_info(swp_type(entry)); -} - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entr= y's --=20 2.51.0 From nobody Fri Oct 3 23:02:52 2025
From: Kairui Song To: linux-mm@kvack.org Cc: Andrew Morton , Matthew Wilcox , Hugh Dickins , Chris Li , Barry Song , Baoquan He , Nhat Pham , Kemeng Shi , Baolin Wang , Ying Huang , Johannes Weiner , David Hildenbrand , Yosry Ahmed , Lorenzo Stoakes , Zi Yan , linux-kernel@vger.kernel.org, Kairui Song Subject: [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio Date: Sat, 23 Aug 2025 03:20:19 +0800 Message-ID: <20250822192023.13477-6-ryncsn@gmail.com> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20250822192023.13477-1-ryncsn@gmail.com> References: <20250822192023.13477-1-ryncsn@gmail.com> Reply-To: Kairui Song Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Kairui Song Shmem may replace a folio in the swap cache if the cached one doesn't fit the swapin's GFP zone. When doing so, shmem has already double checked that the swap cache folio is locked, still has the swap cache flag set, and contains the wanted swap entry. So it is impossible to fail due to an Xarray mismatch. There is even a comment for that. Delete the defensive error handling path, and add a WARN_ON instead: if that happened, something has broken the basic principle of how the swap cache works, we should catch and fix that. Signed-off-by: Kairui Song Reviewed-by: David Hildenbrand --- mm/shmem.c | 28 +++------------------------- 1 file changed, 3 insertions(+), 25 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index b4d39f2a1e0a..e03793cc5169 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2158,35 +2158,13 @@ static int shmem_replace_folio(struct folio **folio= p, gfp_t gfp, /* Swap cache still stores N entries instead of a high-order entry */ xa_lock_irq(&swap_mapping->i_pages); for (i =3D 0; i < nr_pages; i++) { - void *item =3D xas_load(&xas); - - if (item !=3D old) { - error =3D -ENOENT; - break; - } - - xas_store(&xas, new); + WARN_ON_ONCE(xas_store(&xas, new)); xas_next(&xas); } - if (!error) { - mem_cgroup_replace_folio(old, new); - shmem_update_stats(new, nr_pages); - shmem_update_stats(old, -nr_pages); - } xa_unlock_irq(&swap_mapping->i_pages); =20 - if (unlikely(error)) { - /* - * Is this possible? I think not, now that our callers - * check both the swapcache flag and folio->private - * after getting the folio lock; but be defensive.
- * Reverse old to newpage for clear and free. - */ - old =3D new; - } else { - folio_add_lru(new); - *foliop =3D new; - } + folio_add_lru(new); + *foliop =3D new; =20 folio_clear_swapcache(old); old->private =3D NULL; --=20 2.51.0 From nobody Fri Oct 3 23:02:52 2025
From: Kairui Song To: linux-mm@kvack.org Cc: Andrew Morton , Matthew Wilcox , Hugh Dickins , Chris Li , Barry Song , Baoquan He , Nhat Pham , Kemeng Shi , Baolin Wang , Ying Huang , Johannes Weiner , David Hildenbrand , Yosry Ahmed , Lorenzo Stoakes , Zi Yan , linux-kernel@vger.kernel.org, Kairui Song Subject: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API Date: Sat, 23 Aug 2025 03:20:20 +0800 Message-ID: <20250822192023.13477-7-ryncsn@gmail.com> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20250822192023.13477-1-ryncsn@gmail.com> References: <20250822192023.13477-1-ryncsn@gmail.com> Reply-To: Kairui Song Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Kairui Song Introduce basic swap table infrastructures, which are now just a fixed-sized flat array inside each swap cluster, with access wrappers. Each cluster contains a swap table of 512 entries. Each table entry is an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE), a folio type (pointer), or NULL. In this first step, it only supports storing a folio or shadow, and it is a drop-in replacement for the current swap cache. Convert all swap cache users to use the new sets of APIs. Chris Li has been suggesting using a new infrastructure for swap cache for better performance, and that idea combined well with the swap table as the new backing structure. Now the lock contention range is reduced to 2M clusters, which is much smaller than the 64M address_space. And we can also drop the multiple address_space design. All the internal works are done with swap_cache_get_* helpers. Swap cache lookup is still lock-less like before, and the helper's contexts are same with original swap cache helpers. They still require a pin on the swap device to prevent the backing data from being freed. Swap cache updates are now protected by the swap cluster lock instead of the Xarray lock. This is mostly handled internally, but new __swap_cache_* helpers require the caller to lock the cluster. So, a few new cluster access and locking helpers are also introduced.
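To illustrate the encoding described above, here is a minimal, hypothetical sketch (the names below are illustrative only; the real accessors this patch adds live in mm/swap_table.h further down). Each table entry is a single unsigned long read out of an atomic_long_t, and its type can be told apart as follows:

#include <linux/mm_types.h>	/* struct folio */
#include <linux/xarray.h>	/* xa_is_value() */

/*
 * A swap table entry is either 0 (empty slot), an XA_VALUE (a workingset
 * shadow, tag bit set), or any other non-NULL value (a folio pointer).
 */
static inline bool example_swp_tb_is_shadow(unsigned long swp_tb)
{
	return xa_is_value((void *)swp_tb);
}

static inline bool example_swp_tb_is_folio(unsigned long swp_tb)
{
	return swp_tb && !xa_is_value((void *)swp_tb);
}

static inline struct folio *example_swp_tb_to_folio(unsigned long swp_tb)
{
	return example_swp_tb_is_folio(swp_tb) ? (struct folio *)swp_tb : NULL;
}

Because the whole state of a slot fits in one word, lookups can stay lock-less (an atomic read of the slot followed by folio_try_get()), while stores are serialized by the per-cluster lock as described above.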
A fully cluster-based unified swap table can be implemented on top of this to take care of all count tracking and synchronization work, with dynamic allocation. It should reduce the memory usage while making the performance even better. Co-developed-by: Chris Li Signed-off-by: Chris Li Signed-off-by: Kairui Song --- MAINTAINERS | 1 + include/linux/swap.h | 2 - mm/filemap.c | 2 +- mm/huge_memory.c | 16 +-- mm/memory-failure.c | 2 +- mm/memory.c | 2 +- mm/migrate.c | 28 ++-- mm/shmem.c | 26 ++-- mm/swap.h | 151 +++++++++++++++------ mm/swap_state.c | 315 +++++++++++++++++++++---------------------- mm/swap_table.h | 106 +++++++++++++++ mm/swapfile.c | 105 +++++++++++---- mm/vmscan.c | 20 ++- mm/zswap.c | 2 +- 14 files changed, 500 insertions(+), 278 deletions(-) create mode 100644 mm/swap_table.h diff --git a/MAINTAINERS b/MAINTAINERS index b6f7c6939ff8..b78adfb3c7f0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16214,6 +16214,7 @@ F: include/linux/swapops.h F: mm/page_io.c F: mm/swap.c F: mm/swap.h +F: mm/swap_table.h F: mm/swap_state.c F: mm/swapfile.c =20 diff --git a/include/linux/swap.h b/include/linux/swap.h index cb59c13fef42..7455df9bf340 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -470,8 +470,6 @@ extern int __swap_count(swp_entry_t entry); extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t en= try); extern int swp_swapcount(swp_entry_t entry); struct backing_dev_info; -extern int init_swap_address_space(unsigned int type, unsigned long nr_pag= es); -extern void exit_swap_address_space(unsigned int type); extern struct swap_info_struct *get_swap_device(swp_entry_t entry); sector_t swap_folio_sector(struct folio *folio); =20 diff --git a/mm/filemap.c b/mm/filemap.c index e4a5a46db89b..1fd0565b56e4 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -4504,7 +4504,7 @@ static void filemap_cachestat(struct address_space *m= apping, * invalidation, so there might not be * a shadow in the swapcache (yet). */ - shadow =3D get_shadow_from_swap_cache(swp); + shadow =3D swap_cache_get_shadow(swp); if (!shadow) goto resched; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2a47cd3bb649..209580d395a1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3721,7 +3721,7 @@ static int __folio_split(struct folio *folio, unsigne= d int new_order, /* Prevent deferred_split_scan() touching ->_refcount */ spin_lock(&ds_queue->split_queue_lock); if (folio_ref_freeze(folio, 1 + extra_pins)) { - struct address_space *swap_cache =3D NULL; + struct swap_cluster_info *swp_ci =3D NULL; struct lruvec *lruvec; int expected_refs; =20 @@ -3765,8 +3765,7 @@ static int __folio_split(struct folio *folio, unsigne= d int new_order, goto fail; } =20 - swap_cache =3D swap_address_space(folio->swap); - xa_lock(&swap_cache->i_pages); + swp_ci =3D swap_cluster_lock_by_folio(folio); } =20 /* lock lru list/PageCompound, ref frozen by page_ref_freeze */ @@ -3798,10 +3797,9 @@ static int __folio_split(struct folio *folio, unsign= ed int new_order, * Anonymous folio with swap cache. * NOTE: shmem in swap cache is not supported yet. 
*/ - if (swap_cache) { - __xa_store(&swap_cache->i_pages, - swap_cache_index(new_folio->swap), - new_folio, 0); + if (swp_ci) { + __swap_cache_replace_folio(swp_ci, new_folio->swap, + folio, new_folio); continue; } =20 @@ -3836,8 +3834,8 @@ static int __folio_split(struct folio *folio, unsigne= d int new_order, =20 unlock_page_lruvec(lruvec); =20 - if (swap_cache) - xa_unlock(&swap_cache->i_pages); + if (swp_ci) + swap_cluster_unlock(swp_ci); } else { spin_unlock(&ds_queue->split_queue_lock); ret =3D -EAGAIN; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index c15ffee7d32b..bb92d0c72aec 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1199,7 +1199,7 @@ static int me_swapcache_clean(struct page_state *ps, = struct page *p) struct folio *folio =3D page_folio(p); int ret; =20 - delete_from_swap_cache(folio); + swap_cache_del_folio(folio); =20 ret =3D delete_from_lru_cache(folio) ? MF_FAILED : MF_RECOVERED; folio_unlock(folio); diff --git a/mm/memory.c b/mm/memory.c index 9ca8e1873c6e..f81bf06e6ff5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4696,7 +4696,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) =20 memcg1_swapin(entry, nr_pages); =20 - shadow =3D get_shadow_from_swap_cache(entry); + shadow =3D swap_cache_get_shadow(entry); if (shadow) workingset_refault(folio, shadow); =20 diff --git a/mm/migrate.c b/mm/migrate.c index 8e435a078fc3..74db32caba2d 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -563,10 +563,10 @@ static int __folio_migrate_mapping(struct address_spa= ce *mapping, struct folio *newfolio, struct folio *folio, int expected_count) { XA_STATE(xas, &mapping->i_pages, folio_index(folio)); + struct swap_cluster_info *swp_ci =3D NULL; struct zone *oldzone, *newzone; int dirty; long nr =3D folio_nr_pages(folio); - long entries, i; =20 if (!mapping) { /* Take off deferred split queue while frozen and memcg set */ @@ -592,9 +592,16 @@ static int __folio_migrate_mapping(struct address_spac= e *mapping, oldzone =3D folio_zone(folio); newzone =3D folio_zone(newfolio); =20 - xas_lock_irq(&xas); + if (folio_test_swapcache(folio)) + swp_ci =3D swap_cluster_lock_by_folio_irq(folio); + else + xas_lock_irq(&xas); + if (!folio_ref_freeze(folio, expected_count)) { - xas_unlock_irq(&xas); + if (swp_ci) + swap_cluster_unlock(swp_ci); + else + xas_unlock_irq(&xas); return -EAGAIN; } =20 @@ -615,9 +622,6 @@ static int __folio_migrate_mapping(struct address_space= *mapping, if (folio_test_swapcache(folio)) { folio_set_swapcache(newfolio); newfolio->private =3D folio_get_private(folio); - entries =3D nr; - } else { - entries =3D 1; } =20 /* Move dirty while folio refs frozen and newfolio not yet exposed */ @@ -627,11 +631,10 @@ static int __folio_migrate_mapping(struct address_spa= ce *mapping, folio_set_dirty(newfolio); } =20 - /* Swap cache still stores N entries instead of a high-order entry */ - for (i =3D 0; i < entries; i++) { + if (folio_test_swapcache(folio)) + __swap_cache_replace_folio(swp_ci, folio->swap, folio, newfolio); + else xas_store(&xas, newfolio); - xas_next(&xas); - } =20 /* * Drop cache reference from old folio by unfreezing @@ -640,8 +643,11 @@ static int __folio_migrate_mapping(struct address_spac= e *mapping, */ folio_ref_unfreeze(folio, expected_count - nr); =20 - xas_unlock(&xas); /* Leave irq disabled to prevent preemption while updating stats */ + if (swp_ci) + swap_cluster_unlock(swp_ci); + else + xas_unlock(&xas); =20 /* * If moved to a different zone then also account diff --git a/mm/shmem.c b/mm/shmem.c index e03793cc5169..f088115cf209 
100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1698,13 +1698,13 @@ int shmem_writeout(struct folio *folio, struct swap= _iocb **plug, } =20 /* - * The delete_from_swap_cache() below could be left for + * The swap_cache_del_folio() below could be left for * shrink_folio_list()'s folio_free_swap() to dispose of; * but I'm a little nervous about letting this folio out of * shmem_writeout() in a hybrid half-tmpfs-half-swap state * e.g. folio_mapping(folio) might give an unexpected answer. */ - delete_from_swap_cache(folio); + swap_cache_del_folio(folio); goto redirty; } if (nr_pages > 1) @@ -2082,7 +2082,7 @@ static struct folio *shmem_swap_alloc_folio(struct in= ode *inode, new->swap =3D entry; =20 memcg1_swapin(entry, nr_pages); - shadow =3D get_shadow_from_swap_cache(entry); + shadow =3D swap_cache_get_shadow(entry); if (shadow) workingset_refault(new, shadow); folio_add_lru(new); @@ -2120,13 +2120,11 @@ static int shmem_replace_folio(struct folio **folio= p, gfp_t gfp, struct shmem_inode_info *info, pgoff_t index, struct vm_area_struct *vma) { + struct swap_cluster_info *ci; struct folio *new, *old =3D *foliop; swp_entry_t entry =3D old->swap; - struct address_space *swap_mapping =3D swap_address_space(entry); - pgoff_t swap_index =3D swap_cache_index(entry); - XA_STATE(xas, &swap_mapping->i_pages, swap_index); int nr_pages =3D folio_nr_pages(old); - int error =3D 0, i; + int error =3D 0; =20 /* * We have arrived here because our zones are constrained, so don't @@ -2155,13 +2153,9 @@ static int shmem_replace_folio(struct folio **foliop= , gfp_t gfp, new->swap =3D entry; folio_set_swapcache(new); =20 - /* Swap cache still stores N entries instead of a high-order entry */ - xa_lock_irq(&swap_mapping->i_pages); - for (i =3D 0; i < nr_pages; i++) { - WARN_ON_ONCE(xas_store(&xas, new)); - xas_next(&xas); - } - xa_unlock_irq(&swap_mapping->i_pages); + ci =3D swap_cluster_lock_by_folio_irq(old); + __swap_cache_replace_folio(ci, entry, old, new); + swap_cluster_unlock(ci); =20 folio_add_lru(new); *foliop =3D new; @@ -2198,7 +2192,7 @@ static void shmem_set_folio_swapin_error(struct inode= *inode, pgoff_t index, nr_pages =3D folio_nr_pages(folio); folio_wait_writeback(folio); if (!skip_swapcache) - delete_from_swap_cache(folio); + swap_cache_del_folio(folio); /* * Don't treat swapin error folio as alloced. 
Otherwise inode->i_blocks * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks) @@ -2438,7 +2432,7 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, folio->swap.val =3D 0; swapcache_clear(si, swap, nr_pages); } else { - delete_from_swap_cache(folio); + swap_cache_del_folio(folio); } folio_mark_dirty(folio); swap_free_nr(swap, nr_pages); diff --git a/mm/swap.h b/mm/swap.h index 7b3efaa51624..4af42bc2cd72 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -2,6 +2,7 @@ #ifndef _MM_SWAP_H #define _MM_SWAP_H =20 +#include /* for atomic_long_t */ struct mempolicy; struct swap_iocb; =20 @@ -35,6 +36,7 @@ struct swap_cluster_info { u16 count; u8 flags; u8 order; + atomic_long_t *table; /* Swap table entries, see mm/swap_table.h */ struct list_head list; }; =20 @@ -80,22 +82,62 @@ static inline struct swap_cluster_info *swp_offset_clus= ter( return &si->cluster_info[offset / SWAPFILE_CLUSTER]; } =20 -static inline struct swap_cluster_info *swap_cluster_lock( - struct swap_info_struct *si, - unsigned long offset) +static inline struct swap_cluster_info *swp_cluster(swp_entry_t entry) +{ + return swp_offset_cluster(swp_info(entry), swp_offset(entry)); +} + +static inline unsigned int swp_cluster_offset(swp_entry_t entry) +{ + return swp_offset(entry) % SWAPFILE_CLUSTER; +} + +/* + * Lock the swap cluster of the given offset. The caller must ensure the s= wap + * offset is valid and that the following accesses won't go beyond the loc= ked + * cluster. swap_cluster_lock_by_folio is preferred when possible + */ +static __always_inline struct swap_cluster_info *__swap_cluster_lock( + struct swap_info_struct *si, unsigned long offset, bool irq) { struct swap_cluster_info *ci =3D swp_offset_cluster(si, offset); =20 VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */ - spin_lock(&ci->lock); + if (irq) + spin_lock_irq(&ci->lock); + else + spin_lock(&ci->lock); return ci; } +#define swap_cluster_lock(si, off) __swap_cluster_lock(si, off, false) + +/* + * Lock the swap cluster that holds a folio's swap entries. Caller needs t= o lock + * the folio and ensure it's in the swap cache, and only touch the folio's= swap + * entries. A folio's entries are always in one cluster, and a locked foli= o lock + * ensures it won't be freed from the swap cache, hence stabilizing the de= vice. 
+ */ +static inline struct swap_cluster_info *__swap_cluster_lock_by_folio( + struct folio *folio, bool irq) +{ + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); + return __swap_cluster_lock(swp_info(folio->swap), + swp_offset(folio->swap), irq); +} +#define swap_cluster_lock_by_folio(folio) __swap_cluster_lock_by_folio(fol= io, false) +#define swap_cluster_lock_by_folio_irq(folio) __swap_cluster_lock_by_folio= (folio, true) =20 static inline void swap_cluster_unlock(struct swap_cluster_info *ci) { spin_unlock(&ci->lock); } =20 +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci) +{ + spin_unlock_irq(&ci->lock); +} + /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; @@ -115,10 +157,11 @@ void __swap_writepage(struct folio *folio, struct swa= p_iocb **swap_plug); #define SWAP_ADDRESS_SPACE_SHIFT 14 #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT) #define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1) -extern struct address_space *swapper_spaces[]; -#define swap_address_space(entry) \ - (&swapper_spaces[swp_type(entry)][swp_offset(entry) \ - >> SWAP_ADDRESS_SPACE_SHIFT]) +extern struct address_space swap_space __ro_after_init; +static inline struct address_space *swap_address_space(swp_entry_t entry) +{ + return &swap_space; +} =20 /* * Return the swap device position of the swap entry. @@ -128,15 +171,6 @@ static inline loff_t swap_dev_pos(swp_entry_t entry) return ((loff_t)swp_offset(entry)) << PAGE_SHIFT; } =20 -/* - * Return the swap cache index of the swap entry. - */ -static inline pgoff_t swap_cache_index(swp_entry_t entry) -{ - BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) !=3D SWP_OFFSET_= MASK); - return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK; -} - /** * folio_contains_swap - Does this folio contain this swap entry? * @folio: The folio. @@ -160,17 +194,31 @@ static inline bool folio_contains_swap(struct folio *= folio, swp_entry_t entry) return offset - swp_offset(folio->swap) < folio_nr_pages(folio); } =20 +/* + * All swap cache helpers below require the caller to ensure the swap entr= ies + * are valid and pin the device. This can be guaranteed by: + * - get_swap_device: this ensures a single entry is valid and increases t= he + * swap device's refcount. + * - Locking a folio in the swap cache: this ensures the folio won't be fr= eed + * from the swap cache, stabilizes its entries, and the swap device. + * - Locking anything referencing the swap entry: e.g. locking the PTL that + * protects swap entries in the page table, so they won't be freed. + */ +extern struct folio *swap_cache_get_folio(swp_entry_t entry); +extern void *swap_cache_get_shadow(swp_entry_t entry); +extern int swap_cache_add_folio(swp_entry_t entry, + struct folio *folio, void **shadow); +extern void swap_cache_del_folio(struct folio *folio); +/* Below helpers also require the caller to lock the swap cluster. 
*/ +extern void __swap_cache_del_folio(swp_entry_t entry, + struct folio *folio, void *shadow); +extern void __swap_cache_replace_folio(struct swap_cluster_info *ci, + swp_entry_t entry, struct folio *old, + struct folio *new); +extern void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents); + void show_swap_cache_info(void); -void *get_shadow_from_swap_cache(swp_entry_t entry); -int add_to_swap_cache(struct folio *folio, swp_entry_t entry, - gfp_t gfp, void **shadowp); -void __delete_from_swap_cache(struct folio *folio, - swp_entry_t entry, void *shadow); -void delete_from_swap_cache(struct folio *folio); -void clear_shadow_from_swap_cache(int type, unsigned long begin, - unsigned long end); void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int n= r); -struct folio *swap_cache_get_folio(swp_entry_t entry); struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, struct vm_area_struct *vma, unsigned long addr, struct swap_iocb **plug); @@ -235,6 +283,33 @@ static inline int non_swapcache_batch(swp_entry_t entr= y, int max_nr) =20 #else /* CONFIG_SWAP */ struct swap_iocb; + +static inline struct swap_cluster_info *swap_cluster_lock( + struct swap_info_struct *si, pgoff_t offset, bool irq) +{ + return NULL; +} + +static inline struct swap_cluster_info *swap_cluster_lock_by_folio( + struct folio *folio) +{ + return NULL; +} + +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq( + struct folio *folio) +{ + return NULL; +} + +static inline void swap_cluster_unlock(struct swap_cluster_info *ci) +{ +} + +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci) +{ +} + static inline struct swap_info_struct *swp_info(swp_entry_t entry) { return NULL; @@ -252,11 +327,6 @@ static inline struct address_space *swap_address_space= (swp_entry_t entry) return NULL; } =20 -static inline pgoff_t swap_cache_index(swp_entry_t entry) -{ - return 0; -} - static inline bool folio_contains_swap(struct folio *folio, swp_entry_t en= try) { return false; @@ -298,28 +368,27 @@ static inline struct folio *swap_cache_get_folio(swp_= entry_t entry) return NULL; } =20 -static inline void *get_shadow_from_swap_cache(swp_entry_t entry) +static inline void *swap_cache_get_shadow(swp_entry_t end) { return NULL; } =20 -static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry, - gfp_t gfp_mask, void **shadowp) +static inline int swap_cache_add_folio(swp_entry_t end, struct folio *foli= o, void **shadow) { - return -1; + return -EINVAL; } =20 -static inline void __delete_from_swap_cache(struct folio *folio, - swp_entry_t entry, void *shadow) +static inline void swap_cache_del_folio(struct folio *folio) { } =20 -static inline void delete_from_swap_cache(struct folio *folio) +static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio = *folio, void *shadow) { } =20 -static inline void clear_shadow_from_swap_cache(int type, unsigned long be= gin, - unsigned long end) +static inline void __swap_cache_replace_folio( + struct swap_cluster_info *ci, swp_entry_t entry, + struct folio *old, struct folio *new) { } =20 @@ -354,7 +423,7 @@ static inline int non_swapcache_batch(swp_entry_t entry= , int max_nr) static inline pgoff_t folio_index(struct folio *folio) { if (unlikely(folio_test_swapcache(folio))) - return swap_cache_index(folio->swap); + return swp_offset(folio->swap); return folio->index; } =20 diff --git a/mm/swap_state.c b/mm/swap_state.c index 721ff1a5e73a..c0342024b4a8 100644 --- a/mm/swap_state.c +++ 
b/mm/swap_state.c @@ -23,6 +23,7 @@ #include #include #include "internal.h" +#include "swap_table.h" #include "swap.h" =20 /* @@ -36,8 +37,11 @@ static const struct address_space_operations swap_aops = =3D { #endif }; =20 -struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly; -static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly; +/* Set swap_space is read only as swap cache is handled by swap table */ +struct address_space swap_space __ro_after_init =3D { + .a_ops =3D &swap_aops, +}; + static bool enable_vma_readahead __read_mostly =3D true; =20 #define SWAP_RA_ORDER_CEILING 5 @@ -69,7 +73,7 @@ void show_swap_cache_info(void) printk("Total swap =3D %lukB\n", K(total_swap_pages)); } =20 -/* +/** * swap_cache_get_folio - Lookup a swap entry in the swap cache. * * A found folio will be returned unlocked and with its refcount increased. @@ -79,155 +83,179 @@ void show_swap_cache_info(void) */ struct folio *swap_cache_get_folio(swp_entry_t entry) { - struct folio *folio =3D filemap_get_folio(swap_address_space(entry), - swap_cache_index(entry)); - if (!IS_ERR(folio)) - return folio; + unsigned long swp_tb; + struct folio *folio; + + for (;;) { + swp_tb =3D __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry= )); + if (!swp_tb_is_folio(swp_tb)) + return NULL; + folio =3D swp_tb_to_folio(swp_tb); + if (folio_try_get(folio)) + return folio; + } + return NULL; } =20 -void *get_shadow_from_swap_cache(swp_entry_t entry) +/** + * swap_cache_get_shadow - Lookup a shadow in the swap cache. + * + * Context: Caller must ensure @entry is valid and pin the swap device. + */ +void *swap_cache_get_shadow(swp_entry_t entry) { - struct address_space *address_space =3D swap_address_space(entry); - pgoff_t idx =3D swap_cache_index(entry); - void *shadow; + unsigned long swp_tb; + + swp_tb =3D __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry)= ); + if (swp_tb_is_shadow(swp_tb)) + return swp_tb_to_shadow(swp_tb); =20 - shadow =3D xa_load(&address_space->i_pages, idx); - if (xa_is_value(shadow)) - return shadow; return NULL; } =20 -/* - * add_to_swap_cache resembles filemap_add_folio on swapper_space, - * but sets SwapCache flag and 'swap' instead of mapping and index. +/** + * swap_cache_add_folio - add a folio into the swap cache. + * + * The folio will be used for swapin or swapout of swap entries + * starting with @entry. May fail due to race. + * + * Context: Caller must ensure @entry is valid and pin the swap device. 
*/ -int add_to_swap_cache(struct folio *folio, swp_entry_t entry, - gfp_t gfp, void **shadowp) +int swap_cache_add_folio(swp_entry_t entry, struct folio *folio, void **sh= adowp) { - struct address_space *address_space =3D swap_address_space(entry); - pgoff_t idx =3D swap_cache_index(entry); - XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio)); - unsigned long i, nr =3D folio_nr_pages(folio); - void *old; - - xas_set_update(&xas, workingset_update_node); - - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio); - VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio); + unsigned long exist; + void *shadow =3D NULL; + struct swap_cluster_info *ci; + unsigned int ci_start, ci_off, ci_end; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); + + ci =3D swap_cluster_lock(swp_info(entry), swp_offset(entry)); + ci_start =3D swp_cluster_offset(entry); + ci_end =3D ci_start + nr_pages; + ci_off =3D ci_start; + do { + exist =3D __swap_table_get(ci, ci_off); + if (unlikely(swp_tb_is_folio(exist))) + goto fail; + if (swp_tb_is_shadow(exist)) + shadow =3D swp_tb_to_shadow(exist); + } while (++ci_off < ci_end); + + ci_off =3D ci_start; + do { + __swap_table_set_folio(ci, ci_off, folio); + } while (++ci_off < ci_end); =20 - folio_ref_add(folio, nr); + folio_ref_add(folio, nr_pages); folio_set_swapcache(folio); folio->swap =3D entry; + swap_cluster_unlock(ci); =20 - do { - xas_lock_irq(&xas); - xas_create_range(&xas); - if (xas_error(&xas)) - goto unlock; - for (i =3D 0; i < nr; i++) { - VM_BUG_ON_FOLIO(xas.xa_index !=3D idx + i, folio); - if (shadowp) { - old =3D xas_load(&xas); - if (xa_is_value(old)) - *shadowp =3D old; - } - xas_store(&xas, folio); - xas_next(&xas); - } - address_space->nrpages +=3D nr; - __node_stat_mod_folio(folio, NR_FILE_PAGES, nr); - __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr); -unlock: - xas_unlock_irq(&xas); - } while (xas_nomem(&xas, gfp)); - - if (!xas_error(&xas)) - return 0; + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); =20 - folio_clear_swapcache(folio); - folio_ref_sub(folio, nr); - return xas_error(&xas); + if (shadowp) + *shadowp =3D shadow; + return 0; +fail: + swap_cluster_unlock(ci); + return -EEXIST; } =20 /* - * This must be called only on folios that have - * been verified to be in the swap cache. + * Caller must ensure the folio is in the swap cache and locked, + * also lock the swap cluster. 
*/ -void __delete_from_swap_cache(struct folio *folio, - swp_entry_t entry, void *shadow) +void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, + void *shadow) { - struct address_space *address_space =3D swap_address_space(entry); - int i; - long nr =3D folio_nr_pages(folio); - pgoff_t idx =3D swap_cache_index(entry); - XA_STATE(xas, &address_space->i_pages, idx); - - xas_set_update(&xas, workingset_update_node); - - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); - VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio); - - for (i =3D 0; i < nr; i++) { - void *entry =3D xas_store(&xas, shadow); - VM_BUG_ON_PAGE(entry !=3D folio, entry); - xas_next(&xas); - } + unsigned long exist; + struct swap_cluster_info *ci; + unsigned int ci_start, ci_off, ci_end; + unsigned long nr_pages =3D folio_nr_pages(folio); + + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio); + + ci =3D swp_offset_cluster(swp_info(entry), swp_offset(entry)); + ci_start =3D swp_cluster_offset(entry); + ci_end =3D ci_start + nr_pages; + ci_off =3D ci_start; + do { + exist =3D __swap_table_get(ci, ci_off); + VM_WARN_ON_ONCE(swp_tb_to_folio(exist) !=3D folio); + /* If shadow is NULL, we sets an empty shadow */ + __swap_table_set_shadow(ci, ci_off, shadow); + } while (++ci_off < ci_end); + folio->swap.val =3D 0; folio_clear_swapcache(folio); - address_space->nrpages -=3D nr; - __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr); - __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr); + node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages); + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); } =20 /* - * This must be called only on folios that have - * been verified to be in the swap cache and locked. - * It will never put the folio into the free list, - * the caller has a reference on the folio. + * Replace an old folio in the swap cache with a new one. The caller must + * hold the cluster lock and set the new folio's entry and flags. */ -void delete_from_swap_cache(struct folio *folio) +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t = entry, + struct folio *old, struct folio *new) +{ + unsigned int ci_off =3D swp_cluster_offset(entry); + unsigned long nr_pages =3D folio_nr_pages(new); + unsigned int ci_end =3D ci_off + nr_pages; + + VM_WARN_ON_ONCE(entry.val !=3D new->swap.val); + VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new)); + VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new)); + do { + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) !=3D old); + __swap_table_set_folio(ci, ci_off, new); + } while (++ci_off < ci_end); + + /* + * If the old folio is partially replaced (e.g., splitting a large + * folio, the old folio is shrunk in place, and new split sub folios + * are added to cache), ensure the new folio doesn't overlap it. 
+ */ + if (IS_ENABLED(CONFIG_DEBUG_VM) && + folio_order(old) !=3D folio_order(new)) { + ci_off =3D swp_cluster_offset(old->swap); + ci_end =3D ci_off + folio_nr_pages(old); + while (ci_off++ < ci_end) + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) !=3D old); + } +} + +void swap_cache_del_folio(struct folio *folio) { + struct swap_cluster_info *ci; swp_entry_t entry =3D folio->swap; - struct address_space *address_space =3D swap_address_space(entry); =20 - xa_lock_irq(&address_space->i_pages); - __delete_from_swap_cache(folio, entry, NULL); - xa_unlock_irq(&address_space->i_pages); + ci =3D swap_cluster_lock(swp_info(entry), swp_offset(entry)); + __swap_cache_del_folio(entry, folio, NULL); + swap_cluster_unlock(ci); =20 put_swap_folio(folio, entry); folio_ref_sub(folio, folio_nr_pages(folio)); } =20 -void clear_shadow_from_swap_cache(int type, unsigned long begin, - unsigned long end) +void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents) { - unsigned long curr =3D begin; - void *old; - - for (;;) { - swp_entry_t entry =3D swp_entry(type, curr); - unsigned long index =3D curr & SWAP_ADDRESS_SPACE_MASK; - struct address_space *address_space =3D swap_address_space(entry); - XA_STATE(xas, &address_space->i_pages, index); - - xas_set_update(&xas, workingset_update_node); - - xa_lock_irq(&address_space->i_pages); - xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAG= ES)) { - if (!xa_is_value(old)) - continue; - xas_store(&xas, NULL); - } - xa_unlock_irq(&address_space->i_pages); + struct swap_cluster_info *ci =3D swp_cluster(entry); + unsigned int ci_off =3D swp_cluster_offset(entry), ci_end; =20 - /* search the next swapcache until we meet end */ - curr =3D ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES); - if (curr > end) - break; - } + ci_end =3D ci_off + nr_ents; + do { + WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off))); + __swap_table_init_null(ci, ci_off); + } while (++ci_off < ci_end); } =20 /* @@ -292,8 +320,7 @@ static inline bool swap_use_vma_readahead(void) /* * Update the readahead statistics of a vma or globally. */ -void swap_update_readahead(struct folio *folio, - struct vm_area_struct *vma, +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr) { bool readahead, vma_ra =3D swap_use_vma_readahead(); @@ -387,7 +414,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry= , gfp_t gfp_mask, goto put_and_return; =20 /* - * We might race against __delete_from_swap_cache(), and + * We might race against __swap_cache_del_folio(), and * stumble across a swap_map entry whose SWAP_HAS_CACHE * has not yet been cleared. Or race against another * __read_swap_cache_async(), which has set SWAP_HAS_CACHE @@ -405,8 +432,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry= , gfp_t gfp_mask, if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry)) goto fail_unlock; =20 - /* May fail (-ENOMEM) if XArray node allocation failed. 
*/ - if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &sha= dow)) + if (swap_cache_add_folio(entry, new_folio, &shadow)) goto fail_unlock; =20 memcg1_swapin(entry, 1); @@ -572,11 +598,11 @@ struct folio *swap_cluster_readahead(swp_entry_t entr= y, gfp_t gfp_mask, end_offset =3D si->max - 1; =20 blk_start_plug(&plug); - for (offset =3D start_offset; offset <=3D end_offset ; offset++) { + for (offset =3D start_offset; offset <=3D end_offset; offset++) { /* Ok, do the async read-ahead now */ folio =3D __read_swap_cache_async( - swp_entry(swp_type(entry), offset), - gfp_mask, mpol, ilx, &page_allocated, false); + swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx, + &page_allocated, false); if (!folio) continue; if (page_allocated) { @@ -600,41 +626,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry= , gfp_t gfp_mask, return folio; } =20 -int init_swap_address_space(unsigned int type, unsigned long nr_pages) -{ - struct address_space *spaces, *space; - unsigned int i, nr; - - nr =3D DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES); - spaces =3D kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL); - if (!spaces) - return -ENOMEM; - for (i =3D 0; i < nr; i++) { - space =3D spaces + i; - xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ); - atomic_set(&space->i_mmap_writable, 0); - space->a_ops =3D &swap_aops; - /* swap cache doesn't use writeback related tags */ - mapping_set_no_writeback_tags(space); - } - nr_swapper_spaces[type] =3D nr; - swapper_spaces[type] =3D spaces; - - return 0; -} - -void exit_swap_address_space(unsigned int type) -{ - int i; - struct address_space *spaces =3D swapper_spaces[type]; - - for (i =3D 0; i < nr_swapper_spaces[type]; i++) - VM_WARN_ON_ONCE(!mapping_empty(&spaces[i])); - kvfree(spaces); - nr_swapper_spaces[type] =3D 0; - swapper_spaces[type] =3D NULL; -} - static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start, unsigned long *end) { @@ -807,7 +798,7 @@ static const struct attribute_group swap_attr_group =3D= { .attrs =3D swap_attrs, }; =20 -static int __init swap_init_sysfs(void) +static int __init swap_init(void) { int err; struct kobject *swap_kobj; @@ -822,11 +813,13 @@ static int __init swap_init_sysfs(void) pr_err("failed to register swap group\n"); goto delete_obj; } + /* swap_space is set RO after init, so do it here before init ends. */ + mapping_set_no_writeback_tags(&swap_space); return 0; =20 delete_obj: kobject_put(swap_kobj); return err; } -subsys_initcall(swap_init_sysfs); +subsys_initcall(swap_init); #endif diff --git a/mm/swap_table.h b/mm/swap_table.h new file mode 100644 index 000000000000..ed9676547071 --- /dev/null +++ b/mm/swap_table.h @@ -0,0 +1,106 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _MM_SWAP_TABLE_H +#define _MM_SWAP_TABLE_H + +#include "swap.h" + +/* + * A swap table entry represents the status of a swap slot on a swap + * (physical or virtual) device. The swap table in each cluster is a + * 1:1 map of the swap slots in this cluster. + * + * Each swap table entry could be a pointer (folio), a XA_VALUE + * (shadow), or NULL. + */ + +/* + * Helpers for casting one type of info into a swap table entry. 
+ */ +static inline unsigned long null_to_swp_tb(void) +{ + BUILD_BUG_ON(sizeof(unsigned long) !=3D sizeof(atomic_long_t)); + return 0; +} + +static inline unsigned long folio_to_swp_tb(struct folio *folio) +{ + BUILD_BUG_ON(sizeof(unsigned long) !=3D sizeof(void *)); + return (unsigned long)folio; +} + +static inline unsigned long shadow_swp_to_tb(void *shadow) +{ + BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=3D + BITS_PER_BYTE * sizeof(unsigned long)); + VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow)); + return (unsigned long)shadow; +} + +/* + * Helpers for swap table entry type checking. + */ +static inline bool swp_tb_is_null(unsigned long swp_tb) +{ + return !swp_tb; +} + +static inline bool swp_tb_is_folio(unsigned long swp_tb) +{ + return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb); +} + +static inline bool swp_tb_is_shadow(unsigned long swp_tb) +{ + return xa_is_value((void *)swp_tb); +} + +/* + * Helpers for retrieving info from swap table. + */ +static inline struct folio *swp_tb_to_folio(unsigned long swp_tb) +{ + VM_WARN_ON(!swp_tb_is_folio(swp_tb)); + return (void *)swp_tb; +} + +static inline void *swp_tb_to_shadow(unsigned long swp_tb) +{ + VM_WARN_ON(!swp_tb_is_shadow(swp_tb)); + return (void *)swp_tb; +} + +/* + * Helpers for accessing or modifying the swap table of a cluster, + * the swap cluster must be locked. + */ +static inline void __swap_table_set(struct swap_cluster_info *ci, + unsigned int off, unsigned long swp_tb) +{ + VM_WARN_ON_ONCE(off >=3D SWAPFILE_CLUSTER); + atomic_long_set(&ci->table[off], swp_tb); +} + +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci, + unsigned int off) +{ + VM_WARN_ON_ONCE(off >=3D SWAPFILE_CLUSTER); + return atomic_long_read(&ci->table[off]); +} + +static inline void __swap_table_set_folio(struct swap_cluster_info *ci, + unsigned int off, struct folio *folio) +{ + __swap_table_set(ci, off, folio_to_swp_tb(folio)); +} + +static inline void __swap_table_set_shadow(struct swap_cluster_info *ci, + unsigned int off, void *shadow) +{ + __swap_table_set(ci, off, shadow_swp_to_tb(shadow)); +} + +static inline void __swap_table_init_null(struct swap_cluster_info *ci, un= signed int off) +{ + __swap_table_set(ci, off, null_to_swp_tb()); +} +#endif diff --git a/mm/swapfile.c b/mm/swapfile.c index 85606fbebf0f..df68b5e242a6 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -46,6 +46,7 @@ #include #include #include +#include "swap_table.h" #include "internal.h" #include "swap.h" =20 @@ -268,7 +269,7 @@ static int __try_to_reclaim_swap(struct swap_info_struc= t *si, if (!need_reclaim) goto out_unlock; =20 - delete_from_swap_cache(folio); + swap_cache_del_folio(folio); folio_set_dirty(folio); ret =3D nr_pages; out_unlock: @@ -422,6 +423,34 @@ static inline unsigned int cluster_offset(struct swap_= info_struct *si, return cluster_index(si, ci) * SWAPFILE_CLUSTER; } =20 +static int swap_table_alloc_table(struct swap_cluster_info *ci) +{ + WARN_ON(ci->table); + ci->table =3D kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNE= L); + if (!ci->table) + return -ENOMEM; + return 0; +} + +static void swap_cluster_free_table(struct swap_cluster_info *ci) +{ + unsigned int ci_off; + unsigned long swp_tb; + + if (!ci->table) + return; + + for (ci_off =3D 0; ci_off < SWAPFILE_CLUSTER; ci_off++) { + swp_tb =3D __swap_table_get(ci, ci_off); + if (!swp_tb_is_null(swp_tb)) + pr_err_once("swap: unclean swap space on swapoff: 0x%lx", + swp_tb); + } + + kfree(ci->table); + ci->table =3D NULL; +} + static void 
move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, enum swap_cluster_flags new_flags) @@ -704,6 +733,25 @@ static bool cluster_scan_range(struct swap_info_struct= *si, return true; } =20 +/* + * Currently, the swap table is not used for count tracking, + * just do a sanity check to ensure nothing went wrong. + */ +static void cluster_table_check(struct swap_cluster_info *ci, + unsigned int start, unsigned int nr) +{ + unsigned int ci_off =3D start % SWAPFILE_CLUSTER; + unsigned int ci_end =3D ci_off + nr; + unsigned long swp_tb; + + if (IS_ENABLED(CONFIG_DEBUG_VM)) { + do { + swp_tb =3D __swap_table_get(ci, ci_off); + VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); + } while (++ci_off < ci_end); + } +} + static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_c= luster_info *ci, unsigned int start, unsigned char usage, unsigned int order) @@ -723,6 +771,7 @@ static bool cluster_alloc_range(struct swap_info_struct= *si, struct swap_cluster ci->order =3D order; =20 memset(si->swap_map + start, usage, nr_pages); + cluster_table_check(ci, start, nr_pages); swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; =20 @@ -1100,8 +1149,7 @@ static void swap_range_alloc(struct swap_info_struct = *si, static void swap_range_free(struct swap_info_struct *si, unsigned long off= set, unsigned int nr_entries) { - unsigned long begin =3D offset; - unsigned long end =3D offset + nr_entries - 1; + unsigned long start =3D offset, end =3D offset + nr_entries - 1; void (*swap_slot_free_notify)(struct block_device *, unsigned long); unsigned int i; =20 @@ -1125,7 +1173,7 @@ static void swap_range_free(struct swap_info_struct *= si, unsigned long offset, swap_slot_free_notify(si->bdev, offset); offset++; } - clear_shadow_from_swap_cache(si->type, begin, end); + __swap_cache_clear_shadow(swp_entry(si->type, start), nr_entries); =20 /* * Make sure that try_to_unuse() observes si->inuse_pages reaching 0 @@ -1282,15 +1330,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp) if (!entry.val) return -ENOMEM; =20 - /* - * XArray node allocations from PF_MEMALLOC contexts could - * completely exhaust the page allocator. __GFP_NOMEMALLOC - * stops emergency reserves from being allocated. - * - * TODO: this could cause a theoretical memory reclaim - * deadlock in the swap out path. - */ - if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL)) + if (swap_cache_add_folio(entry, folio, NULL)) goto out_free; =20 return 0; @@ -1557,6 +1597,7 @@ static void swap_entries_free(struct swap_info_struct= *si, =20 mem_cgroup_uncharge_swap(entry, nr_pages); swap_range_free(si, offset, nr_pages); + cluster_table_check(ci, offset, nr_pages); =20 if (!ci->count) free_cluster(si, ci); @@ -1760,7 +1801,7 @@ bool folio_free_swap(struct folio *folio) if (folio_swapped(folio)) return false; =20 - delete_from_swap_cache(folio); + swap_cache_del_folio(folio); folio_set_dirty(folio); return true; } @@ -2634,6 +2675,18 @@ static void wait_for_allocation(struct swap_info_str= uct *si) } } =20 +static void free_cluster_info(struct swap_cluster_info *cluster_info, + unsigned long maxpages) +{ + int i, nr_clusters =3D DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); + + if (!cluster_info) + return; + for (i =3D 0; i < nr_clusters; i++) + swap_cluster_free_table(&cluster_info[i]); + kvfree(cluster_info); +} + /* * Called after swap device's reference count is dead, so * neither scan nor allocation will use it. 
@@ -2768,12 +2821,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, speci= alfile) =20 swap_file =3D p->swap_file; p->swap_file =3D NULL; - p->max =3D 0; swap_map =3D p->swap_map; p->swap_map =3D NULL; zeromap =3D p->zeromap; p->zeromap =3D NULL; cluster_info =3D p->cluster_info; + free_cluster_info(cluster_info, p->max); + p->max =3D 0; p->cluster_info =3D NULL; spin_unlock(&p->lock); spin_unlock(&swap_lock); @@ -2784,10 +2838,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specia= lfile) p->global_cluster =3D NULL; vfree(swap_map); kvfree(zeromap); - kvfree(cluster_info); /* Destroy swap account information */ swap_cgroup_swapoff(p->type); - exit_swap_address_space(p->type); =20 inode =3D mapping->host; =20 @@ -3171,8 +3223,11 @@ static struct swap_cluster_info *setup_clusters(stru= ct swap_info_struct *si, if (!cluster_info) goto err; =20 - for (i =3D 0; i < nr_clusters; i++) + for (i =3D 0; i < nr_clusters; i++) { spin_lock_init(&cluster_info[i].lock); + if (swap_table_alloc_table(&cluster_info[i])) + goto err_free; + } =20 if (!(si->flags & SWP_SOLIDSTATE)) { si->global_cluster =3D kmalloc(sizeof(*si->global_cluster), @@ -3233,9 +3288,8 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, } =20 return cluster_info; - err_free: - kvfree(cluster_info); + free_cluster_info(cluster_info, maxpages); err: return ERR_PTR(err); } @@ -3429,13 +3483,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, special= file, int, swap_flags) } } =20 - error =3D init_swap_address_space(si->type, maxpages); - if (error) - goto bad_swap_unlock_inode; - error =3D zswap_swapon(si->type, maxpages); if (error) - goto free_swap_address_space; + goto bad_swap_unlock_inode; =20 /* * Flush any pending IO and dirty mappings before we start using this @@ -3470,8 +3520,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialf= ile, int, swap_flags) goto out; free_swap_zswap: zswap_swapoff(si->type); -free_swap_address_space: - exit_swap_address_space(si->type); bad_swap_unlock_inode: inode_unlock(inode); bad_swap: @@ -3486,7 +3534,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialf= ile, int, swap_flags) spin_unlock(&swap_lock); vfree(swap_map); kvfree(zeromap); - kvfree(cluster_info); + if (cluster_info) + free_cluster_info(cluster_info, maxpages); if (inced_nr_rotate_swap) atomic_dec(&nr_rotate_swap); if (swap_file) diff --git a/mm/vmscan.c b/mm/vmscan.c index b0afd7f41a22..1ed3cf9dac4e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -730,13 +730,18 @@ static int __remove_mapping(struct address_space *map= ping, struct folio *folio, { int refcount; void *shadow =3D NULL; + struct swap_cluster_info *ci; =20 BUG_ON(!folio_test_locked(folio)); BUG_ON(mapping !=3D folio_mapping(folio)); =20 - if (!folio_test_swapcache(folio)) + if (folio_test_swapcache(folio)) { + ci =3D swap_cluster_lock_by_folio_irq(folio); + } else { spin_lock(&mapping->host->i_lock); - xa_lock_irq(&mapping->i_pages); + xa_lock_irq(&mapping->i_pages); + } + /* * The non racy check for a busy folio. 
* @@ -776,9 +781,9 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, =20 if (reclaimed && !mapping_exiting(mapping)) shadow =3D workingset_eviction(folio, target_memcg); - __delete_from_swap_cache(folio, swap, shadow); + __swap_cache_del_folio(swap, folio, shadow); memcg1_swapout(folio, swap); - xa_unlock_irq(&mapping->i_pages); + swap_cluster_unlock_irq(ci); put_swap_folio(folio, swap); } else { void (*free_folio)(struct folio *); @@ -816,9 +821,12 @@ static int __remove_mapping(struct address_space *mapp= ing, struct folio *folio, return 1; =20 cannot_free: - xa_unlock_irq(&mapping->i_pages); - if (!folio_test_swapcache(folio)) + if (folio_test_swapcache(folio)) { + swap_cluster_unlock_irq(ci); + } else { + xa_unlock_irq(&mapping->i_pages); spin_unlock(&mapping->host->i_lock); + } return 0; } =20 diff --git a/mm/zswap.c b/mm/zswap.c index ee443b317ac7..c869859eec77 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1166,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *= entry, =20 out: if (ret && ret !=3D -EEXIST) { - delete_from_swap_cache(folio); + swap_cache_del_folio(folio); folio_unlock(folio); } folio_put(folio); --=20 2.51.0
From nobody Fri Oct 3 23:02:52 2025
From: Kairui Song To: linux-mm@kvack.org Cc: Andrew Morton , Matthew Wilcox , Hugh Dickins , Chris Li , Barry Song , Baoquan He , Nhat Pham , Kemeng Shi , Baolin Wang , Ying Huang , Johannes Weiner , David Hildenbrand , Yosry Ahmed , Lorenzo Stoakes , Zi Yan , linux-kernel@vger.kernel.org, Kairui Song , kernel test robot Subject: [PATCH 7/9] mm, swap: remove contention workaround for swap cache Date: Sat, 23 Aug 2025 03:20:21 +0800 Message-ID: <20250822192023.13477-8-ryncsn@gmail.com> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20250822192023.13477-1-ryncsn@gmail.com> References: <20250822192023.13477-1-ryncsn@gmail.com>
From: Kairui Song
Swap cluster setup will try to shuffle the clusters on initialization. It was helpful to avoid contention for the swap cache space. The cluster size (2M) was much smaller than each swap cache space (64M), so shuffling the cluster means the allocator will try to allocate swap slots that are in different swap cache spaces for each CPU, reducing the chance of two CPUs using the same swap cache space, and hence reducing the contention. Now, swap cache is managed by swap clusters, this shuffle is pointless. Just remove it, and clean up related macros.
This should also improve the HDD swap performance as shuffling IO is a bad idea for HDD, and now the shuffling is gone. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com Signed-off-by: Kairui Song Acked-by: Chris Li Reviewed-by: Barry Song --- mm/swap.h | 4 ---- mm/swapfile.c | 32 ++++++++------------------------ mm/zswap.c | 7 +++++-- 3 files changed, 13 insertions(+), 30 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index 4af42bc2cd72..ce3ec62cc05e 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -153,10 +153,6 @@ int swap_writeout(struct folio *folio, struct swap_ioc= b **swap_plug); void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug); =20 /* linux/mm/swap_state.c */ -/* One swap address space for each 64M swap space */ -#define SWAP_ADDRESS_SPACE_SHIFT 14 -#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT) -#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1) extern struct address_space swap_space __ro_after_init; static inline struct address_space *swap_address_space(swp_entry_t entry) { diff --git a/mm/swapfile.c b/mm/swapfile.c index df68b5e242a6..0c8001c99f30 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3203,21 +3203,14 @@ static int setup_swap_map(struct swap_info_struct *= si, return 0; } =20 -#define SWAP_CLUSTER_INFO_COLS \ - DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info)) -#define SWAP_CLUSTER_SPACE_COLS \ - DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER) -#define SWAP_CLUSTER_COLS \ - max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS) - static struct swap_cluster_info *setup_clusters(struct swap_info_struct *s= i, union swap_header *swap_header, unsigned long maxpages) { unsigned long nr_clusters =3D DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); struct swap_cluster_info *cluster_info; - unsigned long i, j, idx; int err =3D -ENOMEM; + unsigned long i; =20 cluster_info =3D kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL); if (!cluster_info) @@ -3266,22 +3259,13 @@ static struct swap_cluster_info *setup_clusters(str= uct swap_info_struct *si, INIT_LIST_HEAD(&si->frag_clusters[i]); } =20 - /* - * Reduce false cache line sharing between cluster_info and - * sharing same address space. 
- */ - for (j =3D 0; j < SWAP_CLUSTER_COLS; j++) { - for (i =3D 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) { - struct swap_cluster_info *ci; - idx =3D i * SWAP_CLUSTER_COLS + j; - ci =3D cluster_info + idx; - if (idx >=3D nr_clusters) - continue; - if (ci->count) { - ci->flags =3D CLUSTER_FLAG_NONFULL; - list_add_tail(&ci->list, &si->nonfull_clusters[0]); - continue; - } + for (i =3D 0; i < nr_clusters; i++) { + struct swap_cluster_info *ci =3D &cluster_info[i]; + + if (ci->count) { + ci->flags =3D CLUSTER_FLAG_NONFULL; + list_add_tail(&ci->list, &si->nonfull_clusters[0]); + } else { ci->flags =3D CLUSTER_FLAG_FREE; list_add_tail(&ci->list, &si->free_clusters); } diff --git a/mm/zswap.c b/mm/zswap.c index c869859eec77..c0a9be14a725 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -237,10 +237,13 @@ static bool zswap_has_pool; * helpers and fwd declarations **********************************/ =20 +/* One swap address space for each 64M swap space */ +#define ZSWAP_ADDRESS_SPACE_SHIFT 14 +#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT) static inline struct xarray *swap_zswap_tree(swp_entry_t swp) { return &zswap_trees[swp_type(swp)][swp_offset(swp) - >> SWAP_ADDRESS_SPACE_SHIFT]; + >> ZSWAP_ADDRESS_SPACE_SHIFT]; } =20 #define zswap_pool_debug(msg, p) \ @@ -1771,7 +1774,7 @@ int zswap_swapon(int type, unsigned long nr_pages) struct xarray *trees, *tree; unsigned int nr, i; =20 - nr =3D DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES); + nr =3D DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES); trees =3D kvcalloc(nr, sizeof(*tree), GFP_KERNEL); if (!trees) { pr_err("alloc failed, zswap disabled for swap type %d\n", type); --=20 2.51.0
From nobody Fri Oct 3 23:02:52 2025
From: Kairui Song To: linux-mm@kvack.org Cc: Andrew Morton , Matthew Wilcox , Hugh Dickins , Chris Li , Barry Song , Baoquan He , Nhat Pham , Kemeng Shi , Baolin Wang , Ying Huang , Johannes Weiner , David Hildenbrand , Yosry Ahmed , Lorenzo Stoakes , Zi Yan , linux-kernel@vger.kernel.org, Kairui Song Subject: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table Date: Sat, 23 Aug 2025 03:20:22 +0800 Message-ID: <20250822192023.13477-9-ryncsn@gmail.com> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20250822192023.13477-1-ryncsn@gmail.com> References: <20250822192023.13477-1-ryncsn@gmail.com>
From: Kairui Song
Now that the swap table is cluster based, a free cluster can also free its swap table, since nothing should be modifying it. There can still be speculative readers, such as swap cache lookups; protect against them by making the table RCU safe. Every swap table must be filled with null entries before it is freed, so such readers will either see a NULL pointer or a null-filled table that is being lazily freed. On allocation, allocate the table only when a cluster is put to use by any order. This way, we can reduce the memory usage of large swap devices significantly. This idea of dynamically releasing unused swap cluster data was initially suggested by Chris Li while proposing the cluster swap allocator, and I found it suits the swap table idea very well.
Co-developed-by: Chris Li Signed-off-by: Chris Li Signed-off-by: Kairui Song Acked-by: Chris Li --- mm/swap.h | 2 +- mm/swap_state.c | 9 ++- mm/swap_table.h | 32 +++++++- mm/swapfile.c | 202 ++++++++++++++++++++++++++++++++++++++---------- 4 files changed, 197 insertions(+), 48 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index ce3ec62cc05e..ee33733027f4 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -36,7 +36,7 @@ struct swap_cluster_info { u16 count; u8 flags; u8 order; - atomic_long_t *table; /* Swap table entries, see mm/swap_table.h */ + atomic_long_t __rcu *table; /* Swap table entries, see mm/swap_table.h */ struct list_head list; }; =20 diff --git a/mm/swap_state.c b/mm/swap_state.c index c0342024b4a8..a0120d822fbe 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -87,7 +87,8 @@ struct folio *swap_cache_get_folio(swp_entry_t entry) struct folio *folio; =20 for (;;) { - swp_tb =3D __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry= )); + swp_tb =3D swap_table_get(swp_cluster(entry), + swp_cluster_offset(entry)); if (!swp_tb_is_folio(swp_tb)) return NULL; folio =3D swp_tb_to_folio(swp_tb); @@ -107,10 +108,9 @@ void *swap_cache_get_shadow(swp_entry_t entry) { unsigned long swp_tb; =20 - swp_tb =3D __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry)= ); + swp_tb =3D swap_table_get(swp_cluster(entry), swp_cluster_offset(entry)); if (swp_tb_is_shadow(swp_tb)) return swp_tb_to_shadow(swp_tb); - return NULL; } =20 @@ -135,6 +135,9 @@ int swap_cache_add_folio(swp_entry_t entry, struct foli= o *folio, void **shadowp) VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); =20 ci =3D swap_cluster_lock(swp_info(entry), swp_offset(entry)); + if (unlikely(!ci->table)) + goto fail; + ci_start =3D swp_cluster_offset(entry); ci_end =3D ci_start + nr_pages; ci_off =3D ci_start; diff --git a/mm/swap_table.h b/mm/swap_table.h index ed9676547071..4e97513b11ef 100644 --- a/mm/swap_table.h +++ b/mm/swap_table.h @@ -2,8 +2,15 @@ #ifndef _MM_SWAP_TABLE_H #define _MM_SWAP_TABLE_H =20 +#include +#include #include "swap.h" =20 +/* A typical flat array in each cluster as swap table */ +struct swap_table { + atomic_long_t entries[SWAPFILE_CLUSTER]; +}; + /* * A swap table entry represents the status of a swap slot on a swap * (physical or virtual) device.
The swap table in each cluster is a @@ -76,15 +83,36 @@ static inline void *swp_tb_to_shadow(unsigned long swp_= tb) static inline void __swap_table_set(struct swap_cluster_info *ci, unsigned int off, unsigned long swp_tb) { + atomic_long_t *table =3D rcu_dereference_protected(ci->table, true); + + lockdep_assert_held(&ci->lock); VM_WARN_ON_ONCE(off >=3D SWAPFILE_CLUSTER); - atomic_long_set(&ci->table[off], swp_tb); + atomic_long_set(&table[off], swp_tb); } =20 static inline unsigned long __swap_table_get(struct swap_cluster_info *ci, unsigned int off) { + atomic_long_t *table; + VM_WARN_ON_ONCE(off >=3D SWAPFILE_CLUSTER); - return atomic_long_read(&ci->table[off]); + table =3D rcu_dereference_check(ci->table, lockdep_is_held(&ci->lock)); + + return atomic_long_read(&table[off]); +} + +static inline unsigned long swap_table_get(struct swap_cluster_info *ci, + unsigned int off) +{ + atomic_long_t *table; + unsigned long swp_tb; + + rcu_read_lock(); + table =3D rcu_dereference(ci->table); + swp_tb =3D table ? atomic_long_read(&table[off]) : null_to_swp_tb(); + rcu_read_unlock(); + + return swp_tb; } =20 static inline void __swap_table_set_folio(struct swap_cluster_info *ci, diff --git a/mm/swapfile.c b/mm/swapfile.c index 0c8001c99f30..00651e947eb2 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -105,6 +105,8 @@ static DEFINE_SPINLOCK(swap_avail_lock); =20 struct swap_info_struct *swap_info[MAX_SWAPFILES]; =20 +static struct kmem_cache *swap_table_cachep; + static DEFINE_MUTEX(swapon_mutex); =20 static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait); @@ -402,10 +404,17 @@ static inline bool cluster_is_discard(struct swap_clu= ster_info *info) return info->flags =3D=3D CLUSTER_FLAG_DISCARD; } =20 +static inline bool cluster_table_is_alloced(struct swap_cluster_info *ci) +{ + return rcu_dereference_protected(ci->table, lockdep_is_held(&ci->lock)); +} + static inline bool cluster_is_usable(struct swap_cluster_info *ci, int ord= er) { if (unlikely(ci->flags > CLUSTER_FLAG_USABLE)) return false; + if (!cluster_table_is_alloced(ci)) + return false; if (!order) return true; return cluster_is_empty(ci) || order =3D=3D ci->order; @@ -423,32 +432,98 @@ static inline unsigned int cluster_offset(struct swap= _info_struct *si, return cluster_index(si, ci) * SWAPFILE_CLUSTER; } =20 -static int swap_table_alloc_table(struct swap_cluster_info *ci) +static void swap_cluster_free_table(struct swap_cluster_info *ci) { - WARN_ON(ci->table); - ci->table =3D kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNE= L); - if (!ci->table) - return -ENOMEM; - return 0; + unsigned int ci_off; + struct swap_table *table; + + /* Only empty cluster's table is allow to be freed */ + lockdep_assert_held(&ci->lock); + VM_WARN_ON_ONCE(!cluster_is_empty(ci)); + for (ci_off =3D 0; ci_off < SWAPFILE_CLUSTER; ci_off++) + VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off))); + table =3D (void *)rcu_dereference_protected(ci->table, true); + rcu_assign_pointer(ci->table, NULL); + + kmem_cache_free(swap_table_cachep, table); } =20 -static void swap_cluster_free_table(struct swap_cluster_info *ci) +/* + * Allocate a swap table may need to sleep, which leads to migration, + * so attempt an atomic allocation first then fallback and handle + * potential race. 
+ */ +static struct swap_cluster_info * +swap_cluster_alloc_table(struct swap_info_struct *si, + struct swap_cluster_info *ci, + int order) { - unsigned int ci_off; - unsigned long swp_tb; + struct swap_cluster_info *pcp_ci; + struct swap_table *table; + unsigned long offset; =20 - if (!ci->table) - return; + /* + * Only cluster isolation from the allocator does table allocation. + * Swap allocator uses a percpu cluster and holds the local lock. + */ + lockdep_assert_held(&ci->lock); + lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock); + + table =3D kmem_cache_zalloc(swap_table_cachep, + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); + if (table) { + rcu_assign_pointer(ci->table, table); + return ci; + } + + /* + * Try a sleep allocation. Each isolated free cluster may cause + * a sleep allocation, but there is a limited number of them, so + * the potential recursive allocation should be limited. + */ + spin_unlock(&ci->lock); + if (!(si->flags & SWP_SOLIDSTATE)) + spin_unlock(&si->global_cluster_lock); + local_unlock(&percpu_swap_cluster.lock); + table =3D kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL); =20 - for (ci_off =3D 0; ci_off < SWAPFILE_CLUSTER; ci_off++) { - swp_tb =3D __swap_table_get(ci, ci_off); - if (!swp_tb_is_null(swp_tb)) - pr_err_once("swap: unclean swap space on swapoff: 0x%lx", - swp_tb); + local_lock(&percpu_swap_cluster.lock); + if (!(si->flags & SWP_SOLIDSTATE)) + spin_lock(&si->global_cluster_lock); + /* + * Back to atomic context. First, check if we migrated to a new + * CPU with a usable percpu cluster. If so, try using that instead. + * No need to check it for the spinning device, as swap is + * serialized by the global lock on them. + * + * The is_usable check is a bit rough, but ensures order 0 success. + */ + offset =3D this_cpu_read(percpu_swap_cluster.offset[order]); + if ((si->flags & SWP_SOLIDSTATE) && offset) { + pcp_ci =3D swap_cluster_lock(si, offset); + if (cluster_is_usable(pcp_ci, order) && + pcp_ci->count < SWAPFILE_CLUSTER) { + ci =3D pcp_ci; + goto free_table; + } + swap_cluster_unlock(pcp_ci); } =20 - kfree(ci->table); - ci->table =3D NULL; + if (!table) + return NULL; + + spin_lock(&ci->lock); + /* Nothing should have touched the dangling empty cluster. */ + if (WARN_ON_ONCE(cluster_table_is_alloced(ci))) + goto free_table; + + rcu_assign_pointer(ci->table, table); + return ci; + +free_table: + if (table) + kmem_cache_free(swap_table_cachep, table); + return ci; } =20 static void move_cluster(struct swap_info_struct *si, @@ -480,7 +555,7 @@ static void swap_cluster_schedule_discard(struct swap_i= nfo_struct *si, =20 static void __free_cluster(struct swap_info_struct *si, struct swap_cluste= r_info *ci) { - lockdep_assert_held(&ci->lock); + swap_cluster_free_table(ci); move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE); ci->order =3D 0; } @@ -495,15 +570,11 @@ static void __free_cluster(struct swap_info_struct *s= i, struct swap_cluster_info * this returns NULL for an non-empty list. 
*/ static struct swap_cluster_info *isolate_lock_cluster( - struct swap_info_struct *si, struct list_head *list) + struct swap_info_struct *si, struct list_head *list, int order) { - struct swap_cluster_info *ci, *ret =3D NULL; + struct swap_cluster_info *ci, *found =3D NULL; =20 spin_lock(&si->lock); - - if (unlikely(!(si->flags & SWP_WRITEOK))) - goto out; - list_for_each_entry(ci, list, list) { if (!spin_trylock(&ci->lock)) continue; @@ -515,13 +586,19 @@ static struct swap_cluster_info *isolate_lock_cluster( =20 list_del(&ci->list); ci->flags =3D CLUSTER_FLAG_NONE; - ret =3D ci; + found =3D ci; break; } -out: spin_unlock(&si->lock); =20 - return ret; + if (found && !cluster_table_is_alloced(found)) { + /* Only an empty free cluster's swap table can be freed. */ + VM_WARN_ON_ONCE(list !=3D &si->free_clusters); + VM_WARN_ON_ONCE(!cluster_is_empty(found)); + return swap_cluster_alloc_table(si, found, order); + } + + return found; } =20 /* @@ -654,17 +731,27 @@ static void relocate_cluster(struct swap_info_struct = *si, * added to free cluster list and its usage counter will be increased by 1. * Only used for initialization. */ -static void inc_cluster_info_page(struct swap_info_struct *si, +static int inc_cluster_info_page(struct swap_info_struct *si, struct swap_cluster_info *cluster_info, unsigned long page_nr) { unsigned long idx =3D page_nr / SWAPFILE_CLUSTER; + struct swap_table *table; struct swap_cluster_info *ci; =20 ci =3D cluster_info + idx; + if (!ci->table) { + table =3D kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL); + if (!table) + return -ENOMEM; + rcu_assign_pointer(ci->table, table); + } + ci->count++; =20 VM_BUG_ON(ci->count > SWAPFILE_CLUSTER); VM_BUG_ON(ci->flags); + + return 0; } =20 static bool cluster_reclaim_range(struct swap_info_struct *si, @@ -845,7 +932,7 @@ static unsigned int alloc_swap_scan_list(struct swap_in= fo_struct *si, unsigned int found =3D SWAP_ENTRY_INVALID; =20 do { - struct swap_cluster_info *ci =3D isolate_lock_cluster(si, list); + struct swap_cluster_info *ci =3D isolate_lock_cluster(si, list, order); unsigned long offset; =20 if (!ci) @@ -870,7 +957,7 @@ static void swap_reclaim_full_clusters(struct swap_info= _struct *si, bool force) if (force) to_scan =3D swap_usage_in_pages(si) / SWAPFILE_CLUSTER; =20 - while ((ci =3D isolate_lock_cluster(si, &si->full_clusters))) { + while ((ci =3D isolate_lock_cluster(si, &si->full_clusters, 0))) { offset =3D cluster_offset(si, ci); end =3D min(si->max, offset + SWAPFILE_CLUSTER); to_scan--; @@ -1018,6 +1105,7 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o done: if (!(si->flags & SWP_SOLIDSTATE)) spin_unlock(&si->global_cluster_lock); + return found; } =20 @@ -1885,7 +1973,13 @@ swp_entry_t get_swap_page_of_type(int type) /* This is called for allocating swap entry, not cache */ if (get_swap_device_info(si)) { if (si->flags & SWP_WRITEOK) { + /* + * Grab the local lock to be complaint + * with swap table allocation. 
+ */ + local_lock(&percpu_swap_cluster.lock); offset =3D cluster_alloc_swap_entry(si, 0, 1); + local_unlock(&percpu_swap_cluster.lock); if (offset) { entry =3D swp_entry(si->type, offset); atomic_long_dec(&nr_swap_pages); @@ -2678,12 +2772,21 @@ static void wait_for_allocation(struct swap_info_st= ruct *si) static void free_cluster_info(struct swap_cluster_info *cluster_info, unsigned long maxpages) { + struct swap_cluster_info *ci; int i, nr_clusters =3D DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); =20 if (!cluster_info) return; - for (i =3D 0; i < nr_clusters; i++) - swap_cluster_free_table(&cluster_info[i]); + for (i =3D 0; i < nr_clusters; i++) { + ci =3D cluster_info + i; + /* Cluster with bad marks count will have a remaining table */ + spin_lock(&ci->lock); + if (rcu_dereference_protected(ci->table, true)) { + ci->count =3D 0; + swap_cluster_free_table(ci); + } + spin_unlock(&ci->lock); + } kvfree(cluster_info); } =20 @@ -2719,6 +2822,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special= file) struct address_space *mapping; struct inode *inode; struct filename *pathname; + unsigned int maxpages; int err, found =3D 0; =20 if (!capable(CAP_SYS_ADMIN)) @@ -2825,8 +2929,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special= file) p->swap_map =3D NULL; zeromap =3D p->zeromap; p->zeromap =3D NULL; + maxpages =3D p->max; cluster_info =3D p->cluster_info; - free_cluster_info(cluster_info, p->max); p->max =3D 0; p->cluster_info =3D NULL; spin_unlock(&p->lock); @@ -2838,6 +2942,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special= file) p->global_cluster =3D NULL; vfree(swap_map); kvfree(zeromap); + free_cluster_info(cluster_info, maxpages); /* Destroy swap account information */ swap_cgroup_swapoff(p->type); =20 @@ -3216,11 +3321,8 @@ static struct swap_cluster_info *setup_clusters(stru= ct swap_info_struct *si, if (!cluster_info) goto err; =20 - for (i =3D 0; i < nr_clusters; i++) { + for (i =3D 0; i < nr_clusters; i++) spin_lock_init(&cluster_info[i].lock); - if (swap_table_alloc_table(&cluster_info[i])) - goto err_free; - } =20 if (!(si->flags & SWP_SOLIDSTATE)) { si->global_cluster =3D kmalloc(sizeof(*si->global_cluster), @@ -3239,16 +3341,23 @@ static struct swap_cluster_info *setup_clusters(str= uct swap_info_struct *si, * See setup_swap_map(): header page, bad pages, * and the EOF part of the last cluster. */ - inc_cluster_info_page(si, cluster_info, 0); + err =3D inc_cluster_info_page(si, cluster_info, 0); + if (err) + goto err; for (i =3D 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr =3D swap_header->info.badpages[i]; =20 if (page_nr >=3D maxpages) continue; - inc_cluster_info_page(si, cluster_info, page_nr); + err =3D inc_cluster_info_page(si, cluster_info, page_nr); + if (err) + goto err; + } + for (i =3D maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) { + err =3D inc_cluster_info_page(si, cluster_info, i); + if (err) + goto err; } - for (i =3D maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) - inc_cluster_info_page(si, cluster_info, i); =20 INIT_LIST_HEAD(&si->free_clusters); INIT_LIST_HEAD(&si->full_clusters); @@ -3962,6 +4071,15 @@ static int __init swapfile_init(void) =20 swapfile_maximum_size =3D arch_max_swapfile_size(); =20 + /* + * Once a cluster is freed, it's swap table content is read + * only, and all swap cache readers (swap_cache_*) verifies + * the content before use. So it's safe to use RCU slab here. 
+ */ + swap_table_cachep =3D kmem_cache_create("swap_table", + sizeof(struct swap_table), + 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL); + #ifdef CONFIG_MIGRATION if (swapfile_maximum_size >=3D (1UL << SWP_MIG_TOTAL_BITS)) swap_migration_ad_supported =3D true; --=20 2.51.0
From nobody Fri Oct 3 23:02:52 2025
From: Kairui Song To: linux-mm@kvack.org Cc: Andrew Morton , Matthew Wilcox , Hugh Dickins , Chris Li , Barry Song , Baoquan He , Nhat Pham , Kemeng Shi , Baolin Wang , Ying Huang , Johannes Weiner , David Hildenbrand , Yosry Ahmed , Lorenzo Stoakes , Zi Yan , linux-kernel@vger.kernel.org, Kairui Song Subject: [PATCH 9/9] mm, swap: use a single page for swap table when the size fits Date: Sat, 23 Aug 2025 03:20:23 +0800 Message-ID: <20250822192023.13477-10-ryncsn@gmail.com> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20250822192023.13477-1-ryncsn@gmail.com> References: <20250822192023.13477-1-ryncsn@gmail.com>
From: Kairui Song
We have a cluster size of 512 slots. Each slot consumes 8 bytes in the swap table, so the swap table of each cluster is exactly one page (4K) in size. When that is the case, allocate one page directly and disable the slab cache, reducing the memory usage of the swap table and avoiding fragmentation.
Co-developed-by: Chris Li Signed-off-by: Chris Li Signed-off-by: Kairui Song Acked-by: Chris Li --- mm/swap_table.h | 2 ++ mm/swapfile.c | 50 ++++++++++++++++++++++++++++++++++++++++--------- 2 files changed, 43 insertions(+), 9 deletions(-) diff --git a/mm/swap_table.h b/mm/swap_table.h index 4e97513b11ef..984474e37dd7 100644 --- a/mm/swap_table.h +++ b/mm/swap_table.h @@ -11,6 +11,8 @@ struct swap_table { atomic_long_t entries[SWAPFILE_CLUSTER]; }; =20 +#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) =3D=3D PAGE_SIZE) + /* * A swap table entry represents the status of a swap slot on a swap * (physical or virtual) device.
The swap table in each cluster is a diff --git a/mm/swapfile.c b/mm/swapfile.c index 00651e947eb2..7539ee26d59a 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -432,6 +432,38 @@ static inline unsigned int cluster_offset(struct swap_= info_struct *si, return cluster_index(si, ci) * SWAPFILE_CLUSTER; } =20 +static struct swap_table *swap_table_alloc(gfp_t gfp) +{ + struct folio *folio; + + if (!SWP_TABLE_USE_PAGE) + return kmem_cache_zalloc(swap_table_cachep, gfp); + + folio =3D folio_alloc(gfp | __GFP_ZERO, 0); + if (folio) + return folio_address(folio); + return NULL; +} + +static void swap_table_free_folio_rcu_cb(struct rcu_head *head) +{ + struct folio *folio; + + folio =3D page_folio(container_of(head, struct page, rcu_head)); + folio_put(folio); +} + +static void swap_table_free(struct swap_table *table) +{ + if (!SWP_TABLE_USE_PAGE) { + kmem_cache_free(swap_table_cachep, table); + return; + } + + call_rcu(&(folio_page(virt_to_folio(table), 0)->rcu_head), + swap_table_free_folio_rcu_cb); +} + static void swap_cluster_free_table(struct swap_cluster_info *ci) { unsigned int ci_off; @@ -445,7 +477,7 @@ static void swap_cluster_free_table(struct swap_cluster= _info *ci) table =3D (void *)rcu_dereference_protected(ci->table, true); rcu_assign_pointer(ci->table, NULL); =20 - kmem_cache_free(swap_table_cachep, table); + swap_table_free(table); } =20 /* @@ -469,8 +501,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si, lockdep_assert_held(&ci->lock); lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock); =20 - table =3D kmem_cache_zalloc(swap_table_cachep, - __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); + table =3D swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); if (table) { rcu_assign_pointer(ci->table, table); return ci; @@ -485,7 +516,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si, if (!(si->flags & SWP_SOLIDSTATE)) spin_unlock(&si->global_cluster_lock); local_unlock(&percpu_swap_cluster.lock); - table =3D kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL); + table =3D swap_table_alloc(__GFP_HIGH | GFP_KERNEL); =20 local_lock(&percpu_swap_cluster.lock); if (!(si->flags & SWP_SOLIDSTATE)) @@ -522,7 +553,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si, =20 free_table: if (table) - kmem_cache_free(swap_table_cachep, table); + swap_table_free(table); return ci; } =20 @@ -740,7 +771,7 @@ static int inc_cluster_info_page(struct swap_info_struc= t *si, =20 ci =3D cluster_info + idx; if (!ci->table) { - table =3D kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL); + table =3D swap_table_alloc(GFP_KERNEL); if (!table) return -ENOMEM; rcu_assign_pointer(ci->table, table); @@ -4076,9 +4107,10 @@ static int __init swapfile_init(void) * only, and all swap cache readers (swap_cache_*) verifies * the content before use. So it's safe to use RCU slab here. */ - swap_table_cachep =3D kmem_cache_create("swap_table", - sizeof(struct swap_table), - 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL); + if (!SWP_TABLE_USE_PAGE) + swap_table_cachep =3D kmem_cache_create("swap_table", + sizeof(struct swap_table), + 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL); =20 #ifdef CONFIG_MIGRATION if (swapfile_maximum_size >=3D (1UL << SWP_MIG_TOTAL_BITS)) --=20 2.51.0
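A note on the size math behind SWP_TABLE_USE_PAGE in the last patch: with 512 slots per cluster and 8-byte atomic entries, one cluster's table is exactly 4096 bytes, which is why a raw page can back it instead of a slab object. Below is a minimal, standalone userspace sketch of that decision, not kernel code; the helper name table_alloc and the use of aligned_alloc/malloc are illustrative stand-ins for folio_alloc and kmem_cache_zalloc, and it assumes 4K pages and 64-bit longs.

/*
 * Userspace sketch of the SWP_TABLE_USE_PAGE decision: when the
 * per-cluster table is exactly one page, back it with a whole page,
 * otherwise fall back to an ordinary heap (slab-like) allocation.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE        4096UL
#define SWAPFILE_CLUSTER 512            /* slots per cluster */

struct swap_table {
        long entries[SWAPFILE_CLUSTER]; /* 8 bytes per slot on 64-bit */
};

#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)

/* Illustrative stand-in for folio_alloc() / kmem_cache_zalloc() */
static struct swap_table *table_alloc(void)
{
        struct swap_table *table;

        if (SWP_TABLE_USE_PAGE)
                table = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
        else
                table = malloc(sizeof(*table));
        if (table)
                memset(table, 0, sizeof(*table)); /* tables start null-filled */
        return table;
}

int main(void)
{
        struct swap_table *table = table_alloc();

        printf("per-cluster table: %zu bytes, backed by a full page: %s\n",
               sizeof(struct swap_table), SWP_TABLE_USE_PAGE ? "yes" : "no");
        free(table);
        return 0;
}

In the series itself the page-backed case is freed through call_rcu() on the page's rcu_head so speculative readers stay safe; this sketch does not model that part.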