From nobody Mon Jun 8 12:13:56 2026
From: fujunjie
To: Andrew Morton, linux-mm@kvack.org, Alexandre Ghiti, Kairui Song, Usama Arif
Cc: Chris Li, Johannes Weiner, Yosry Ahmed, Nhat Pham, David Hildenbrand, Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [RFC PATCH v2 1/9] mm/zswap: expose range state for swapin policy
Date: Fri, 29 May 2026 12:19:20 +0000
X-OQ-MSGID: <20260529121928.4115683-1-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Large folio swapin needs to know whether a candidate swap range is fully
backed by zswap before it can choose an order. That decision should stay
in common swapin code, not inside zswap.

Export two zswap facts for that caller: a lockless range occupancy
snapshot and the current zswap reclaim-pressure state.

The range state is advisory only. Writeback or invalidation can change
the backend after the snapshot, so users must recheck before issuing
large-folio IO.

Signed-off-by: fujunjie
---
 include/linux/zswap.h | 26 +++++++++++++++++++++++++
 mm/zswap.c            | 44 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..8f9aee97517c 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -9,6 +9,18 @@ struct lruvec;
=20
 extern atomic_long_t zswap_stored_pages;
=20
+/*
+ * Advisory zswap occupancy snapshot for a swap range. This is not a compl=
ete
+ * backend classifier; callers must recheck before depending on ALL_ZSWAP =
for
+ * large-folio IO.
+ */
+enum zswap_range_state {
+	ZSWAP_RANGE_NEVER_ENABLED,
+	ZSWAP_RANGE_NO_ZSWAP,
+	ZSWAP_RANGE_ALL_ZSWAP,
+	ZSWAP_RANGE_MIXED,
+};
+
 #ifdef CONFIG_ZSWAP
=20
 struct zswap_lruvec_state {
@@ -27,6 +39,9 @@ struct zswap_lruvec_state {
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
 int zswap_load(struct folio *folio);
+enum zswap_range_state zswap_probe_range(swp_entry_t swp,
+					 unsigned int nr_pages);
+bool zswap_pool_reclaim_pressure(void);
 void zswap_invalidate(swp_entry_t swp);
 int zswap_swapon(int type, unsigned long nr_pages);
 void zswap_swapoff(int type);
@@ -49,6 +64,17 @@ static inline int zswap_load(struct folio *folio)
 	return -ENOENT;
 }
=20
+static inline enum zswap_range_state zswap_probe_range(swp_entry_t swp,
+						       unsigned int nr_pages)
+{
+	return ZSWAP_RANGE_NEVER_ENABLED;
+}
+
+static inline bool zswap_pool_reclaim_pressure(void)
+{
+	return false;
+}
+
 static inline void zswap_invalidate(swp_entry_t swp) {}
 static inline int zswap_swapon(int type, unsigned long nr_pages)
 {
diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e0a3..da5297f7bd69 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -506,6 +506,19 @@ unsigned long zswap_total_pages(void)
 	return total;
 }
=20
+/*
+ * Expose whether zswap reclaim pressure is active. This is a backend fact:
+ * zswap_check_limits() sets the state once the pool reaches the hard limi=
t and
+ * keeps it set until the pool falls below the accept threshold.
+ */
+bool zswap_pool_reclaim_pressure(void)
+{
+	if (zswap_never_enabled())
+		return false;
+
+	return READ_ONCE(zswap_pool_reached_full);
+}
+
 static bool zswap_check_limits(void)
 {
 	unsigned long cur_pages =3D zswap_total_pages();
@@ -1559,6 +1572,37 @@ bool zswap_store(struct folio *folio)
 	return ret;
 }
=20
+enum zswap_range_state zswap_probe_range(swp_entry_t swp,
+					 unsigned int nr_pages)
+{
+	unsigned int type =3D swp_type(swp);
+	pgoff_t offset =3D swp_offset(swp);
+	bool present =3D false, missing =3D false;
+	unsigned int i;
+
+	/*
+	 * This is an advisory, lockless snapshot for common swapin admission.
+	 * Callers must recheck before depending on an all-zswap range for IO:
+	 * concurrent writeback or invalidation can change the backend state.
+	 */
+	if (zswap_never_enabled())
+		return ZSWAP_RANGE_NEVER_ENABLED;
+
+	for (i =3D 0; i < nr_pages; i++) {
+		struct xarray *tree =3D swap_zswap_tree(swp_entry(type, offset + i));
+
+		if (xa_load(tree, offset + i))
+			present =3D true;
+		else
+			missing =3D true;
+
+		if (present && missing)
+			return ZSWAP_RANGE_MIXED;
+	}
+
+	return present ? ZSWAP_RANGE_ALL_ZSWAP : ZSWAP_RANGE_NO_ZSWAP;
+}
+
 /**
  * zswap_load() - load a folio from zswap
  * @folio: folio to load
--=20
2.34.1

From nobody Mon Jun 8 12:13:56 2026
From: fujunjie
To: Andrew Morton, linux-mm@kvack.org, Alexandre Ghiti, Kairui Song, Usama Arif
Cc: Chris Li, Johannes Weiner, Yosry Ahmed, Nhat Pham, David Hildenbrand, Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [RFC PATCH v2 2/9] mm: let swap_read_folio() report retryable zswap races
Date: Fri, 29 May 2026 12:19:21 +0000
X-OQ-MSGID: <20260529121928.4115683-2-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Large zswap loads need a way to ask the caller to drop a speculative
large swapcache folio and retry order-0. A void swap_read_folio() cannot
express that without turning a backend race into an IO failure.

Return int from swap_read_folio() and reserve -EAGAIN for retryable
large zswap races. Existing order-0 paths keep treating the read as
before; the synchronous swapin path only warns for now. A later patch
will consume -EAGAIN and retry order-0.

Signed-off-by: fujunjie
---
 mm/page_io.c    | 19 +++++++++++++++++--
 mm/swap.h       |  5 +++--
 mm/swap_state.c | 13 +++++++++++--
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index f2d8fe7fd057..16724bdfb400 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -653,13 +653,21 @@ static void swap_read_folio_bdev_async(struct folio *=
folio,
 	submit_bio(bio);
 }
=20
-void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
+/*
+ * Return -EAGAIN only when a locked large swapcache folio hit a retryable
+ * zswap backend race. The caller owns that still-locked folio and must dr=
op or
+ * retry it. Other zswap errors are still reported through the usual folio
+ * state: the folio is unlocked without PG_uptodate and the fault path will
+ * turn that into an I/O error.
+ */
+int swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
 	struct swap_info_struct *sis =3D __swap_entry_to_info(folio->swap);
 	bool synchronous =3D sis->flags & SWP_SYNCHRONOUS_IO;
 	bool workingset =3D folio_test_workingset(folio);
 	unsigned long pflags;
 	bool in_thrashing;
+	int ret =3D 0;
=20
 	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -681,8 +689,14 @@ void swap_read_folio(struct folio *folio, struct swap_=
iocb **plug)
 		goto finish;
 	}
=20
-	if (zswap_load(folio) !=3D -ENOENT)
+	ret =3D zswap_load(folio);
+	if (ret =3D=3D -EAGAIN) {
+		VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
 		goto finish;
+	}
+	if (ret !=3D -ENOENT)
+		goto finish;
+	ret =3D 0;
=20
 	/* We have to read from slower devices. Increase zswap protection. */
 	zswap_folio_swapin(folio);
@@ -701,6 +715,7 @@ void swap_read_folio(struct folio *folio, struct swap_i=
ocb **plug)
 		psi_memstall_leave(&pflags);
 	}
 	delayacct_swapin_end();
+	return ret;
 }
=20
 void __swap_read_unplug(struct swap_iocb *sio)
diff --git a/mm/swap.h b/mm/swap.h
index 77d2d14eda42..ea7e1f3c4410 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -241,7 +241,7 @@ extern void __swap_cluster_free_entries(struct swap_inf=
o_struct *si,
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
-void swap_read_folio(struct folio *folio, struct swap_iocb **plug);
+int swap_read_folio(struct folio *folio, struct swap_iocb **plug);
 void __swap_read_unplug(struct swap_iocb *plug);
 static inline void swap_read_unplug(struct swap_iocb *plug)
 {
@@ -381,8 +381,9 @@ static inline void folio_put_swap(struct folio *folio, =
struct page *page)
 {
 }
=20
-static inline void swap_read_folio(struct folio *folio, struct swap_iocb *=
*plug)
+static inline int swap_read_folio(struct folio *folio, struct swap_iocb **=
plug)
 {
+	return 0;
 }
=20
 static inline void swap_write_unplug(struct swap_iocb *sio)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 04f5ce992401..d37097913b30 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -628,6 +628,7 @@ static struct folio *swap_cache_read_folio(swp_entry_t =
entry, gfp_t gfp,
 					   struct swap_iocb **plug, bool readahead)
 {
 	struct folio *folio;
+	int ret;
=20
 	do {
 		folio =3D swap_cache_get_folio(entry);
@@ -639,7 +640,13 @@ static struct folio *swap_cache_read_folio(swp_entry_t=
 entry, gfp_t gfp,
 	if (IS_ERR_OR_NULL(folio))
 		return NULL;
=20
-	swap_read_folio(folio, plug);
+	ret =3D swap_read_folio(folio, plug);
+	/*
+	 * Swap readahead allocates order-0 folios. -EAGAIN is reserved for
+	 * retryable large zswap backend races and must be handled by the
+	 * synchronous common swapin path.
+	 */
+	VM_WARN_ON_ONCE(ret =3D=3D -EAGAIN);
 	if (readahead) {
 		folio_set_readahead(folio);
 		count_vm_event(SWAP_RA);
@@ -668,6 +675,7 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp,=
 unsigned long orders,
 			  struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx)
 {
 	struct folio *folio;
+	int ret;
=20
 	do {
 		folio =3D swap_cache_get_folio(entry);
@@ -679,7 +687,8 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp,=
 unsigned long orders,
 	if (IS_ERR(folio))
 		return folio;
=20
-	swap_read_folio(folio, NULL);
+	ret =3D swap_read_folio(folio, NULL);
+	VM_WARN_ON_ONCE(ret =3D=3D -EAGAIN);
 	return folio;
 }
=20
--=20
2.34.1

From nobody Mon Jun 8 12:13:56 2026
From: fujunjie
To: Andrew Morton, linux-mm@kvack.org, Alexandre Ghiti, Kairui Song, Usama Arif
Cc: Chris Li, Johannes Weiner, Yosry Ahmed, Nhat Pham, David Hildenbrand, Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [RFC PATCH v2 3/9] mm/zswap: support fully zswap-backed large folio loads
Date: Fri, 29 May 2026 12:19:22 +0000
X-OQ-MSGID: <20260529121928.4115683-3-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

zswap currently refuses large swapcache folios. That is correct for
mixed backend ranges, but it also prevents the common swapin path from
loading a range that is still fully backed by zswap.

Teach zswap_load() to fill a locked large swapcache folio by
decompressing each base-page entry into the matching folio offset, then
flushing the folio once. A missing entry after zswap data has been seen
is reported as -EAGAIN so the caller can drop the speculative large
folio and retry order-0.

The large load keeps the zswap entries in place. It is a clean
speculative fill: until the swap slots are freed, zswap remains the
backing copy if reclaim drops the large folio before PTEs are installed.

Signed-off-by: fujunjie
---
 mm/zswap.c | 105 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 87 insertions(+), 18 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index da5297f7bd69..94ba112a2982 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -15,6 +15,8 @@
=20
 #include
 #include
+#include
+#include
 #include
 #include
 #include
@@ -934,7 +936,8 @@ static bool zswap_compress(struct page *page, struct zs=
wap_entry *entry,
 	return comp_ret =3D=3D 0 && alloc_ret =3D=3D 0;
 }
=20
-static bool zswap_decompress(struct zswap_entry *entry, struct folio *foli=
o)
+static bool zswap_decompress(struct zswap_entry *entry, struct folio *foli=
o,
+			     unsigned int page_idx, bool flush_dcache)
 {
 	struct zswap_pool *pool =3D entry->pool;
 	struct scatterlist input[2]; /* zsmalloc returns an SG list 1-2 entries */
@@ -952,14 +955,15 @@ static bool zswap_decompress(struct zswap_entry *entr=
y, struct folio *folio)
=20
 		WARN_ON_ONCE(input->length !=3D PAGE_SIZE);
=20
-		dst =3D kmap_local_folio(folio, 0);
+		dst =3D kmap_local_folio(folio, page_idx * PAGE_SIZE);
 		memcpy_from_sglist(dst, input, 0, PAGE_SIZE);
 		dlen =3D PAGE_SIZE;
 		kunmap_local(dst);
-		flush_dcache_folio(folio);
+		if (flush_dcache)
+			flush_dcache_folio(folio);
 	} else {
 		sg_init_table(&output, 1);
-		sg_set_folio(&output, folio, PAGE_SIZE, 0);
+		sg_set_folio(&output, folio, PAGE_SIZE, page_idx * PAGE_SIZE);
 		acomp_request_set_params(acomp_ctx->req, input, &output,
 					 entry->length, PAGE_SIZE);
 		ret =3D crypto_acomp_decompress(acomp_ctx->req);
@@ -1042,7 +1046,7 @@ static int zswap_writeback_entry(struct zswap_entry *=
entry,
 		goto out;
 	}
=20
-	if (!zswap_decompress(entry, folio)) {
+	if (!zswap_decompress(entry, folio, 0, true)) {
 		ret =3D -EIO;
 		goto out;
 	}
@@ -1615,10 +1619,9 @@ enum zswap_range_state zswap_probe_range(swp_entry_t=
 swp,
  * NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_pa=
ge()
  * will SIGBUS).
  *
- * -EINVAL: if the swapped out content was in zswap, but the page belongs
- * to a large folio, which is not supported by zswap. The folio is unlock=
ed,
- * but NOT marked up-to-date, so that an IO error is emitted (e.g.
- * do_swap_page() will SIGBUS).
+ * -EAGAIN: if the swapped out content belongs to a large folio, but the
+ * range is mixed or raced with writeback. The folio remains locked so the
+ * caller can drop the large swapcache folio and retry order-0.
  *
  * -ENOENT: if the swapped out content was not in zswap. The folio remains
  * locked on return.
@@ -1626,9 +1629,12 @@ enum zswap_range_state zswap_probe_range(swp_entry_t=
 swp,
 int zswap_load(struct folio *folio)
 {
 	swp_entry_t swp =3D folio->swap;
+	unsigned int nr_pages =3D folio_nr_pages(folio);
+	unsigned int type =3D swp_type(swp);
 	pgoff_t offset =3D swp_offset(swp);
-	struct xarray *tree =3D swap_zswap_tree(swp);
+	struct xarray *tree;
 	struct zswap_entry *entry;
+	unsigned int i;
=20
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1636,21 +1642,84 @@ int zswap_load(struct folio *folio)
 	if (zswap_never_enabled())
 		return -ENOENT;
=20
-	/*
-	 * Large folios should not be swapped in while zswap is being used, as
-	 * they are not properly handled. Zswap does not properly load large
-	 * folios, and a large folio may only be partially in zswap.
-	 */
-	if (WARN_ON_ONCE(folio_test_large(folio))) {
+	if (folio_test_large(folio)) {
+		struct obj_cgroup *first_objcg =3D NULL;
+		bool same_objcg =3D true;
+		bool saw_zswap =3D false;
+		bool saw_non_zswap =3D false;
+
+		/*
+		 * The locked large swapcache folio now covers the range and
+		 * conflicts with zswap writeback's order-0 swapcache allocation.
+		 * If the range is mixed or an entry disappears, retry order-0.
+		 */
+		for (i =3D 0; i < nr_pages; i++) {
+			tree =3D swap_zswap_tree(swp_entry(type, offset + i));
+			entry =3D xa_load(tree, offset + i);
+			if (!entry) {
+				if (saw_zswap)
+					return -EAGAIN;
+				saw_non_zswap =3D true;
+				continue;
+			}
+			if (saw_non_zswap)
+				return -EAGAIN;
+
+			if (!saw_zswap)
+				first_objcg =3D entry->objcg;
+			else if (entry->objcg !=3D first_objcg)
+				same_objcg =3D false;
+			saw_zswap =3D true;
+		}
+		if (!saw_zswap)
+			return -ENOENT;
+
+		for (i =3D 0; i < nr_pages; i++) {
+			tree =3D swap_zswap_tree(swp_entry(type, offset + i));
+			entry =3D xa_load(tree, offset + i);
+			if (!entry)
+				return -EAGAIN;
+
+			if (!zswap_decompress(entry, folio, i, false)) {
+				folio_unlock(folio);
+				return -EIO;
+			}
+		}
+
+		flush_dcache_folio(folio);
+		/*
+		 * Keep zswap entries until swap slots are freed. This is a clean
+		 * speculative fill; zswap remains the backing copy if reclaim
+		 * drops the large folio before PTEs are installed.
+		 */
+		folio_mark_uptodate(folio);
+		count_vm_events(ZSWPIN, nr_pages);
+		count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
+
+		if (same_objcg) {
+			if (first_objcg)
+				count_objcg_events(first_objcg, ZSWPIN, nr_pages);
+		} else {
+			for (i =3D 0; i < nr_pages; i++) {
+				tree =3D swap_zswap_tree(swp_entry(type, offset + i));
+				entry =3D xa_load(tree, offset + i);
+				if (WARN_ON_ONCE(!entry))
+					continue;
+				if (entry->objcg)
+					count_objcg_events(entry->objcg, ZSWPIN, 1);
+			}
+		}
+
 		folio_unlock(folio);
-		return -EINVAL;
+		return 0;
 	}
=20
+	tree =3D swap_zswap_tree(swp);
 	entry =3D xa_load(tree, offset);
 	if (!entry)
 		return -ENOENT;
=20
-	if (!zswap_decompress(entry, folio)) {
+	if (!zswap_decompress(entry, folio, 0, true)) {
 		folio_unlock(folio);
 		return -EIO;
 	}
--=20
2.34.1

From nobody Mon Jun 8 12:13:56 2026
From: fujunjie
To: Andrew Morton, linux-mm@kvack.org, Alexandre Ghiti, Kairui Song, Usama Arif
Cc: Chris Li, Johannes Weiner, Yosry Ahmed, Nhat Pham, David Hildenbrand, Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [RFC PATCH v2 4/9] mm: admit large swapin by backend range in swapin_sync()
Date: Fri, 29 May 2026 12:19:23 +0000
X-OQ-MSGID: <20260529121928.4115683-4-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

A large swapin can only read one folio when the whole range has
compatible backing. Mixed zswap/disk ranges must not reach large-folio
IO, and zswap range probes are only snapshots.

Filter the orders passed to swap_cache_alloc_folio() in swapin_sync().
Uniform zeromap ranges and all-disk ranges keep the existing large
swapin path. Fully zswap-backed ranges may be tried. Mixed zswap/disk
ranges fall back before allocation.

After a large swapcache folio is installed, recheck the zswap range and
drop the fresh folio if it became mixed. Also consume -EAGAIN from
swap_read_folio() the same way. Both cases retry order-0, where each
slot can resolve its current backend independently.

Signed-off-by: fujunjie
---
 mm/memcontrol-v1.c |   8 ++-
 mm/memory.c        |  31 ++++++++-
 mm/swap_state.c    | 169 ++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 194 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 765069211567..5b11b8055c66 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -682,8 +682,8 @@ void __memcg1_swapout(struct folio *folio, struct swap_=
cluster_info *ci)
  * memcg1_swapin - uncharge swap slot on swapin
  * @folio: folio being swapped in
  *
- * Call this function after successfully adding the charged
- * folio to swapcache.
+ * Call this after the charged folio has been added to swapcache and the c=
aller
+ * is no longer going to drop it back to swapped-out state.
  *
  * Context: The folio has to be in swap cache and locked.
  */
@@ -721,7 +721,9 @@ void memcg1_swapin(struct folio *folio)
 	id =3D __swap_cgroup_clear(ci, swp_cluster_offset(folio->swap),
 				nr_pages);
 	swap_cluster_unlock(ci);
-	mem_cgroup_uncharge_swap(id, nr_pages);
+
+	if (id)
+		mem_cgroup_uncharge_swap(id, nr_pages);
 }
 #endif
=20
diff --git a/mm/memory.c b/mm/memory.c
index 5a365492a9a2..d73a19692dea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4538,6 +4538,24 @@ static inline bool should_try_to_free_swap(struct sw=
ap_info_struct *si,
 		folio_ref_count(folio) =3D=3D (extra_refs + folio_nr_pages(folio));
 }
=20
+static void memcg1_swapin_retry_folio(struct folio *folio,
+				      struct vm_fault *vmf)
+{
+	if (!folio_test_large(folio) || !folio_test_swapcache(folio))
+		return;
+
+	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
+		if (!folio_trylock(folio))
+			return;
+	} else {
+		folio_lock(folio);
+	}
+
+	if (folio_test_large(folio) && folio_test_swapcache(folio))
+		memcg1_swapin(folio);
+	folio_unlock(folio);
+}
+
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
 {
 	vmf->pte =3D pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -4857,8 +4875,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
=20
 	swapcache =3D folio;
 	ret |=3D folio_lock_or_retry(folio, vmf);
-	if (ret & VM_FAULT_RETRY)
+	if (ret & VM_FAULT_RETRY) {
+		memcg1_swapin_retry_folio(folio, vmf);
 		goto out_release;
+	}
=20
 	page =3D folio_file_page(folio, swp_offset(entry));
 	/*
@@ -5067,6 +5087,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(folio !=3D swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
+		if (folio_test_large(swapcache))
+			memcg1_swapin(swapcache);
 		folio_put_swap(swapcache, NULL);
 	} else if (!folio_test_anon(folio)) {
 		/*
@@ -5076,6 +5098,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) !=3D nr_pages, folio);
 		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+		if (folio_test_large(folio))
+			memcg1_swapin(folio);
 		folio_put_swap(folio, NULL);
 	} else {
 		VM_WARN_ON_ONCE(nr_pages !=3D 1 && nr_pages !=3D folio_nr_pages(folio));
@@ -5132,8 +5156,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out_page:
-	if (folio_test_swapcache(folio))
+	if (folio_test_swapcache(folio)) {
+		if (folio_test_large(folio))
+			memcg1_swapin(folio);
 		folio_free_swap(folio);
+	}
 	folio_unlock(folio);
 out_release:
 	folio_put(folio);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d37097913b30..f03ad4832f16 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "internal.h"
 #include "swap_table.h"
@@ -403,7 +404,8 @@ void __swap_cache_replace_folio(struct swap_cluster_inf=
o *ci,
 static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
 					swp_entry_t targ_entry, gfp_t gfp,
 					unsigned int order, struct vm_fault *vmf,
-					struct mempolicy *mpol, pgoff_t ilx)
+					struct mempolicy *mpol, pgoff_t ilx,
+					bool defer_memcg1_swapin)
 {
 	int err;
 	swp_entry_t entry;
@@ -466,7 +468,8 @@ static struct folio *__swap_cache_alloc(struct swap_clu=
ster_info *ci,
 	}
=20 /* memsw uncharges swap when folio is added to swap cache */ - memcg1_swapin(folio); + if (!defer_memcg1_swapin || !order) + memcg1_swapin(folio); if (shadow) workingset_refault(folio, shadow); =20 @@ -495,9 +498,12 @@ static struct folio *__swap_cache_alloc(struct swap_cl= uster_info *ci, * Return: Returns the folio if allocation succeeded and folio is in the s= wap * cache. Returns error code if failed due to race, OOM or invalid argumen= ts. */ -struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp, - unsigned long orders, struct vm_fault *vmf, - struct mempolicy *mpol, pgoff_t ilx) +static struct folio *__swap_cache_alloc_folio(swp_entry_t targ_entry, + gfp_t gfp, unsigned long orders, + struct vm_fault *vmf, + struct mempolicy *mpol, + pgoff_t ilx, + bool defer_memcg1_swapin) { int order, err; struct folio *ret; @@ -512,7 +518,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_e= ntry, gfp_t gfp, =20 do { ret =3D __swap_cache_alloc(ci, targ_entry, gfp, order, - vmf, mpol, ilx); + vmf, mpol, ilx, + defer_memcg1_swapin); if (!IS_ERR(ret)) break; err =3D PTR_ERR(ret); @@ -525,6 +532,124 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ= _entry, gfp_t gfp, return ret; } =20 +struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp, + unsigned long orders, struct vm_fault *vmf, + struct mempolicy *mpol, pgoff_t ilx) +{ + return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf, + mpol, ilx, false); +} + +static struct folio *swap_cache_alloc_speculative_folio(swp_entry_t targ_e= ntry, + gfp_t gfp, + unsigned long orders, + struct vm_fault *vmf, + struct mempolicy *mpol, + pgoff_t ilx) +{ + /* + * Speculative large swapin may drop this fresh swapcache folio and + * retry order-0 after backend or page-table revalidation. Keep the + * cgroup v1 memsw swap owner until the caller commits the folio. 
+ */ + return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf, + mpol, ilx, true); +} + +static bool swapin_zeromap_same(swp_entry_t entry, unsigned int nr_pages) +{ + unsigned int ci_start =3D swp_cluster_offset(entry); + struct swap_cluster_info *ci =3D __swap_entry_to_cluster(entry); + bool is_zero; + unsigned int i; + + if (ci_start + nr_pages > SWAPFILE_CLUSTER) { + VM_WARN_ON_ONCE(1); + return false; + } + + rcu_read_lock(); + if (!rcu_dereference(ci->table)) { + rcu_read_unlock(); + return true; + } + + is_zero =3D __swap_table_test_zero(ci, ci_start); + for (i =3D 1; i < nr_pages; i++) { + if (is_zero !=3D __swap_table_test_zero(ci, ci_start + i)) { + rcu_read_unlock(); + return false; + } + } + rcu_read_unlock(); + + return true; +} + +static unsigned long swapin_admit_orders(swp_entry_t entry, + unsigned long orders) +{ + unsigned long candidates =3D orders & ~BIT(0); + unsigned long admitted =3D orders & BIT(0); + int order; + + if (!candidates) + return orders; + + while (candidates) { + enum zswap_range_state state; + unsigned int nr_pages; + swp_entry_t range_entry; + bool admit =3D false; + + order =3D fls_long(candidates) - 1; + if (order > MAX_PAGE_ORDER) { + candidates &=3D ~BIT(order); + continue; + } + + nr_pages =3D 1U << order; + range_entry =3D swp_entry(swp_type(entry), + round_down(swp_offset(entry), nr_pages)); + if (!swapin_zeromap_same(range_entry, nr_pages)) + goto next; + + state =3D zswap_probe_range(range_entry, nr_pages); + switch (state) { + case ZSWAP_RANGE_MIXED: + break; + case ZSWAP_RANGE_ALL_ZSWAP: + case ZSWAP_RANGE_NEVER_ENABLED: + case ZSWAP_RANGE_NO_ZSWAP: + admit =3D true; + break; + } + +next: + if (admit) + admitted |=3D BIT(order); + else + count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK); + candidates &=3D ~BIT(order); + } + + return admitted ? 
admitted : BIT(0); +} + +static bool zswap_needs_order0_retry(struct folio *folio) +{ + if (!folio_test_large(folio)) + return false; + + /* + * Admission sees only an advisory zswap snapshot. Recheck after the + * large swapcache folio is installed; if the range became mixed, drop + * the fresh folio before IO and let order-0 handle each slot. + */ + return zswap_probe_range(folio->swap, folio_nr_pages(folio)) =3D=3D + ZSWAP_RANGE_MIXED; +} + /* * If we are the only user, then try to free up the swap cache. * @@ -634,7 +759,8 @@ static struct folio *swap_cache_read_folio(swp_entry_t = entry, gfp_t gfp, folio =3D swap_cache_get_folio(entry); if (folio) return folio; - folio =3D swap_cache_alloc_folio(entry, gfp, BIT(0), NULL, mpol, ilx); + folio =3D swap_cache_alloc_folio(entry, gfp, BIT(0), NULL, + mpol, ilx); } while (PTR_ERR(folio) =3D=3D -EEXIST); =20 if (IS_ERR_OR_NULL(folio)) @@ -677,18 +803,43 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gf= p, unsigned long orders, struct folio *folio; int ret; =20 + orders =3D swapin_admit_orders(entry, orders); +again: do { folio =3D swap_cache_get_folio(entry); if (folio) return folio; - folio =3D swap_cache_alloc_folio(entry, gfp, orders, vmf, mpol, ilx); + folio =3D swap_cache_alloc_speculative_folio(entry, gfp, orders, + vmf, mpol, ilx); } while (PTR_ERR(folio) =3D=3D -EEXIST); =20 if (IS_ERR(folio)) return folio; =20 + if (zswap_needs_order0_retry(folio)) { + count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN_FALLBACK); + /* + * The folio is newly allocated, locked, clean and not uptodate; + * no data has been read into it. Removing it only restores the + * swap table entries so order-0 swapin can resolve a backend + * race without attempting speculative large-folio zswapin. 
+ */ + swap_cache_del_folio(folio); + folio_unlock(folio); + folio_put(folio); + orders =3D BIT(0); + goto again; + } + ret =3D swap_read_folio(folio, NULL); - VM_WARN_ON_ONCE(ret =3D=3D -EAGAIN); + if (ret =3D=3D -EAGAIN) { + count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN_FALLBACK); + swap_cache_del_folio(folio); + folio_unlock(folio); + folio_put(folio); + orders =3D BIT(0); + goto again; + } return folio; } =20 --=20 2.34.1 From nobody Mon Jun 8 12:13:56 2026 Received: from xmbghk7.mail.qq.com (xmbghk7.mail.qq.com [43.163.128.48]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9B9BE3438BD; Fri, 29 May 2026 12:19:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=43.163.128.48 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780057183; cv=none; b=OKFcv5ewCkaTdaR/5+GnoTDV9dt6jM7xyy0eAQhfEx2oFmVxF068lSrZeCPafcqo+5gsbefaJvpEGqcgCWAXme5vAOlm5GUeWU7hnDjoinkJqvKzy1zNO2n1UPos7vljiwj/3J3Ue4LD6yvdU6KRP0ou9NBYHgnE01ffAVkNG3M= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780057183; c=relaxed/simple; bh=IJIhxNj6wWIRUVVmjKYBMeKl3sJ/RkXIW4Hu88GF9gE=; h=Message-ID:From:To:Cc:Subject:Date:In-Reply-To:References: MIME-Version; b=Pqz2tQMM+4HOdlTjWH3oOqkimIk+E5iqyimD93/K6c/VVR3Uvv1zO2hEXIyIPWVVv/FgwVViSbjzyKR39rp4YfEFunX2r2+7aonI6XLZ5whmAeUKlSETanzaQrGiedbfFE4lq94z+F4cpNDpU+2zX6Ls/opYcS6uQOj5dim9O7c= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=qq.com; spf=pass smtp.mailfrom=qq.com; dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b=h+Xilcb+; arc=none smtp.client-ip=43.163.128.48 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=qq.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=qq.com Authentication-Results: 
smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b="h+Xilcb+" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qq.com; s=s201512; t=1780057179; bh=e9pDd68+wLvH4AeGJEXTLZCNqdsKkVQuDmz0y/Qyl40=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=h+Xilcb+Fe2a+TdyND9eRmzpcWn65K1nBdf+U1Tcpxrle9mkaWuaGs92Dj7+4Gyqu E5kqBNXOpgr4kNi7T3fhli4vTM6GVQ2gt42yEsJ1O+AzgjDPfPS6p4L/Zf9KtjNZMy Bx2TH95F4j3um3bRRG5JuxuisRPgCw5F6ZQVWuXw= Received: from node68.. ([166.111.236.25]) by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP id 4DC10017; Fri, 29 May 2026 20:19:28 +0800 X-QQ-mid: xmsmtpt1780057176tr24p40mn Message-ID: X-QQ-XMAILINFO: NcJw5CIirWgGxMPMMf2Xe4hBEBXn4hAHfolRSxm6+zPzISbV8Hly/4GctJa4j0 /Zwax7rBPKgCvjRhIpWIEMfK/hLRiJNnJv5iYpCP8i5S5K3nMVpulonB9o0ul+xA9hcQ0mx+X+85 2NJdkYsR/B9ahT5scviN33mGJYnW4QltRFCRA/lTjDQyeTozqDPEHYSzmODT52L4xGBHcR4WnJaF NhiOv+sE1NQE9qdszpvr0iVpMg8Pg4hGvHwTZAYU/iWhgHiNF55eMcX50UFh1lRdS/1nkur5lydG FLVGBLwrVzv+ga0RF6bxF7z4ZioHDKh0OD31pp0hlwJyixBEswr0ReNnl/sQqaY5YJqwyMEfdekZ BxIHOMNKyY9mOQOAFul5mbS43GsrLLEZHPI3LG3oHH9JFzJCLc7NBbvGNyw2ymKgu5pYqJcFhPYg kqyDna8DHjyvVjont2VGFEVUjYSRU2XMZZWUA9y+4pG8beI1CPfssyimdEl6HmNaMfyxfF9t/3Yc tEF5zoBXdSUa77Ch8bC1GoOt1OsGGkoANEY+PUbZ2Euob8C4ZQXrFYl8TfFBPIw6vFsbr0zBx+bg kLjBaUboDP11xAE1vE1nGBGjt9PGzGtCxUfbzmwXXnXOrhjRPzBOCj4gsNWALJu4p7eL21EPDvmQ yfjvioRg6edGBBhgP+AUxlOWr30REOTMsq1ImSZqKgUENMxZdrua4namZUD2CBC0ig8kQu7LKTnG GgNKS3nkTVMEYshUKswxb8TGE/0LPku+SBz63hq8MjjbqqDvDG62UnFwV9QAYSYmgo/iK7qkwQyv Kpug2b7ABBs7Ngszgk0QwvkTIIEMLxv3f+VgAERJ2tQLbyc6++FXndclzN1PUsN64zhsQvfNabqE Y+vXnSkAEg914R0pjjmh1S0keSAE5pBD+4U2HAZ4NDfZSLV9QsnaOzuy0CYJj0fEDjpSVjTRUFRC NuZ8cFSN+FpmpCHw9H+z+PS6RIuqlUi3KihEhPHgtzjuv+AgbUeKiC4tZfg6FUWgmuP4QpqeST3T uVTGS3pWJ1c3t/Vd5ezCpKIta1kwPhccBSJQYQaFnXzjMy7UEUUwf1rvHM7Ih/TUBiGqSBww== X-QQ-XMRINFO: NyFYKkN4Ny6FuXrnB5Ye7Aabb3ujjtK+gg== From: fujunjie To: Andrew Morton , linux-mm@kvack.org, Alexandre Ghiti , Kairui Song , Usama Arif Cc: Chris Li , Johannes Weiner , 
Yosry Ahmed , Nhat Pham , David Hildenbrand , Hugh Dickins , Roman Gushchin , Shakeel Butt , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org Subject: [RFC PATCH v2 5/9] mm: add common locality admission for zswap large swapin Date: Fri, 29 May 2026 12:19:24 +0000 X-OQ-MSGID: <20260529121928.4115683-5-fujunjie1@qq.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Fully zswap-backed ranges are safe to load as a large folio only when the caller has a reason to expect the neighbouring slots to be useful. Otherwise a sparse refault can turn one 4K demand fault into a 64K decompression and swapcache fill. Add a common admission gate for zswap-backed large swapin. The common layer keeps backend checks, the 64K cap, recent-refault rejection, and zswap reclaim-pressure rejection. It consumes a caller-provided locality order mask instead of looking at anon or shmem state directly. Callers pass no locality evidence for now, so this patch only installs the common policy hook. Later patches add anon and shmem producers. 
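The gate described above can be sketched as a pure function in userspace. This is an illustration under stated assumptions, not the kernel implementation: MODEL_PAGE_SIZE assumes 4K pages, struct admit_inputs and zswap_admit() are hypothetical names, and the refault and reclaim-pressure checks are reduced to precomputed booleans rather than shadow-entry and pool lookups. It shows the order of the checks: the 64K cap, the caller-provided locality mask, recent-refault rejection, then reclaim-pressure rejection.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_PAGE_SIZE		4096UL		/* assumption: 4K pages */
#define MODEL_ZSWAP_MAX_SIZE	(64UL * 1024)	/* the 64K cap */
#define MODEL_MAX_PAGE_ORDER	10U

static unsigned int model_ilog2(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Largest order whose folio size stays within the cap. */
static unsigned int zswap_max_order(void)
{
	if (MODEL_PAGE_SIZE >= MODEL_ZSWAP_MAX_SIZE)
		return 0;
	return model_ilog2(MODEL_ZSWAP_MAX_SIZE / MODEL_PAGE_SIZE);
}

struct admit_inputs {
	unsigned long locality_orders;	/* orders with caller evidence */
	bool vma_rand_read;		/* models VM_RAND_READ on the VMA */
	bool recent_refault;		/* recent workingset shadow in range */
	bool reclaim_pressure;		/* zswap pool under reclaim pressure */
};

/* Decide whether a fully zswap-backed range of @order may be tried. */
static bool zswap_admit(unsigned int order, const struct admit_inputs *in)
{
	if (order > zswap_max_order())
		return false;
	if (!order || order > MODEL_MAX_PAGE_ORDER)
		return false;
	if (in->vma_rand_read)
		return false;
	if (!(in->locality_orders & (1UL << order)))
		return false;
	if (in->recent_refault)
		return false;
	if (in->reclaim_pressure)
		return false;
	return true;
}
```

Under the 4K-page assumption the cap works out to order 4. With an empty locality mask, which is what callers pass in this patch, the model rejects every order, matching the statement that only the policy hook is installed here.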
Signed-off-by: fujunjie --- mm/memory.c | 2 +- mm/shmem.c | 2 +- mm/swap.h | 8 ++-- mm/swap_state.c | 118 ++++++++++++++++++++++++++++++++++++++++++++---- 4 files changed, 117 insertions(+), 13 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index d73a19692dea..92a82008d583 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4849,7 +4849,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) folio =3D swapin_sync(entry, GFP_HIGHUSER_MOVABLE, thp_swapin_suitable_orders(vmf) | BIT(0), - vmf, NULL, 0); + 0, vmf, NULL, 0); else folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); =20 diff --git a/mm/shmem.c b/mm/shmem.c index 56c23a7b15c7..fa99b48ed62b 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2031,7 +2031,7 @@ static struct folio *shmem_swap_alloc_folio(struct in= ode *inode, =20 again: mpol =3D shmem_get_pgoff_policy(info, index, order, &ilx); - folio =3D swapin_sync(entry, gfp, BIT(order), vmf, mpol, ilx); + folio =3D swapin_sync(entry, gfp, BIT(order), 0, vmf, mpol, ilx); mpol_cond_put(mpol); =20 if (!IS_ERR(folio)) diff --git a/mm/swap.h b/mm/swap.h index ea7e1f3c4410..dd35a310d06d 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -323,9 +323,10 @@ struct folio *read_swap_cache_async(swp_entry_t entry,= gfp_t gfp_mask, struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, struct mempolicy *mpol, pgoff_t ilx); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, - struct vm_fault *vmf); + struct vm_fault *vmf); struct folio *swapin_sync(swp_entry_t entry, gfp_t flag, unsigned long ord= ers, - struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx); + unsigned long locality_orders, struct vm_fault *vmf, + struct mempolicy *mpol, pgoff_t ilx); void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr); =20 @@ -418,7 +419,8 @@ static inline struct folio *swapin_readahead(swp_entry_= t swp, gfp_t gfp_mask, =20 static inline struct folio *swapin_sync( swp_entry_t 
entry, gfp_t flag, unsigned long orders, - struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx) + unsigned long locality_orders, struct vm_fault *vmf, + struct mempolicy *mpol, pgoff_t ilx) { return NULL; } diff --git a/mm/swap_state.c b/mm/swap_state.c index f03ad4832f16..5a4ca289009a 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include "internal.h" @@ -556,6 +557,24 @@ static struct folio *swap_cache_alloc_speculative_foli= o(swp_entry_t targ_entry, mpol, ilx, true); } =20 +/* + * Initial conservative cap for speculative zswap large swapin. Locality + * evidence is supplied by the caller or by generic VMA hints; the common + * swapin layer keeps backend safety and pressure decisions here. + */ +#define SWAPIN_ZSWAP_MAX_SIZE SZ_64K +#if PAGE_SIZE < SWAPIN_ZSWAP_MAX_SIZE +#define SWAPIN_ZSWAP_MAX_ORDER \ + ilog2(SWAPIN_ZSWAP_MAX_SIZE / PAGE_SIZE) +#else +#define SWAPIN_ZSWAP_MAX_ORDER 0 +#endif + +struct zswap_admit_ctx { + bool pressure_checked; + bool reclaim_pressure; +}; + static bool swapin_zeromap_same(swp_entry_t entry, unsigned int nr_pages) { unsigned int ci_start =3D swp_cluster_offset(entry); @@ -586,11 +605,84 @@ static bool swapin_zeromap_same(swp_entry_t entry, un= signed int nr_pages) return true; } =20 +static bool swapin_zswap_locality(struct vm_fault *vmf, unsigned int order, + unsigned long locality_orders) +{ + struct vm_area_struct *vma =3D vmf ? 
vmf->vma : NULL; + + if (!order || order > MAX_PAGE_ORDER) + return false; + + if (vma && (vma->vm_flags & VM_RAND_READ)) + return false; + + return locality_orders & BIT(order); +} + +static bool swapin_zswap_refaulted(swp_entry_t entry, unsigned int nr_page= s) +{ + unsigned int type =3D swp_type(entry); + pgoff_t offset =3D swp_offset(entry); + unsigned int i; + + for (i =3D 0; i < nr_pages; i++) { + bool workingset; + void *shadow; + + shadow =3D swap_cache_get_shadow(swp_entry(type, offset + i)); + if (!shadow) + continue; + if (workingset_test_recent(shadow, false, &workingset, false) && + workingset) + return true; + } + + return false; +} + +static bool swapin_zswap_admit(swp_entry_t entry, + unsigned int order, unsigned int nr_pages, + struct vm_fault *vmf, + unsigned long locality_orders, + struct zswap_admit_ctx *ctx) +{ + if (order > SWAPIN_ZSWAP_MAX_ORDER) + return false; + + /* + * Treat zswap-backed large swapin as speculative. The common layer + * consumes caller-provided locality orders, but does not inspect + * anon-specific PTE state or shmem-specific mapping state directly. + */ + if (!swapin_zswap_locality(vmf, order, locality_orders)) + return false; + + /* + * A recent workingset refault shadow in the target range means reclaim + * already saw churn there. Keep the refault path narrow instead of + * speculatively decompressing neighbouring slots. 
+ */ + if (swapin_zswap_refaulted(entry, nr_pages)) + return false; + + if (!ctx->pressure_checked) { + ctx->reclaim_pressure =3D zswap_pool_reclaim_pressure(); + ctx->pressure_checked =3D true; + } + if (ctx->reclaim_pressure) + return false; + + return true; +} + static unsigned long swapin_admit_orders(swp_entry_t entry, - unsigned long orders) + unsigned long orders, + struct vm_fault *vmf, + unsigned long locality_orders) { unsigned long candidates =3D orders & ~BIT(0); unsigned long admitted =3D orders & BIT(0); + struct zswap_admit_ctx zswap_ctx =3D {}; int order; =20 if (!candidates) @@ -616,9 +708,14 @@ static unsigned long swapin_admit_orders(swp_entry_t e= ntry, =20 state =3D zswap_probe_range(range_entry, nr_pages); switch (state) { + case ZSWAP_RANGE_ALL_ZSWAP: + admit =3D swapin_zswap_admit(range_entry, order, + nr_pages, vmf, + locality_orders, + &zswap_ctx); + break; case ZSWAP_RANGE_MIXED: break; - case ZSWAP_RANGE_ALL_ZSWAP: case ZSWAP_RANGE_NEVER_ENABLED: case ZSWAP_RANGE_NO_ZSWAP: admit =3D true; @@ -769,8 +866,8 @@ static struct folio *swap_cache_read_folio(swp_entry_t = entry, gfp_t gfp, ret =3D swap_read_folio(folio, plug); /* * Swap readahead allocates order-0 folios. -EAGAIN is reserved for - * retryable large zswap backend races and must be handled by the - * synchronous common swapin path. + * retryable large zswap backend races and should never escape to this + * order-0 path. 
*/ VM_WARN_ON_ONCE(ret =3D=3D -EAGAIN); if (readahead) { @@ -786,6 +883,7 @@ static struct folio *swap_cache_read_folio(swp_entry_t = entry, gfp_t gfp, * @entry: swap entry indicating the target slot * @gfp: memory allocation flags * @orders: allocation orders + * @locality_orders: orders with caller-provided locality evidence * @vmf: fault information * @mpol: NUMA memory allocation policy to be applied * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE @@ -794,16 +892,20 @@ static struct folio *swap_cache_read_folio(swp_entry_= t entry, gfp_t gfp, * existing folio in the swap cache for @entry. This initiates the IO, too, * if needed. @entry is rounded down if @orders allow large allocation. * - * Context: Caller must ensure @entry is valid and pin the swap device wit= h refcount. + * Context: Caller must ensure @entry is valid and pin the swap device with + * refcount. * Return: Returns the folio on success, error code if failed. */ -struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orde= rs, - struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx) +struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, + unsigned long orders, + unsigned long locality_orders, + struct vm_fault *vmf, struct mempolicy *mpol, + pgoff_t ilx) { struct folio *folio; int ret; =20 - orders =3D swapin_admit_orders(entry, orders); + orders =3D swapin_admit_orders(entry, orders, vmf, locality_orders); again: do { folio =3D swap_cache_get_folio(entry); --=20 2.34.1 From nobody Mon Jun 8 12:13:56 2026 Received: from xmbghk7.mail.qq.com (xmbghk7.mail.qq.com [43.163.128.48]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 964623DBD53; Fri, 29 May 2026 12:19:43 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=43.163.128.48 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780057186; cv=none; 
b=hMexri45wr8wHD5sZZ/BYji4OyaSSFvIJwuvyIVVxBvhrqC7UdnVCeMoYQxI71j1XEa8LiUv1S9CJ6f33dyinvWAmWLI1b+82dkq50T5XNT7id8TK8LAJ7NQns9Jzjk2J2QdfIraXJ/eQIYDztaDXUylgYdneJKt8XMFUSgTVhw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780057186; c=relaxed/simple; bh=XIjrOmuzKDqZwfu/U4RkH6jnfzV1A1isv3f+BqgqDpg=; h=Message-ID:From:To:Cc:Subject:Date:In-Reply-To:References: MIME-Version; b=EEqIo00Pqpfvr7iorGtimbdwrCoiBOKKkfwWR56dlCHxH8qt3D3X8+fP0nlxir01yKSlY+YqZtyEVN6ANHqSnghgkxHmtaTcZbTtnqc2MG/g/DJqF11SN3r3i4sa3YQQNlCBY/2B1kojjL/vV9Sxs/QjUxdufM67uoKuAvFZtv4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=qq.com; spf=pass smtp.mailfrom=qq.com; dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b=bg/vJ0vD; arc=none smtp.client-ip=43.163.128.48 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=qq.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=qq.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b="bg/vJ0vD" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qq.com; s=s201512; t=1780057181; bh=LLCdcZM28LSqnhskCBNjIY/IATM4Pd8LdF/tlghQRJQ=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=bg/vJ0vD8Wq0E8xQ9lzRiF2CcmID31e+p9BP697CZf008EDAhLiqgi/G+SyfO1EhC jpiezYKBUAoDoo7KWwZBFkvdR7QdfLPc/tNIhgLI8JXSAii3DfHe/TekIyu78dmIF/ nbi5hAtlAIzswGMAolgpOS3lTeuahX2KL+SqvQHA= Received: from node68.. 
([166.111.236.25]) by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP id 4DC10017; Fri, 29 May 2026 20:19:28 +0800 X-QQ-mid: xmsmtpt1780057178t8d71qf7g Message-ID: X-QQ-XMAILINFO: MZtEYADUG4AgFKeqSv1IeByIiaUnNl54ksdfcW2cSvrmU+UkYbeDznU9ZeKWcW Fa+snv+S5Hh8iRLGeTdFPQy6v/uDbbAxElaP+GdWL1jv1GUoon3bMyomyKlDb2joL4deC9UcFUdh jrdkn+9GB2y7uy8jMSO4LCWrJYqQGdJDfr8aB6b72wsxjDj7tKka3KOqSo6BWX9s7hqEhBZVAPgy 1pw3i5IAr3kIYT9bV9ab8m6+4IgWGGl1WWDCWrxWSf3ubJGR0Z+dMQNcK7VU80LdrMqVwQZNklUi X4gLiztR5flLNtzfpqa6lOakdPqCcC7xpvSEfL7oDo4eWtiX1Eg/RdDAwdXEHKwlFn2SCPCI0N97 sg4GjU2G5ub0CsBjw1/zGN1nSuwFoI0ghjrq59bSLAeP3vDn5PlvT0Q1sfI+TP8r22UhUpOLvA18 K8UF5A5fbdIxWom4TQ6Js1pxoEQ1/lLgQ2J4j9Lfqqnk6ifMuJ6zu4x8OEHsg54zPX2m9Eiy+fVD nh/AXqAxQbyhoMa0DMQfWw3ti7N2vi9FQbsobRalnefxIJ+Dhi8DUaSnZLb1lKxql47IeTggC27y wu92J78jkOkEWMKCB1M7KPEPoEJ6YGNJbpp7UrdVdCtx4bbjBC03uLD9zVNVWpOzEei/UT3ygdHx U5HMqjD1WbgIagPg6JFuenf62rVAoZA21QX3Dvs+g21NLdmVZNiEIgxX4pgW/Cb+FNqB3gG8Kkgj e8VTkq7zSTolqLjFHp1IYBVnwurn9rABUxFkkPkSsu6kw2ep9ZrBXdlwzbnYEKX8hIsars4l9K6M 3kPCF6HRnRI32pJ2S39kD8tX6AxN+KwCuOhnfTI2fxXZh8l9TjdjEGlLlh91KfsRILyQfq3ks0Fx brQNXSIztgvz7hIwZ55iTToNhhx+q6tB0LFUeiCaDUl0RNTaazo3ZCrVvI9AT2WGNOq+vHXYvEZQ tUXcB3RfkPaFY8/jt4xGm0ItYEiEj/v7tJg7OCHN3+doHSHZ6UPsYHJaAVXMK0A5PBo+aYufpYAH 72vYPw5VAu82yPkzMmsZMq4Z5kYCXYxkvBQ66bPrfog+4FWHHNCUP+McfAgLuzn4OXouHWnQ== X-QQ-XMRINFO: Nq+8W0+stu50tPAe92KXseR0ZZmBTk3gLg== From: fujunjie To: Andrew Morton , linux-mm@kvack.org, Alexandre Ghiti , Kairui Song , Usama Arif Cc: Chris Li , Johannes Weiner , Yosry Ahmed , Nhat Pham , David Hildenbrand , Hugh Dickins , Roman Gushchin , Shakeel Butt , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org Subject: [RFC PATCH v2 6/9] mm: provide anon locality evidence for zswap large swapin Date: Fri, 29 May 2026 12:19:25 +0000 X-OQ-MSGID: <20260529121928.4115683-6-fujunjie1@qq.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: 
MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The common zswap large-swapin policy needs locality evidence from callers before it can admit a large folio. For anonymous faults, provide that evidence from existing VMA hints and from the PTE young state left by earlier zswap-backed large swapins. Keep non-faulting PTEs old when mapping a speculative all-zswap large folio. A later fault can then require a dense young previous range before admitting another large swapin without adding VMA state. This also removes the old zswap-enabled guard from the THP swapin candidate scan. The common swapin path now classifies the backend range and falls back to order-0 for mixed zswap/disk ranges or races. Signed-off-by: fujunjie --- mm/memory.c | 234 +++++++++++++++++++++++++++++++++++++++++++----- mm/swap.h | 6 ++ mm/swap_state.c | 15 ++++ 3 files changed, 235 insertions(+), 20 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 92a82008d583..7bbb89632000 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4556,6 +4556,35 @@ static void memcg1_swapin_retry_folio(struct folio *= folio, folio_unlock(folio); } =20 +static void set_swapin_ptes(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, pte_t pte, + unsigned int nr_pages, unsigned int fault_pte_idx, + bool fault_only_young) +{ + struct mm_struct *mm =3D vma->vm_mm; + pte_t old_pte; + + if (!fault_only_young || nr_pages =3D=3D 1) { + set_ptes(mm, address, ptep, pte, nr_pages); + return; + } + + old_pte =3D pte_mkold(pte); + if (fault_pte_idx) + set_ptes(mm, address, ptep, old_pte, fault_pte_idx); + + set_pte_at(mm, address + fault_pte_idx * PAGE_SIZE, + ptep + fault_pte_idx, + pte_mkyoung(pte_advance_pfn(pte, fault_pte_idx))); + + fault_pte_idx++; + if (fault_pte_idx < nr_pages) + set_ptes(mm, address + fault_pte_idx * PAGE_SIZE, + ptep + fault_pte_idx, + pte_advance_pfn(old_pte, fault_pte_idx), + nr_pages - fault_pte_idx); +} + static vm_fault_t 
pte_marker_clear(struct vm_fault *vmf) { vmf->pte =3D pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, @@ -4628,6 +4657,157 @@ static vm_fault_t handle_pte_marker(struct vm_fault= *vmf) } =20 #ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define SWAPIN_ANON_YOUNG_MIN_PERCENT 75 +#define SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT 2 + +static bool swapin_anon_prev_young_dense(struct vm_fault *vmf, + unsigned int order) +{ + struct vm_area_struct *vma; + unsigned int nr_pages; + unsigned int threshold; + unsigned long size; + unsigned long base, prev, addr; + struct folio *first =3D NULL; + unsigned int present =3D 0; + unsigned int young =3D 0; + pmd_t *pmd; + pmd_t pmdval; + spinlock_t *ptl; /* protects the previous PTE range */ + pte_t *ptep; + unsigned int i; + + if (!IS_ENABLED(CONFIG_MMU) || !arch_has_hw_pte_young() || !vmf || + !vmf->vma || !vmf->pmd || !order || order > MAX_PAGE_ORDER) + return false; + + nr_pages =3D 1U << order; + threshold =3D DIV_ROUND_UP(nr_pages * + SWAPIN_ANON_YOUNG_MIN_PERCENT, 100); + size =3D PAGE_SIZE << order; + + vma =3D vmf->vma; + base =3D ALIGN_DOWN(vmf->address, size); + if (base < size) + return false; + + prev =3D base - size; + if (prev < vma->vm_start || prev + size > vma->vm_end) + return false; + + pmd =3D vmf->pmd; + if ((prev & PMD_MASK) !=3D (base & PMD_MASK)) { + pmd =3D mm_find_pmd(vma->vm_mm, prev); + if (!pmd) + return false; + } + + pmdval =3D pmdp_get_lockless(pmd); + if (!pmd_present(pmdval) || pmd_leaf(pmdval)) + return false; + + ptep =3D pte_offset_map_lock(vma->vm_mm, pmd, prev, &ptl); + if (!ptep) + return false; + + for (i =3D 0, addr =3D prev; i < nr_pages; i++, addr +=3D PAGE_SIZE) { + struct folio *folio; + pte_t pte =3D ptep_get(ptep + i); + + if (!pte_present(pte)) + break; + + folio =3D vm_normal_folio(vma, addr, pte); + if (!folio || folio_order(folio) !=3D order) + break; + if (!first) + first =3D folio; + else if (folio !=3D first) + break; + + present++; + if (pte_young(pte)) + young++; + } + + pte_unmap_unlock(ptep, 
ptl); + if (present !=3D nr_pages) + return false; + + return young >=3D threshold; +} + +static bool swapin_anon_accessed_neighbour(struct vm_fault *vmf, + unsigned int order) +{ + unsigned long size; + unsigned long base; + unsigned long fault_idx; + unsigned long max_skip; + + if (!vmf || !vmf->vma || !order || order > MAX_PAGE_ORDER) + return false; + + size =3D PAGE_SIZE << order; + base =3D ALIGN_DOWN(vmf->address, size); + + /* + * Without a sequential hint, require prior young-density evidence and + * only allow faults near the start of the candidate range. + */ + fault_idx =3D (vmf->address - base) >> PAGE_SHIFT; + max_skip =3D (1UL << order) >> SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT; + if (fault_idx > max_skip) + return false; + + return swapin_anon_prev_young_dense(vmf, order); +} + +static bool swapin_anon_fault_starts_range(struct vm_fault *vmf, + unsigned int order) +{ + struct vm_area_struct *vma; + unsigned long size; + unsigned long base; + unsigned long first; + + if (!vmf || !vmf->vma || !order || order > MAX_PAGE_ORDER) + return false; + + vma =3D vmf->vma; + size =3D PAGE_SIZE << order; + base =3D ALIGN_DOWN(vmf->address, size); + first =3D ALIGN(vma->vm_start, size); + + return base =3D=3D first && vmf->address =3D=3D base && + base + size <=3D vma->vm_end; +} + +static unsigned long swapin_anon_locality_orders(struct vm_fault *vmf, + unsigned long orders) +{ + struct vm_area_struct *vma =3D vmf ? 
vmf->vma : NULL; + unsigned long locality_orders =3D 0; + unsigned long candidates =3D orders & ~BIT(0); + int order; + + if (vma && (vma->vm_flags & VM_RAND_READ)) + return 0; + + if (vma && (vma->vm_flags & VM_SEQ_READ)) + return candidates; + + while (candidates) { + order =3D fls_long(candidates) - 1; + if (swapin_anon_fault_starts_range(vmf, order) || + swapin_anon_accessed_neighbour(vmf, order)) + locality_orders |=3D BIT(order); + candidates &=3D ~BIT(order); + } + + return locality_orders; +} + /* * Check if the PTEs within a range are contiguous swap entries. */ @@ -4644,9 +4824,9 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_= t *ptep, int nr_pages) if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) return false; /* - * swap_read_folio() can't handle the case a large folio is hybridly - * from different backends. And they are likely corner cases. Similar - * things might be added once zswap support large folios. + * swap_read_folio() can't do mixed-backend large folio IO. The common + * synchronous swapin path will recheck backend state and fall back to + * order-0 if a zswap/disk race makes the range mixed. */ if (swap_pte_batch(ptep, nr_pages, pte) !=3D nr_pages) return false; @@ -4693,14 +4873,6 @@ static unsigned long thp_swapin_suitable_orders(stru= ct vm_fault *vmf) if (unlikely(userfaultfd_armed(vma))) return 0; =20 - /* - * A large swapped out folio could be partially or fully in zswap. We - * lack handling for such cases, so fallback to swapping in order-0 - * folio. 
- */ - if (!zswap_never_enabled()) - return 0; - entry =3D softleaf_from_pte(vmf->orig_pte); /* * Get a list of all the (large) orders below PMD_ORDER that are enabled @@ -4708,10 +4880,13 @@ static unsigned long thp_swapin_suitable_orders(str= uct vm_fault *vmf) */ orders =3D thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER) - 1); + if (!orders) + return 0; orders =3D thp_vma_suitable_orders(vma, vmf->address, orders); + if (!orders) + return 0; orders =3D thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); - if (!orders) return 0; =20 @@ -4741,6 +4916,12 @@ static unsigned long thp_swapin_suitable_orders(stru= ct vm_fault *vmf) { return 0; } + +static unsigned long swapin_anon_locality_orders(struct vm_fault *vmf, + unsigned long orders) +{ + return 0; +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ =20 /* Sanity check that a folio is fully exclusive */ @@ -4777,6 +4958,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) unsigned long page_idx; unsigned long address; pte_t *ptep; + bool fault_only_young =3D false; =20 if (!pte_unmap_same(vmf)) goto out; @@ -4845,13 +5027,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (folio) swap_update_readahead(folio, vma, vmf->address); if (!folio) { - /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */ - if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) + /* + * Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices. + * The swap device is pinned while checking the flag, matching + * the existing fault path. 
+ */ + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { + unsigned long swapin_orders =3D thp_swapin_suitable_orders(vmf); + unsigned long locality_orders =3D + swapin_anon_locality_orders(vmf, swapin_orders); + folio =3D swapin_sync(entry, GFP_HIGHUSER_MOVABLE, - thp_swapin_suitable_orders(vmf) | BIT(0), - 0, vmf, NULL, 0); - else + swapin_orders | BIT(0), + locality_orders, vmf, NULL, 0); + } else { folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); + } =20 if (IS_ERR_OR_NULL(folio)) { /* @@ -5110,9 +5301,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) =20 VM_BUG_ON(!folio_test_anon(folio) || (pte_write(pte) && !PageAnonExclusive(page))); - set_ptes(vma->vm_mm, address, ptep, pte, nr_pages); - arch_do_swap_page_nr(vma->vm_mm, vma, address, - pte, pte, nr_pages); + if (folio =3D=3D swapcache && nr_pages =3D=3D folio_nr_pages(folio) && + arch_has_hw_pte_young()) + fault_only_young =3D swapin_fault_only_young(folio); + set_swapin_ptes(vma, address, ptep, pte, nr_pages, page_idx, + fault_only_young); + arch_do_swap_page_nr(vma->vm_mm, vma, address, pte, pte, nr_pages); =20 /* * Remove the swap entry and conditionally try to free up the swapcache. 
diff --git a/mm/swap.h b/mm/swap.h index dd35a310d06d..5d1c81ab49b9 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -327,6 +327,7 @@ struct folio *swapin_readahead(swp_entry_t entry, gfp_t= flag, struct folio *swapin_sync(swp_entry_t entry, gfp_t flag, unsigned long ord= ers, unsigned long locality_orders, struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx); +bool swapin_fault_only_young(struct folio *folio); void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr); =20 @@ -430,6 +431,11 @@ static inline void swap_update_readahead(struct folio = *folio, { } =20 +static inline bool swapin_fault_only_young(struct folio *folio) +{ + return false; +} + static inline int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug) { diff --git a/mm/swap_state.c b/mm/swap_state.c index 5a4ca289009a..80dff6a1ee65 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -747,6 +747,21 @@ static bool zswap_needs_order0_retry(struct folio *fol= io) ZSWAP_RANGE_MIXED; } =20 +/* + * A speculative large swapin may install PTEs for pages that did not faul= t. + * Keep those non-faulting PTEs old so a later anon fault can report + * PTE-young density as caller-provided locality evidence without storing + * state in the VMA. + */ +bool swapin_fault_only_young(struct folio *folio) +{ + if (!folio_test_large(folio) || !folio_test_swapcache(folio)) + return false; + + return zswap_probe_range(folio->swap, folio_nr_pages(folio)) =3D=3D + ZSWAP_RANGE_ALL_ZSWAP; +} + /* * If we are the only user, then try to free up the swap cache. 
* --=20 2.34.1
From nobody Mon Jun 8 12:13:56 2026 From: fujunjie To: Andrew Morton , linux-mm@kvack.org, Alexandre Ghiti , Kairui Song , Usama Arif Cc: Chris Li , Johannes Weiner , Yosry Ahmed , Nhat Pham , David Hildenbrand , Hugh Dickins , Roman Gushchin , Shakeel Butt , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org Subject: [RFC PATCH v2 7/9] mm/shmem: provide VMA-hint locality for zswap large swapin Date: Fri, 29 May 2026 12:19:26 +0000 X-OQ-MSGID: <20260529121928.4115683-7-fujunjie1@qq.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: 
List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Let the shmem swap fault path pass locality evidence into the common zswap large-swapin policy. Shmem does not have anon PTE-young density evidence, so this first step only treats explicit VM_SEQ_READ as positive evidence and VM_RAND_READ as a veto. The non-fault shmem readahead path remains unchanged. This keeps large zswap swapin limited to synchronous shmem faults where the caller supplies a VMA and the common policy can still fall back to order-0. Signed-off-by: fujunjie --- mm/shmem.c | 42 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 37 insertions(+), 5 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index fa99b48ed62b..a5ac35ac85fb 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -1791,6 +1792,29 @@ static struct folio *shmem_swapin_cluster(swp_entry_= t swap, gfp_t gfp, return folio; } =20 +static unsigned long shmem_swapin_locality_orders(struct vm_fault *vmf, + unsigned long orders) +{ + struct vm_area_struct *vma =3D vmf ? vmf->vma : NULL; + unsigned long candidates =3D orders & ~BIT(0); + + /* + * Shmem does not have anon-style PTE young density evidence. Start with + * explicit VMA access hints; future shmem/page-cache readahead evidence + * can be folded into this producer without changing common swapin policy. 
+ */ + if (!vma) + return 0; + + if (vma->vm_flags & VM_RAND_READ) + return 0; + + if (vma->vm_flags & VM_SEQ_READ) + return candidates; + + return 0; +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE bool shmem_hpage_pmd_enabled(void) { @@ -2020,18 +2044,22 @@ static struct folio *shmem_swap_alloc_folio(struct = inode *inode, struct vm_fault *vmf, pgoff_t index, swp_entry_t entry, int order, gfp_t gfp) { + unsigned long locality_orders; + unsigned long orders; pgoff_t ilx; struct folio *folio; struct mempolicy *mpol; struct shmem_inode_info *info =3D SHMEM_I(inode); =20 - if ((vmf && unlikely(userfaultfd_armed(vmf->vma))) || - !zswap_never_enabled()) + if (vmf && unlikely(userfaultfd_armed(vmf->vma))) order =3D 0; =20 again: + orders =3D BIT(order); + locality_orders =3D shmem_swapin_locality_orders(vmf, orders); mpol =3D shmem_get_pgoff_policy(info, index, order, &ilx); - folio =3D swapin_sync(entry, gfp, BIT(order), 0, vmf, mpol, ilx); + folio =3D swapin_sync(entry, gfp, orders, locality_orders, vmf, mpol, + ilx); mpol_cond_put(mpol); =20 if (!IS_ERR(folio)) @@ -2339,7 +2367,7 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, if (!folio_matches_swap_entry(folio, swap) || shmem_confirm_swap(mapping, index, swap) < 0) { error =3D -EEXIST; - goto unlock; + goto failed_swapcache; } if (!folio_test_uptodate(folio)) { error =3D -EIO; @@ -2369,6 +2397,8 @@ static int shmem_swapin_folio(struct inode *inode, pg= off_t index, if (sgp =3D=3D SGP_WRITE) folio_mark_accessed(folio); =20 + if (folio_test_large(folio)) + memcg1_swapin(folio); folio_put_swap(folio, NULL); swap_cache_del_folio(folio); folio_mark_dirty(folio); @@ -2379,9 +2409,11 @@ static int shmem_swapin_folio(struct inode *inode, p= goff_t index, failed: if (shmem_confirm_swap(mapping, index, swap) < 0) error =3D -EEXIST; +failed_swapcache: + if (folio && folio_test_large(folio) && folio_test_swapcache(folio)) + memcg1_swapin(folio); if (error =3D=3D -EIO) shmem_set_folio_swapin_error(inode, index, 
folio, swap); -unlock: if (folio) folio_unlock(folio); failed_nolock: --=20 2.34.1
From nobody Mon Jun 8 12:13:56 2026 From: fujunjie To: Andrew Morton , linux-mm@kvack.org, Alexandre Ghiti , Kairui Song , Usama Arif Cc: Chris Li , Johannes Weiner , Yosry Ahmed , Nhat Pham , David Hildenbrand , Hugh Dickins , Roman Gushchin , Shakeel Butt , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org Subject: [RFC PATCH v2 8/9] mm: try all-zswap large swapin within swap readahead windows Date: Fri, 29 May 2026 12:19:27 +0000 X-OQ-MSGID: <20260529121928.4115683-8-fujunjie1@qq.com> X-Mailer: git-send-email 2.34.1 
In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The non-synchronous swap fault path already computes either a VMA-based or cluster-based readahead window. Use that existing window as locality evidence for zswap-backed large swapin instead of mixing it with the synchronous anon/shmem evidence. The path first prepares the normal readahead window. If the faulting aligned range is fully covered by that window and is still all-zswap, it may be loaded as one large folio. If the large attempt fails or a backend race is detected, the precomputed order-0 readahead window is used without updating readahead state again. Mixed zswap/disk ranges remain order-0 only. Disk-backed large swapin is not added by this change. Signed-off-by: fujunjie --- mm/memory.c | 6 +- mm/swap.h | 4 +- mm/swap_state.c | 434 +++++++++++++++++++++++++++++++++++++++--------- mm/swapfile.c | 2 +- 4 files changed, 360 insertions(+), 86 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 7bbb89632000..451375090d83 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5027,13 +5027,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (folio) swap_update_readahead(folio, vma, vmf->address); if (!folio) { + unsigned long swapin_orders =3D thp_swapin_suitable_orders(vmf); + /* * Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices. * The swap device is pinned while checking the flag, matching * the existing fault path. 
*/ if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { - unsigned long swapin_orders =3D thp_swapin_suitable_orders(vmf); unsigned long locality_orders =3D swapin_anon_locality_orders(vmf, swapin_orders); =20 @@ -5041,7 +5042,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) swapin_orders | BIT(0), locality_orders, vmf, NULL, 0); } else { - folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); + folio =3D swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, + swapin_orders, vmf); } =20 if (IS_ERR_OR_NULL(folio)) { diff --git a/mm/swap.h b/mm/swap.h index 5d1c81ab49b9..0e1bf9218b5e 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -323,7 +323,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, = gfp_t gfp_mask, struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, struct mempolicy *mpol, pgoff_t ilx); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, - struct vm_fault *vmf); + unsigned long orders, struct vm_fault *vmf); struct folio *swapin_sync(swp_entry_t entry, gfp_t flag, unsigned long ord= ers, unsigned long locality_orders, struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx); @@ -413,7 +413,7 @@ static inline struct folio *swap_cluster_readahead(swp_= entry_t entry, } =20 static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_ma= sk, - struct vm_fault *vmf) + unsigned long orders, struct vm_fault *vmf) { return NULL; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 80dff6a1ee65..4f1eb0a7f9f5 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -678,20 +678,24 @@ static bool swapin_zswap_admit(swp_entry_t entry, static unsigned long swapin_admit_orders(swp_entry_t entry, unsigned long orders, struct vm_fault *vmf, - unsigned long locality_orders) + unsigned long locality_orders, + bool zswap_only) { unsigned long candidates =3D orders & ~BIT(0); - unsigned long admitted =3D orders & BIT(0); + unsigned long admitted =3D zswap_only ? 
0 : orders & BIT(0); + enum zswap_range_state fault_zswap_state =3D ZSWAP_RANGE_NEVER_ENABLED; struct zswap_admit_ctx zswap_ctx =3D {}; + bool fault_zswap_checked =3D false; int order; =20 if (!candidates) - return orders; + return zswap_only ? 0 : orders; =20 while (candidates) { enum zswap_range_state state; unsigned int nr_pages; swp_entry_t range_entry; + bool zswap_locality; bool admit =3D false; =20 order =3D fls_long(candidates) - 1; @@ -703,6 +707,29 @@ static unsigned long swapin_admit_orders(swp_entry_t e= ntry, nr_pages =3D 1U << order; range_entry =3D swp_entry(swp_type(entry), round_down(swp_offset(entry), nr_pages)); + zswap_locality =3D order <=3D SWAPIN_ZSWAP_MAX_ORDER && + swapin_zswap_locality(vmf, order, + locality_orders); + /* + * If the faulting slot is already in zswap but this order has + * no zswap locality evidence, a larger range covering the fault + * cannot be admitted: it is either all-zswap or mixed, and both + * require zswap locality. Avoid scanning the whole range on + * sparse/random zswap refaults. If the faulting slot is not in + * zswap, keep the full classification so all-disk large swapin + * can follow the existing policy. + */ + if (!zswap_locality) { + if (zswap_only) + goto next; + if (!fault_zswap_checked) { + fault_zswap_state =3D zswap_probe_range(entry, 1); + fault_zswap_checked =3D true; + } + if (fault_zswap_state =3D=3D ZSWAP_RANGE_ALL_ZSWAP) + goto next; + } + if (!swapin_zeromap_same(range_entry, nr_pages)) goto next; =20 @@ -718,7 +745,7 @@ static unsigned long swapin_admit_orders(swp_entry_t en= try, break; case ZSWAP_RANGE_NEVER_ENABLED: case ZSWAP_RANGE_NO_ZSWAP: - admit =3D true; + admit =3D !zswap_only; break; } =20 @@ -730,21 +757,32 @@ static unsigned long swapin_admit_orders(swp_entry_t = entry, candidates &=3D ~BIT(order); } =20 - return admitted ? admitted : BIT(0); + return admitted ? admitted : (zswap_only ? 
0 : BIT(0)); } =20 -static bool zswap_needs_order0_retry(struct folio *folio) +static bool zswap_folio_all_zswap(struct folio *folio) { + return zswap_probe_range(folio->swap, folio_nr_pages(folio)) =3D=3D + ZSWAP_RANGE_ALL_ZSWAP; +} + +static bool zswap_needs_fallback(struct folio *folio, bool zswap_only) +{ + enum zswap_range_state state; + if (!folio_test_large(folio)) return false; =20 + state =3D zswap_probe_range(folio->swap, folio_nr_pages(folio)); + if (zswap_only) + return state !=3D ZSWAP_RANGE_ALL_ZSWAP; + /* * Admission sees only an advisory zswap snapshot. Recheck after the * large swapcache folio is installed; if the range became mixed, drop * the fresh folio before IO and let order-0 handle each slot. */ - return zswap_probe_range(folio->swap, folio_nr_pages(folio)) =3D=3D - ZSWAP_RANGE_MIXED; + return state =3D=3D ZSWAP_RANGE_MIXED; } =20 /* @@ -758,8 +796,7 @@ bool swapin_fault_only_young(struct folio *folio) if (!folio_test_large(folio) || !folio_test_swapcache(folio)) return false; =20 - return zswap_probe_range(folio->swap, folio_nr_pages(folio)) =3D=3D - ZSWAP_RANGE_ALL_ZSWAP; + return zswap_folio_all_zswap(folio); } =20 /* @@ -893,34 +930,15 @@ static struct folio *swap_cache_read_folio(swp_entry_= t entry, gfp_t gfp, return folio; } =20 -/** - * swapin_sync - swap-in one or multiple entries skipping readahead. - * @entry: swap entry indicating the target slot - * @gfp: memory allocation flags - * @orders: allocation orders - * @locality_orders: orders with caller-provided locality evidence - * @vmf: fault information - * @mpol: NUMA memory allocation policy to be applied - * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE - * - * This allocates a folio suitable for given @orders, or returns the - * existing folio in the swap cache for @entry. This initiates the IO, too, - * if needed. @entry is rounded down if @orders allow large allocation. 
- * - * Context: Caller must ensure @entry is valid and pin the swap device with - * refcount. - * Return: Returns the folio on success, error code if failed. - */ -struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, - unsigned long orders, - unsigned long locality_orders, - struct vm_fault *vmf, struct mempolicy *mpol, - pgoff_t ilx) +static struct folio *swapin_alloc_read(swp_entry_t entry, gfp_t gfp, + unsigned long orders, + struct vm_fault *vmf, + struct mempolicy *mpol, pgoff_t ilx, + bool retry_order0, bool zswap_only) { struct folio *folio; int ret; =20 - orders =3D swapin_admit_orders(entry, orders, vmf, locality_orders); again: do { folio =3D swap_cache_get_folio(entry); @@ -931,19 +949,21 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gf= p, } while (PTR_ERR(folio) =3D=3D -EEXIST); =20 if (IS_ERR(folio)) - return folio; + return retry_order0 ? folio : NULL; =20 - if (zswap_needs_order0_retry(folio)) { + if (zswap_needs_fallback(folio, zswap_only)) { count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN_FALLBACK); /* * The folio is newly allocated, locked, clean and not uptodate; * no data has been read into it. Removing it only restores the - * swap table entries so order-0 swapin can resolve a backend + * swap table entries so the fallback path can resolve a backend * race without attempting speculative large-folio zswapin. */ swap_cache_del_folio(folio); folio_unlock(folio); folio_put(folio); + if (!retry_order0) + return NULL; orders =3D BIT(0); goto again; } @@ -954,12 +974,62 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gf= p, swap_cache_del_folio(folio); folio_unlock(folio); folio_put(folio); + if (!retry_order0) + return NULL; orders =3D BIT(0); goto again; } return folio; } =20 +/** + * swapin_sync - swap-in one or multiple entries skipping readahead. 
+ * @entry: swap entry indicating the target slot + * @gfp: memory allocation flags + * @orders: allocation orders + * @locality_orders: orders with caller-provided locality evidence + * @vmf: fault information + * @mpol: NUMA memory allocation policy to be applied + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE + * + * This allocates a folio suitable for given @orders, or returns the + * existing folio in the swap cache for @entry. This initiates the IO, too, + * if needed. @entry is rounded down if @orders allow large allocation. + * + * Context: Caller must ensure @entry is valid and pin the swap device with + * refcount. + * Return: Returns the folio on success, error code if failed. + */ +struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, + unsigned long orders, + unsigned long locality_orders, + struct vm_fault *vmf, struct mempolicy *mpol, + pgoff_t ilx) +{ + orders =3D swapin_admit_orders(entry, orders, vmf, + locality_orders, false); + return swapin_alloc_read(entry, gfp, orders, vmf, mpol, ilx, + true, false); +} + +static struct folio *swapin_zswap_large(swp_entry_t entry, gfp_t gfp, + unsigned long orders, + unsigned long locality_orders, + struct vm_fault *vmf, + struct mempolicy *mpol, pgoff_t ilx) +{ + if (READ_ONCE(page_cluster) <=3D 0) + return NULL; + + orders =3D swapin_admit_orders(entry, orders, vmf, + locality_orders, true); + if (!orders) + return NULL; + + return swapin_alloc_read(entry, gfp, orders, vmf, mpol, ilx, + false, true); +} + /* * Locate a page of swap in physical memory, reserving swap cache space * and reading the disk if it is not already cached. 
@@ -1048,12 +1118,88 @@ static unsigned long swapin_nr_pages(unsigned long = offset) return pages; } =20 +struct swap_cluster_ra { + unsigned long start_offset; + unsigned long end_offset; + bool readahead; +}; + +static void swap_cluster_ra_prepare(swp_entry_t entry, + struct swap_cluster_ra *ra) +{ + struct swap_info_struct *si =3D __swap_entry_to_info(entry); + unsigned long entry_offset =3D swp_offset(entry); + unsigned long mask; + + mask =3D swapin_nr_pages(entry_offset) - 1; + ra->readahead =3D !!mask; + ra->start_offset =3D entry_offset; + ra->end_offset =3D entry_offset; + if (!mask) + return; + + /* Read a page_cluster sized and aligned cluster around offset. */ + ra->start_offset =3D entry_offset & ~mask; + ra->end_offset =3D entry_offset | mask; + if (!ra->start_offset) /* First page is swap header. */ + ra->start_offset++; + if (ra->end_offset >=3D si->max) + ra->end_offset =3D si->max - 1; +} + +static unsigned long swap_cluster_ra_orders(swp_entry_t entry, + unsigned long orders, + const struct swap_cluster_ra *ra) +{ + unsigned long admitted =3D 0; + unsigned long candidates =3D orders & ~BIT(0); + unsigned long entry_offset =3D swp_offset(entry); + int order; + + if (!ra->readahead) + return 0; + + while (candidates) { + unsigned long nr_pages; + unsigned long start_offset; + unsigned long end_offset; + + order =3D fls_long(candidates) - 1; + if (order > MAX_PAGE_ORDER) { + candidates &=3D ~BIT(order); + continue; + } + + nr_pages =3D 1UL << order; + start_offset =3D round_down(entry_offset, nr_pages); + end_offset =3D start_offset + nr_pages - 1; + if (start_offset >=3D ra->start_offset && + end_offset <=3D ra->end_offset) + admitted |=3D BIT(order); + candidates &=3D ~BIT(order); + } + + return admitted; +} + +static bool swapin_readahead_skip(unsigned long index, + unsigned long skip_start, + unsigned long skip_end) +{ + return skip_start < skip_end && + index >=3D skip_start && index < skip_end; +} + /** - * swap_cluster_readahead - swap in 
pages in hope we need them soon + * swap_cluster_readahead_win - swap in pages from a prepared swap window * @entry: swap entry of this memory * @gfp_mask: memory allocation flags * @mpol: NUMA memory allocation policy to be applied * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE + * @ra: readahead window prepared by swap_cluster_ra_prepare() + * @skip_start: first offset already covered by @target_folio + * @skip_end: offset after the already covered range + * @target_folio: target folio to return after queueing the rest of the wi= ndow * * Returns the struct folio for entry and addr, after queueing swapin. * @@ -1066,33 +1212,38 @@ static unsigned long swapin_nr_pages(unsigned long = offset) * are used for every page of the readahead: neighbouring pages on swap * are fairly likely to have been swapped out from the same node. */ -struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, - struct mempolicy *mpol, pgoff_t ilx) +static struct folio *swap_cluster_readahead_win(swp_entry_t entry, + gfp_t gfp_mask, + struct mempolicy *mpol, + pgoff_t ilx, + const struct swap_cluster_ra *ra, + unsigned long skip_start, + unsigned long skip_end, + struct folio *target_folio) { struct folio *folio; unsigned long entry_offset =3D swp_offset(entry); - unsigned long offset =3D entry_offset; - unsigned long start_offset, end_offset; - unsigned long mask; - struct swap_info_struct *si =3D __swap_entry_to_info(entry); + unsigned long offset; struct blk_plug plug; struct swap_iocb *splug =3D NULL; swp_entry_t ra_entry; =20 - mask =3D swapin_nr_pages(offset) - 1; - if (!mask) + if (!ra->readahead) goto skip; =20 - /* Read a page_cluster sized and aligned cluster around offset. */ - start_offset =3D offset & ~mask; - end_offset =3D offset | mask; - if (!start_offset) /* First page is swap header. 
*/ - start_offset++; - if (end_offset >=3D si->max) - end_offset =3D si->max - 1; + if (target_folio && + skip_start <=3D ra->start_offset && skip_end > ra->end_offset) + goto skip; =20 blk_start_plug(&plug); - for (offset =3D start_offset; offset <=3D end_offset ; offset++) { + for (offset =3D ra->start_offset; offset <=3D ra->end_offset; offset++) { + if (swapin_readahead_skip(offset, skip_start, skip_end)) { + if (skip_end > ra->end_offset) + break; + offset =3D skip_end - 1; + continue; + } + /* Ok, do the async read-ahead now */ ra_entry =3D swp_entry(swp_type(entry), offset); folio =3D swap_cache_read_folio(ra_entry, gfp_mask, mpol, ilx, @@ -1105,10 +1256,29 @@ struct folio *swap_cluster_readahead(swp_entry_t en= try, gfp_t gfp_mask, swap_read_unplug(splug); lru_add_drain(); /* Push any new pages onto the LRU now */ skip: + if (target_folio) + return target_folio; + /* The page was likely read above, so no need for plugging here */ return swap_cache_read_folio(entry, gfp_mask, mpol, ilx, NULL, false); } =20 +struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, + struct mempolicy *mpol, pgoff_t ilx) +{ + struct swap_cluster_ra ra; + + swap_cluster_ra_prepare(entry, &ra); + return swap_cluster_readahead_win(entry, gfp_mask, mpol, ilx, &ra, + 0, 0, NULL); +} + +struct swap_vma_ra { + unsigned long start; + unsigned long end; + int win; +}; + static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start, unsigned long *end) { @@ -1147,35 +1317,69 @@ static int swap_vma_ra_win(struct vm_fault *vmf, un= signed long *start, return win; } =20 -/** - * swap_vma_readahead - swap in pages in hope we need them soon - * @targ_entry: swap entry of the targeted memory - * @gfp_mask: memory allocation flags - * @mpol: NUMA memory allocation policy to be applied - * @targ_ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE - * @vmf: fault information - * - * Returns the struct folio for entry and addr, after queueing swapin. 
- * - * Primitive swap readahead code. We simply read in a few pages whose - * virtual addresses are around the fault address in the same vma. - * - * Caller must hold read mmap_lock if vmf->vma is not NULL. - * +static unsigned long swap_vma_ra_orders(struct vm_fault *vmf, + unsigned long orders, + const struct swap_vma_ra *ra) +{ + unsigned long admitted =3D 0; + unsigned long candidates =3D orders & ~BIT(0); + int order; + + if (ra->win <=3D 1) + return 0; + + while (candidates) { + unsigned long size; + unsigned long start; + unsigned long end; + + order =3D fls_long(candidates) - 1; + if (order > MAX_PAGE_ORDER) { + candidates &=3D ~BIT(order); + continue; + } + + size =3D PAGE_SIZE << order; + start =3D ALIGN_DOWN(vmf->address, size); + end =3D start + size; + if (start >=3D ra->start && end <=3D ra->end) + admitted |=3D BIT(order); + candidates &=3D ~BIT(order); + } + + return admitted; +} + +/* + * Queue swapin for a precomputed VMA readahead window. The window has alr= eady + * been accounted in vma->swap_readahead_info, so fallback after a failed + * zswap-large attempt does not update readahead state a second time. If + * @target_folio is already populated, queue only the part of the window o= utside + * [@skip_start, @skip_end) and return @target_folio. 
  */
-static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
-		struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf)
+static struct folio *swap_vma_readahead_win(swp_entry_t targ_entry,
+					    gfp_t gfp_mask,
+					    struct mempolicy *mpol,
+					    pgoff_t targ_ilx,
+					    struct vm_fault *vmf,
+					    const struct swap_vma_ra *ra,
+					    unsigned long skip_start,
+					    unsigned long skip_end,
+					    struct folio *target_folio)
 {
 	struct blk_plug plug;
 	struct swap_iocb *splug = NULL;
 	struct folio *folio;
 	pte_t *pte = NULL, pentry;
-	int win;
 	unsigned long start, end, addr;
 	pgoff_t ilx = targ_ilx;
 
-	win = swap_vma_ra_win(vmf, &start, &end);
-	if (win == 1)
+	if (ra->win <= 1)
+		goto skip;
+
+	start = ra->start;
+	end = ra->end;
+	if (target_folio && skip_start <= start && skip_end >= end)
 		goto skip;
 
 	ilx = targ_ilx - PFN_DOWN(vmf->address - start);
@@ -1185,6 +1389,18 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 		struct swap_info_struct *si = NULL;
 		softleaf_t entry;
 
+		if (swapin_readahead_skip(addr, skip_start, skip_end)) {
+			unsigned long next = min(skip_end, end);
+
+			if (pte) {
+				pte_unmap(pte);
+				pte = NULL;
+			}
+			ilx += PFN_DOWN(next - addr) - 1;
+			addr = next - PAGE_SIZE;
+			continue;
+		}
+
 		if (!pte++) {
 			pte = pte_offset_map(vmf->pmd, addr);
 			if (!pte)
@@ -1220,6 +1436,9 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	swap_read_unplug(splug);
 	lru_add_drain();
 skip:
+	if (target_folio)
+		return target_folio;
+
 	/* The folio was likely read above, so no need for plugging here */
 	folio = swap_cache_read_folio(targ_entry, gfp_mask, mpol, targ_ilx,
 				      NULL, false);
@@ -1230,25 +1449,78 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
+ * @orders: large folio orders suitable for the faulting entry
  * @vmf: fault information
  *
  * Returns the struct folio for entry and addr, after queueing swapin.
  *
- * It's a main entry function for swap readahead. By the configuration,
- * it will read ahead blocks by cluster-based(ie, physical disk based)
- * or vma-based(ie, virtual address based on faulty address) readahead.
+ * This first computes the normal VMA or cluster readahead window. If the
+ * window fully covers an aligned all-zswap range containing the fault, that
+ * range may be swapped in as one large folio. The remaining window is still
+ * queued through the original order-0 readahead path, skipping the already
+ * covered target range and without updating readahead state a second time.
  */
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				struct vm_fault *vmf)
+				unsigned long orders, struct vm_fault *vmf)
 {
 	struct mempolicy *mpol;
 	pgoff_t ilx;
 	struct folio *folio;
+	unsigned long ra_orders;
+	bool vma_ra;
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-	folio = swap_use_vma_readahead() ?
-		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
-		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+	vma_ra = swap_use_vma_readahead();
+	if (vma_ra) {
+		struct swap_vma_ra ra = {};
+		unsigned long skip_start = 0;
+		unsigned long skip_end = 0;
+
+		ra.win = swap_vma_ra_win(vmf, &ra.start, &ra.end);
+		ra_orders = swap_vma_ra_orders(vmf, orders, &ra);
+		if (ra_orders) {
+			folio = swapin_zswap_large(entry, gfp_mask, ra_orders,
+						   ra_orders, vmf, mpol, ilx);
+			if (folio) {
+				skip_start = ALIGN_DOWN(vmf->address,
+							folio_size(folio));
+				skip_end = skip_start + folio_size(folio);
+				folio = swap_vma_readahead_win(entry, gfp_mask,
+							       mpol, ilx, vmf,
+							       &ra, skip_start,
+							       skip_end, folio);
+				goto out;
+			}
+		}
+		folio = swap_vma_readahead_win(entry, gfp_mask, mpol, ilx,
+					       vmf, &ra, 0, 0, NULL);
+	} else {
+		struct swap_cluster_ra ra;
+		unsigned long skip_start = 0;
+		unsigned long skip_end = 0;
+
+		swap_cluster_ra_prepare(entry, &ra);
+		ra_orders = swap_cluster_ra_orders(entry, orders, &ra);
+		if (ra_orders) {
+			folio = swapin_zswap_large(entry, gfp_mask, ra_orders,
+						   ra_orders, vmf, mpol, ilx);
+			if (folio) {
+				skip_start = swp_offset(folio->swap);
+				skip_end = skip_start + folio_nr_pages(folio);
+				folio = swap_cluster_readahead_win(entry,
+								   gfp_mask,
+								   mpol, ilx,
+								   &ra,
+								   skip_start,
+								   skip_end,
+								   folio);
+				goto out;
+			}
+		}
+		folio = swap_cluster_readahead_win(entry, gfp_mask, mpol, ilx,
+						   &ra, 0, 0, NULL);
+	}
+out:
 	mpol_cond_put(mpol);
 
 	return folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 615d90867111..3b7e7d8ae89d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2452,7 +2452,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			};
 
 			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						&vmf);
+						0, &vmf);
 		}
 		if (!folio) {
 			swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
-- 
2.34.1

From nobody Mon Jun 8 12:13:56 2026
Received: from xmbghk7.mail.qq.com (xmbghk7.mail.qq.com
	[43.163.128.47])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 070813CEBA7;
	Fri, 29 May 2026 12:19:54 +0000 (UTC)
Received: from node68.. ([166.111.236.25])
	by newxmesmtplogicsvrsza73-0.qq.com (NewEsmtp) with SMTP
	id 4DC10017; Fri, 29 May 2026 20:19:28 +0800
From: fujunjie
To: Andrew Morton, linux-mm@kvack.org, Alexandre Ghiti, Kairui Song,
	Usama Arif
Cc: Chris Li, Johannes Weiner, Yosry Ahmed, Nhat Pham, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [RFC PATCH v2 9/9] docs: mm: update THP swapin counter descriptions
Date: Fri, 29 May 2026 12:19:28 +0000
X-OQ-MSGID: <20260529121928.4115683-9-fujunjie1@qq.com>
X-Mailer: git-send-email 2.34.1
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The THP swapin counter descriptions still describe large swapin as
coming only from non-zswap swap devices. Update them now that
zswap-backed large folio swapin can also increment swpin.

Also describe policy and backend rejection as swpin_fallback cases,
since speculative zswap large swapin can intentionally fall back before
doing large IO.

Signed-off-by: fujunjie
---
 Documentation/admin-guide/mm/transhuge.rst | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 23f8d13c2629..59b7a0d09243 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -667,13 +667,14 @@ zswpout
 	piece without splitting.
 
 swpin
-	is incremented every time a huge page is swapped in from a non-zswap
-	swap device in one piece.
+	is incremented every time a huge page is swapped in from swap or
+	zswap in one piece.
 
 swpin_fallback
-	is incremented if swapin fails to allocate or charge a huge page
-	and instead falls back to using huge pages with lower orders or
-	small pages.
+	is incremented if swapin cannot use a huge page and instead falls
+	back to using huge pages with lower orders or small pages. This can
+	happen because allocation or charging fails, or because policy or
+	backend state rejects a speculative large swapin.
 
 swpin_fallback_charge
 	is incremented if swapin fails to charge a huge page and instead
-- 
2.34.1
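
The "policy" rejection that the swpin_fallback text refers to includes the
order-admission rule applied by swap_vma_ra_orders() in the swapin_readahead()
changes earlier in this series: a large order is only tried when its naturally
aligned range around the fault address lies entirely inside the readahead
window. A minimal userspace sketch of that rule follows. It is not kernel code.
admit_orders(), struct ra_window, SKETCH_PAGE_SHIFT (assuming 4 KiB base pages)
and SKETCH_MAX_ORDER (an assumed stand-in for MAX_PAGE_ORDER) are hypothetical
names introduced for illustration only. The sketch walks orders from low to
high rather than using fls_long(), which should yield the same bitmask.

```c
/* Hypothetical userspace model; none of these names exist in the kernel. */
#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */
#define SKETCH_MAX_ORDER  10	/* assumption: stand-in for MAX_PAGE_ORDER */

struct ra_window {
	unsigned long start;	/* first byte of the window (inclusive) */
	unsigned long end;	/* one past the last byte (exclusive) */
	int win;		/* window size in pages */
};

/*
 * Return the subset of @orders (bit N set means order N is a candidate)
 * whose naturally aligned range containing @addr fits inside @ra.
 * Order 0 is never reported, and a window of one page admits nothing.
 */
static unsigned long admit_orders(unsigned long addr, unsigned long orders,
				  const struct ra_window *ra)
{
	unsigned long admitted = 0;
	int order;

	if (ra->win <= 1)
		return 0;

	for (order = 1; order <= SKETCH_MAX_ORDER; order++) {
		unsigned long size, start;

		if (!(orders & (1UL << order)))
			continue;

		size = 1UL << (SKETCH_PAGE_SHIFT + order);
		start = addr & ~(size - 1);	/* align down to the order size */
		if (start >= ra->start && start + size <= ra->end)
			admitted |= 1UL << order;
	}

	return admitted;
}
```

With a window of [0x10000, 0x20000) and a fault at 0x13000, orders 1 through 4
are admitted. Order 5 is rejected because its aligned range would start at 0,
below the window. A fault whose window does not admit any order would take the
order-0 path, which is one way a fallback can happen without any allocation
failure.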