From nobody Sun Oct 5 09:04:13 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, "Huang, Ying", linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 1/3] mm, swap: only scan one cluster in fragment list
Date: Thu, 7 Aug 2025 00:17:46 +0800
Message-ID: <20250806161748.76651-2-ryncsn@gmail.com>
In-Reply-To: <20250806161748.76651-1-ryncsn@gmail.com>
References: <20250806161748.76651-1-ryncsn@gmail.com>

From: Kairui Song

Fragment clusters were mostly failing high order allocation already. The
reason we still scan through them is that a swap slot may get freed without
releasing the swap cache, so a swap map entry ends up in HAS_CACHE-only
status, and the cluster is never moved back to the non-full or free cluster
list. This can cause a higher allocation failure rate.

Usually only !SWP_SYNCHRONOUS_IO devices have a large number of slots stuck
in HAS_CACHE-only status, because when a !SWP_SYNCHRONOUS_IO device's usage
is low (!vm_swap_full()), swap only lazily frees the swap cache.

But scanning the whole fragment list for this is overkill. Fragmentation is
only an issue for the allocator when the device is getting full, and by that
time swap is already releasing the swap cache aggressively. Scanning one
fragment cluster at a time is good enough to reclaim already pinned slots
and move the cluster back to nonfull.

Besides, only high order allocations need to iterate over the list; order 0
allocations succeed on the first attempt, and a high order allocation
failure isn't a serious problem.

So the benefit of iterating over the fragment clusters is trivial, but it
can slow down large allocations by a lot when the fragment cluster list is
long. It's better to drop this fragment cluster iteration design.
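[Illustration only, not part of the patch: the control-flow change can be
sketched as a small userspace model. The struct and helper names below are
invented for the sketch; the real logic lives in cluster_alloc_swap_entry()
in mm/swapfile.c.]

    #include <stdbool.h>
    #include <stddef.h>

    /* Toy model of the per-order fragment cluster list. */
    struct frag_cluster {
            struct frag_cluster *next;
            bool has_free_slot;     /* can serve the requested order */
    };

    /*
     * Old behavior: walk up to "nr" pre-existing fragment clusters, which is
     * O(list length) on every failed high order allocation attempt.
     */
    static struct frag_cluster *scan_all(struct frag_cluster **head, int nr)
    {
            while (nr-- > 0 && *head) {
                    struct frag_cluster *ci = *head;

                    /* Take the cluster off the front; the kernel re-links a
                     * failed cluster to the list tail, rotating the list. */
                    *head = ci->next;
                    if (ci->has_free_slot)
                            return ci;
            }
            return NULL;
    }

    /*
     * New behavior: look at exactly one fragment cluster per attempt. The
     * list still rotates, so HAS_CACHE-only slots still get reclaimed over
     * time, but a long fragment list no longer slows large allocations.
     */
    static struct frag_cluster *scan_one(struct frag_cluster **head)
    {
            struct frag_cluster *ci = *head;

            if (!ci)
                    return NULL;
            *head = ci->next;
            return ci->has_free_slot ? ci : NULL;
    }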
Test on a 48c96t system: build the linux kernel with make -j48 using 10G
ZRAM, defconfig, a 768M cgroup memory limit, on top of tmpfs, 4K folio only:

Before: sys time: 4432.56s
After:  sys time: 4430.18s

Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:

Before: sys time: 11609.69s  64kB/swpout: 1787051  64kB/swpout_fallback: 20917
After:  sys time:  5572.85s  64kB/swpout: 1797612  64kB/swpout_fallback: 19254

Change to 8G ZRAM:

Before: sys time: 21524.35s  64kB/swpout: 1687142  64kB/swpout_fallback: 128496
After:  sys time:  6278.45s  64kB/swpout: 1679127  64kB/swpout_fallback: 130942

Change to use a 10G brd device with the SWP_SYNCHRONOUS_IO flag removed:

Before: sys time: 7393.50s  64kB/swpout: 1788246  swpout_fallback: 0
After:  sys time: 7399.88s  64kB/swpout: 1784257  swpout_fallback: 0

Change to use an 8G brd device with the SWP_SYNCHRONOUS_IO flag removed:

Before: sys time: 26292.26s  64kB/swpout: 1645236  swpout_fallback: 138945
After:  sys time:  9463.16s  64kB/swpout: 1581376  swpout_fallback: 259979

Performance is a lot better for large folios, and the large order allocation
failure rate is only very slightly higher or unchanged, even for
!SWP_SYNCHRONOUS_IO devices under high pressure.

Signed-off-by: Kairui Song
Acked-by: Nhat Pham
Acked-by: Chris Li
---
 mm/swapfile.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index b4f3cc712580..1f1110e37f68 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -926,32 +926,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		swap_reclaim_full_clusters(si, false);
 
 	if (order < PMD_ORDER) {
-		unsigned int frags = 0, frags_existing;
-
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			/* Clusters failed to allocate are moved to frag_clusters */
-			frags++;
 		}
 
-		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
-		while (frags < frags_existing &&
-		       (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) {
-			atomic_long_dec(&si->frag_cluster_nr[order]);
-			/*
-			 * Rotate the frag list to iterate, they were all
-			 * failing high order allocation or moved here due to
-			 * per-CPU usage, but they could contain newly released
-			 * reclaimable (eg. lazy-freed swap cache) slots.
-			 */
+		/*
+		 * Scan only one fragment cluster is good enough. Order 0
+		 * allocation will surely success, and large allocation
+		 * failure is not critical. Scanning one cluster still
+		 * keeps the list rotated and reclaimed (for HAS_CACHE).
+		 */
+		ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
+		if (ci) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			frags++;
 		}
 	}
 
--
2.50.1

From nobody Sun Oct 5 09:04:13 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, "Huang, Ying", linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 2/3] mm, swap: remove fragment clusters counter
Date: Thu, 7 Aug 2025 00:17:47 +0800
Message-ID: <20250806161748.76651-3-ryncsn@gmail.com>
In-Reply-To: <20250806161748.76651-1-ryncsn@gmail.com>
References: <20250806161748.76651-1-ryncsn@gmail.com>

From: Kairui Song

The frag_cluster_nr counter was used to work out how many clusters to walk
when the swap allocator scanned the whole fragment list. Now the allocator
only scans one fragment cluster at a time, so nothing uses this counter
anymore.
Remove it as a cleanup; the performance change is marginal:

Build the linux kernel using 10G ZRAM, make -j96, defconfig with a 2G cgroup
memory limit, on top of tmpfs, 64kB mTHP enabled:

Before: sys time: 6278.45s
After:  sys time: 6176.34s

Change to 8G ZRAM:

Before: sys time: 5572.85s
After:  sys time: 5531.49s

Signed-off-by: Kairui Song
Acked-by: Chris Li
Reviewed-by: Nhat Pham
---
 include/linux/swap.h | 1 -
 mm/swapfile.c        | 7 -------
 2 files changed, 8 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe6ed2cc3fd..a060d102e0d1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,7 +310,6 @@ struct swap_info_struct {
 	/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 	/* list of cluster that are fragmented or contented */
-	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1f1110e37f68..5fdb3cb2b8b7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -470,11 +470,6 @@ static void move_cluster(struct swap_info_struct *si,
 	else
 		list_move_tail(&ci->list, list);
 	spin_unlock(&si->lock);
-
-	if (ci->flags == CLUSTER_FLAG_FRAG)
-		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
-	else if (new_flags == CLUSTER_FLAG_FRAG)
-		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
 	ci->flags = new_flags;
 }
 
@@ -965,7 +960,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	 * allocation, but reclaim may drop si->lock and race with another user.
 	 */
 	while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
-		atomic_long_dec(&si->frag_cluster_nr[o]);
 		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 						0, usage);
 		if (found)
@@ -3217,7 +3211,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
 		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		atomic_long_set(&si->frag_cluster_nr[i], 0);
 	}
 
 	/*
--
2.50.1

From nobody Sun Oct 5 09:04:13 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, "Huang, Ying", linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 3/3] mm, swap: prefer nonfull over free clusters
Date: Thu, 7 Aug 2025 00:17:48 +0800
Message-ID: <20250806161748.76651-4-ryncsn@gmail.com>
In-Reply-To: <20250806161748.76651-1-ryncsn@gmail.com>
References: <20250806161748.76651-1-ryncsn@gmail.com>

From: Kairui Song

We prefer a free cluster over a nonfull cluster whenever a CPU local cluster
is drained, to respect the SSD discard behavior [1]. This is not the best
practice for non-discarding devices, and it causes a higher fragmentation
rate. So, for a non-discarding device, prefer nonfull clusters over free
clusters. This reduces the fragmentation issue by a lot.

Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:

Before: sys time: 6176.34s  64kB/swpout: 1659757  64kB/swpout_fallback: 139503
After:  sys time: 6194.11s  64kB/swpout: 1689470  64kB/swpout_fallback: 56147

Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:

Before: sys time: 5531.49s  64kB/swpout: 1791142  64kB/swpout_fallback: 17676
After:  sys time: 5587.53s  64kB/swpout: 1811598  64kB/swpout_fallback: 0

Performance is basically unchanged, and the large allocation failure rate is
lower. Enabling all mTHP sizes showed a more significant result. Using the
same test setup with 10G ZRAM and all mTHP sizes enabled:

128kB swap failure rate:
Before: swpout: 451599  swpout_fallback: 54525
After:  swpout: 502710  swpout_fallback: 870

256kB swap failure rate:
Before: swpout: 63652  swpout_fallback: 2708
After:  swpout: 65913  swpout_fallback: 20

512kB swap failure rate:
Before: swpout: 11663  swpout_fallback: 1767
After:  swpout: 14480  swpout_fallback: 6

2M swap failure rate:
Before: swpout: 24    swpout_fallback: 1442
After:  swpout: 1329  swpout_fallback: 7

The success rate of large allocations is much higher.

Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
Signed-off-by: Kairui Song
Acked-by: Chris Li
Reviewed-by: Nhat Pham
---
 mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5fdb3cb2b8b7..4a0cf4fb348d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	}
 
 new_cluster:
-	ci = isolate_lock_cluster(si, &si->free_clusters);
-	if (ci) {
-		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
-		if (found)
-			goto done;
+	/*
+	 * If the device need discard, prefer new cluster over nonfull
+	 * to spread out the writes.
+	 */
+	if (si->flags & SWP_PAGE_DISCARD) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
 	}
 
-	/* Try reclaim from full clusters if free clusters list is drained */
-	if (vm_swap_full())
-		swap_reclaim_full_clusters(si, false);
-
 	if (order < PMD_ORDER) {
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
@@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			if (found)
 				goto done;
 		}
+	}
 
+	if (!(si->flags & SWP_PAGE_DISCARD)) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
+	}
+
+	/* Try reclaim full clusters if free and nonfull lists are drained */
+	if (vm_swap_full())
+		swap_reclaim_full_clusters(si, false);
+
+	if (order < PMD_ORDER) {
 		/*
 		 * Scan only one fragment cluster is good enough. Order 0
 		 * allocation will surely success, and large allocation
--
2.50.1
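[Illustration only, not part of the patch: the new preference order can be
sketched as a small userspace model. The types and helper names below are
invented; the real code checks si->flags & SWP_PAGE_DISCARD inside
cluster_alloc_swap_entry() in mm/swapfile.c.]

    #include <stdbool.h>
    #include <stddef.h>

    struct cluster;                         /* opaque in this sketch */

    struct swap_dev {
            bool page_discard;              /* models SWP_PAGE_DISCARD */
    };

    /* Stubs standing in for isolate_lock_cluster() + alloc_swap_scan_cluster(). */
    static struct cluster *try_free_cluster(struct swap_dev *dev)
    {
            (void)dev;
            return NULL;
    }

    static struct cluster *try_nonfull_cluster(struct swap_dev *dev, int order)
    {
            (void)dev;
            (void)order;
            return NULL;
    }

    /*
     * Discarding devices still try a free cluster first so writes stay
     * spread out; all other devices prefer nonfull clusters to reduce
     * fragmentation and only fall back to free clusters afterwards.
     */
    static struct cluster *pick_cluster(struct swap_dev *dev, int order)
    {
            struct cluster *ci = NULL;

            if (dev->page_discard)
                    ci = try_free_cluster(dev);
            if (!ci)
                    ci = try_nonfull_cluster(dev, order);
            if (!ci && !dev->page_discard)
                    ci = try_free_cluster(dev);
            return ci;
    }

The sketch only captures the ordering decision; the reclaim of full clusters
and the one-cluster fragment scan from patch 1/3 still follow after these
steps in the real allocator.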