From nobody Thu Dec 18 10:00:10 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 01/13] mm, swap: minor clean up for swap entry allocation
Date: Tue, 14 Jan 2025 01:57:20 +0800
Message-ID: <20250113175732.48099-2-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

Direct reclaim can skip the whole folio after reclaiming a set of
folio-based slots. Also simplify the allocation code and reduce
indentation.

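As a rough userspace illustration of the skip-ahead scan described above
(not part of the patch): the slot states, the toy reclaim helper and the
folio size below are hypothetical stand-ins, assumed only for this sketch,
and no locking is modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot states, loosely modeled on swap_map values. */
enum { SLOT_FREE = 0, SLOT_CACHE_ONLY = 1, SLOT_IN_USE = 2 };

/* Hypothetical folio size, in slots, used by the toy reclaim helper. */
#define TOY_FOLIO_SLOTS 4

static int toy_reclaim_calls;

/*
 * Toy stand-in for a reclaim helper: frees up to TOY_FOLIO_SLOTS
 * consecutive cache-only slots starting at offset and returns how
 * many slots were freed (0 if none).
 */
static int toy_reclaim(unsigned char *map, size_t offset, size_t end)
{
	size_t nr = 0;

	toy_reclaim_calls++;
	while (nr < TOY_FOLIO_SLOTS && offset + nr < end &&
	       map[offset + nr] == SLOT_CACHE_ONLY) {
		map[offset + nr] = SLOT_FREE;
		nr++;
	}
	return (int)nr;
}

/*
 * Scan the non-empty range [start, end). Free slots advance the scan
 * by one; a successful reclaim advances it by the number of slots
 * reclaimed, so the rest of that folio is not visited again.
 * Returns true if the whole range ends up free.
 */
static bool reclaim_range(unsigned char *map, size_t start, size_t end)
{
	size_t offset = start;
	size_t i;
	int nr_reclaim;

	do {
		switch (map[offset]) {
		case SLOT_FREE:
			offset++;
			break;
		case SLOT_CACHE_ONLY:
			nr_reclaim = toy_reclaim(map, offset, end);
			if (nr_reclaim > 0)
				offset += (size_t)nr_reclaim;
			else
				goto out;
			break;
		default:
			goto out;
		}
	} while (offset < end);
out:
	/* Re-check the range, since the scan may have stopped early. */
	for (i = start; i < end; i++) {
		if (map[i] != SLOT_FREE)
			return false;
	}
	return true;
}
```

With a four-slot cache-only run inside the range, the toy helper is
called once for the whole run rather than once per slot, which is the
effect the loop change is after.
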
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 mm/swapfile.c | 59 +++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0a9071cfe1d..f8002f110104 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -604,23 +604,28 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 				  unsigned long start, unsigned long end)
 {
 	unsigned char *map = si->swap_map;
-	unsigned long offset;
+	unsigned long offset = start;
+	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
 	spin_unlock(&si->lock);
 
-	for (offset = start; offset < end; offset++) {
+	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
-			continue;
+			offset++;
+			break;
 		case SWAP_HAS_CACHE:
-			if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0)
-				continue;
-			goto out;
+			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
+			if (nr_reclaim > 0)
+				offset += nr_reclaim;
+			else
+				goto out;
+			break;
 		default:
 			goto out;
 		}
-	}
+	} while (offset < end);
 out:
 	spin_lock(&si->lock);
 	spin_lock(&ci->lock);
@@ -838,35 +843,30 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 							 &found, order, usage);
 			frags++;
 			if (found)
-				break;
+				goto done;
 		}
 
-		if (!found) {
+		/*
+		 * Nonfull clusters are moved to frag tail if we reached
+		 * here, count them too, don't over scan the frag list.
+		 */
+		while (frags < si->frag_cluster_nr[order]) {
+			ci = list_first_entry(&si->frag_clusters[order],
+					      struct swap_cluster_info, list);
 			/*
-			 * Nonfull clusters are moved to frag tail if we reached
-			 * here, count them too, don't over scan the frag list.
+			 * Rotate the frag list to iterate, they were all failing
+			 * high order allocation or moved here due to per-CPU usage,
+			 * this help keeping usable cluster ahead.
 			 */
-			while (frags < si->frag_cluster_nr[order]) {
-				ci = list_first_entry(&si->frag_clusters[order],
-						      struct swap_cluster_info, list);
-				/*
-				 * Rotate the frag list to iterate, they were all failing
-				 * high order allocation or moved here due to per-CPU usage,
-				 * this help keeping usable cluster ahead.
-				 */
-				list_move_tail(&ci->list, &si->frag_clusters[order]);
-				offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-								 &found, order, usage);
-				frags++;
-				if (found)
-					break;
-			}
+			list_move_tail(&ci->list, &si->frag_clusters[order]);
+			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+							 &found, order, usage);
+			frags++;
+			if (found)
+				goto done;
 		}
 	}
 
-	if (found)
-		goto done;
-
 	if (!list_empty(&si->discard_clusters)) {
 		/*
 		 * we don't have free cluster but have some clusters in
@@ -904,7 +904,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 				goto done;
 		}
 	}
-
 done:
 	cluster->next[order] = offset;
 	return found;
-- 
2.47.1

From nobody Thu Dec 18 10:00:10 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 02/13] mm, swap: fold swap_info_get_cont in the only caller
Date: Tue, 14 Jan 2025 01:57:21 +0800
Message-ID: <20250113175732.48099-3-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

The name of the function is confusing, and the code is much easier to
follow after folding it into its only caller. Also rename the
confusingly named "p" to the more meaningful "si".

Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 mm/swapfile.c | 39 +++++++++++++++------------------------
 1 file changed, 15 insertions(+), 24 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f8002f110104..574059158627 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1375,22 +1375,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
-static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
-					struct swap_info_struct *q)
-{
-	struct swap_info_struct *p;
-
-	p = _swap_info_get(entry);
-
-	if (p != q) {
-		if (q != NULL)
-			spin_unlock(&q->lock);
-		if (p != NULL)
-			spin_lock(&p->lock);
-	}
-	return p;
-}
-
 static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
 					      unsigned long offset,
 					      unsigned char usage)
@@ -1687,14 +1671,14 @@ static int swp_entry_cmp(const void *ent1, const void *ent2)
 
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
-	struct swap_info_struct *p, *prev;
+	struct swap_info_struct *si, *prev;
 	int i;
 
 	if (n <= 0)
 		return;
 
 	prev = NULL;
-	p = NULL;
+	si = NULL;
 
 	/*
 	 * Sort swap entries by swap device, so each lock is only taken once.
@@ -1704,13 +1688,20 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
 	if (nr_swapfiles > 1)
 		sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
 	for (i = 0; i < n; ++i) {
-		p = swap_info_get_cont(entries[i], prev);
-		if (p)
-			swap_entry_range_free(p, entries[i], 1);
-		prev = p;
+		si = _swap_info_get(entries[i]);
+
+		if (si != prev) {
+			if (prev != NULL)
+				spin_unlock(&prev->lock);
+			if (si != NULL)
+				spin_lock(&si->lock);
+		}
+		if (si)
+			swap_entry_range_free(si, entries[i], 1);
+		prev = si;
 	}
-	if (p)
-		spin_unlock(&p->lock);
+	if (si)
+		spin_unlock(&si->lock);
 }
 
 int __swap_count(swp_entry_t entry)
-- 
2.47.1

From nobody Thu Dec 18 10:00:10 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 03/13] mm, swap: remove old allocation path for HDD
Date: Tue, 14 Jan 2025 01:57:22 +0800
Message-ID: <20250113175732.48099-4-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

We currently use different swap allocation algorithms for HDD and
non-HDD devices. This leads to a different set of locks for each, and
the code path is heavily bloated, which makes further optimization and
maintenance difficult.

This commit removes the HDD-specific swap allocation path and the
related dead code, and uses the cluster allocation algorithm instead.

Performance may drop temporarily, but the drop should be negligible:
the main advantage of the legacy HDD allocation algorithm is that it
tends to use contiguous slots, but a swap device gets fragmented
quickly anyway, so attempts to use contiguous slots fail easily.

This commit also enables mTHP swap on HDD, which is expected to be
beneficial, and the following commits will adapt and optimize the
cluster allocator for HDD.

Suggested-by: Chris Li
Suggested-by: "Huang, Ying"
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 include/linux/swap.h |   3 -
 mm/swapfile.c        | 235 ++-----------------------------------------
 2 files changed, 9 insertions(+), 229 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 187715eec3cb..0c681aa5cb98 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,9 +310,6 @@ struct swap_info_struct {
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
 	unsigned int inuse_pages;	/* number of those currently in use */
-	unsigned int cluster_next;	/* likely index for next allocation */
-	unsigned int cluster_nr;	/* countdown to next cluster search */
-	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 574059158627..fca58d43b836 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1001,49 +1001,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
 }
 
-static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
-{
-	unsigned long prev;
-
-	if (!(si->flags & SWP_SOLIDSTATE)) {
-		si->cluster_next = next;
-		return;
-	}
-
-	prev = this_cpu_read(*si->cluster_next_cpu);
-	/*
-	 * Cross the swap address space size aligned trunk, choose
-	 * another trunk randomly to avoid lock contention on swap
-	 * address space if possible.
-	 */
-	if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
-	    (next >> SWAP_ADDRESS_SPACE_SHIFT)) {
-		/* No free swap slots available */
-		if (si->highest_bit <= si->lowest_bit)
-			return;
-		next = get_random_u32_inclusive(si->lowest_bit, si->highest_bit);
-		next = ALIGN_DOWN(next, SWAP_ADDRESS_SPACE_PAGES);
-		next = max_t(unsigned int, next, si->lowest_bit);
-	}
-	this_cpu_write(*si->cluster_next_cpu, next);
-}
-
-static bool swap_offset_available_and_locked(struct swap_info_struct *si,
-					     unsigned long offset)
-{
-	if (data_race(!si->swap_map[offset])) {
-		spin_lock(&si->lock);
-		return true;
-	}
-
-	if (vm_swap_full() && READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
-		spin_lock(&si->lock);
-		return true;
-	}
-
-	return false;
-}
-
 static int cluster_alloc_swap(struct swap_info_struct *si,
 			     unsigned char usage, int nr,
 			     swp_entry_t slots[], int order)
@@ -1071,13 +1028,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
 			       swp_entry_t slots[], int order)
 {
-	unsigned long offset;
-	unsigned long scan_base;
-	unsigned long last_in_cluster = 0;
-	int latency_ration = LATENCY_LIMIT;
 	unsigned int nr_pages = 1 << order;
-	int n_ret = 0;
-	bool scanned_many = false;
 
 	/*
 	 * We try to cluster swap pages by allocating them sequentially
@@ -1089,7 +1040,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	 * But we do now try to find an empty cluster.  -Andrea
 	 * And we let swap pages go all over an SSD partition.  Hugh
 	 */
-
 	if (order > 0) {
 		/*
 		 * Should not even be attempting large allocations when huge
@@ -1109,158 +1059,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 			return 0;
 	}
 
-	if (si->cluster_info)
-		return cluster_alloc_swap(si, usage, nr, slots, order);
-
-	si->flags += SWP_SCANNING;
-
-	/* For HDD, sequential access is more important. */
-	scan_base = si->cluster_next;
-	offset = scan_base;
-
-	if (unlikely(!si->cluster_nr--)) {
-		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
-			si->cluster_nr = SWAPFILE_CLUSTER - 1;
-			goto checks;
-		}
-
-		spin_unlock(&si->lock);
-
-		/*
-		 * If seek is expensive, start searching for new cluster from
-		 * start of partition, to minimize the span of allocated swap.
-		 */
-		scan_base = offset = si->lowest_bit;
-		last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
-
-		/* Locate the first empty (unaligned) cluster */
-		for (; last_in_cluster <= READ_ONCE(si->highest_bit); offset++) {
-			if (si->swap_map[offset])
-				last_in_cluster = offset + SWAPFILE_CLUSTER;
-			else if (offset == last_in_cluster) {
-				spin_lock(&si->lock);
-				offset -= SWAPFILE_CLUSTER - 1;
-				si->cluster_next = offset;
-				si->cluster_nr = SWAPFILE_CLUSTER - 1;
-				goto checks;
-			}
-			if (unlikely(--latency_ration < 0)) {
-				cond_resched();
-				latency_ration = LATENCY_LIMIT;
-			}
-		}
-
-		offset = scan_base;
-		spin_lock(&si->lock);
-		si->cluster_nr = SWAPFILE_CLUSTER - 1;
-	}
-
-checks:
-	if (!(si->flags & SWP_WRITEOK))
-		goto no_page;
-	if (!si->highest_bit)
-		goto no_page;
-	if (offset > si->highest_bit)
-		scan_base = offset = si->lowest_bit;
-
-	/* reuse swap entry of cache-only swap if not busy. */
-	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
-		int swap_was_freed;
-		spin_unlock(&si->lock);
-		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
-		spin_lock(&si->lock);
-		/* entry was freed successfully, try to use this again */
-		if (swap_was_freed > 0)
-			goto checks;
-		goto scan; /* check next one */
-	}
-
-	if (si->swap_map[offset]) {
-		if (!n_ret)
-			goto scan;
-		else
-			goto done;
-	}
-	memset(si->swap_map + offset, usage, nr_pages);
-
-	swap_range_alloc(si, offset, nr_pages);
-	slots[n_ret++] = swp_entry(si->type, offset);
-
-	/* got enough slots or reach max slots? */
-	if ((n_ret == nr) || (offset >= si->highest_bit))
-		goto done;
-
-	/* search for next available slot */
-
-	/* time to take a break? */
-	if (unlikely(--latency_ration < 0)) {
-		if (n_ret)
-			goto done;
-		spin_unlock(&si->lock);
-		cond_resched();
-		spin_lock(&si->lock);
-		latency_ration = LATENCY_LIMIT;
-	}
-
-	if (si->cluster_nr && !si->swap_map[++offset]) {
-		/* non-ssd case, still more slots in cluster? */
-		--si->cluster_nr;
-		goto checks;
-	}
-
-	/*
-	 * Even if there's no free clusters available (fragmented),
-	 * try to scan a little more quickly with lock held unless we
-	 * have scanned too many slots already.
- */ - if (!scanned_many) { - unsigned long scan_limit; - - if (offset < scan_base) - scan_limit =3D scan_base; - else - scan_limit =3D si->highest_bit; - for (; offset <=3D scan_limit && --latency_ration > 0; - offset++) { - if (!si->swap_map[offset]) - goto checks; - } - } - -done: - if (order =3D=3D 0) - set_cluster_next(si, offset + 1); - si->flags -=3D SWP_SCANNING; - return n_ret; - -scan: - VM_WARN_ON(order > 0); - spin_unlock(&si->lock); - while (++offset <=3D READ_ONCE(si->highest_bit)) { - if (unlikely(--latency_ration < 0)) { - cond_resched(); - latency_ration =3D LATENCY_LIMIT; - scanned_many =3D true; - } - if (swap_offset_available_and_locked(si, offset)) - goto checks; - } - offset =3D si->lowest_bit; - while (offset < scan_base) { - if (unlikely(--latency_ration < 0)) { - cond_resched(); - latency_ration =3D LATENCY_LIMIT; - scanned_many =3D true; - } - if (swap_offset_available_and_locked(si, offset)) - goto checks; - offset++; - } - spin_lock(&si->lock); - -no_page: - si->flags -=3D SWP_SCANNING; - return n_ret; + return cluster_alloc_swap(si, usage, nr, slots, order); } =20 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) @@ -2871,8 +2670,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special= file) mutex_unlock(&swapon_mutex); free_percpu(p->percpu_cluster); p->percpu_cluster =3D NULL; - free_percpu(p->cluster_next_cpu); - p->cluster_next_cpu =3D NULL; vfree(swap_map); kvfree(zeromap); kvfree(cluster_info); @@ -3184,8 +2981,6 @@ static unsigned long read_swap_header(struct swap_inf= o_struct *si, } =20 si->lowest_bit =3D 1; - si->cluster_next =3D 1; - si->cluster_nr =3D 0; =20 maxpages =3D swapfile_maximum_size; last_page =3D swap_header->info.last_page; @@ -3271,7 +3066,6 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, unsigned long maxpages) { unsigned long nr_clusters =3D DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); - unsigned long col =3D si->cluster_next / SWAPFILE_CLUSTER % 
SWAP_CLUSTER_COLS;
 	struct swap_cluster_info *cluster_info;
 	unsigned long i, j, k, idx;
 	int cpu, err = -ENOMEM;
@@ -3283,15 +3077,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	si->cluster_next_cpu = alloc_percpu(unsigned int);
-	if (!si->cluster_next_cpu)
-		goto err_free;
-
-	/* Random start position to help with wear leveling */
-	for_each_possible_cpu(cpu)
-		per_cpu(*si->cluster_next_cpu, cpu) =
-		get_random_u32_inclusive(1, si->highest_bit);
-
 	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
 	if (!si->percpu_cluster)
 		goto err_free;
@@ -3333,7 +3118,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	 * sharing same address space.
 	 */
 	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
-		j = (k + col) % SWAP_CLUSTER_COLS;
+		j = k % SWAP_CLUSTER_COLS;
 		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
 			struct swap_cluster_info *ci;
 			idx = i * SWAP_CLUSTER_COLS + j;
@@ -3483,18 +3268,18 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	if (si->bdev && bdev_nonrot(si->bdev)) {
 		si->flags |= SWP_SOLIDSTATE;
-
-		cluster_info = setup_clusters(si, swap_header, maxpages);
-		if (IS_ERR(cluster_info)) {
-			error = PTR_ERR(cluster_info);
-			cluster_info = NULL;
-			goto bad_swap_unlock_inode;
-		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
 	}
 
+	cluster_info = setup_clusters(si, swap_header, maxpages);
+	if (IS_ERR(cluster_info)) {
+		error = PTR_ERR(cluster_info);
+		cluster_info = NULL;
+		goto bad_swap_unlock_inode;
+	}
+
 	if ((swap_flags & SWAP_FLAG_DISCARD) &&
 	    si->bdev && bdev_max_discard_sectors(si->bdev)) {
 		/*
@@ -3575,8 +3360,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap:
 	free_percpu(si->percpu_cluster);
 	si->percpu_cluster = NULL;
-	free_percpu(si->cluster_next_cpu);
-	si->cluster_next_cpu = NULL;
 	inode = NULL;
 	destroy_swap_extents(si);
 	swap_cgroup_swapoff(si->type);
-- 
2.47.1

From nobody Thu Dec 18 10:00:10 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 04/13] mm, swap: use cluster lock for HDD
Date: Tue, 14 Jan 2025 01:57:23 +0800
Message-ID: <20250113175732.48099-5-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

Cluster lock (ci->lock) was introduced to reduce contention for certain
operations. Using the cluster lock for HDD is not helpful, as HDDs have
poor performance and locking isn't the bottleneck. But having a different
set of locks for HDD and non-HDD devices prevents further rework of the
device lock (si->lock).

This commit changes all lock_cluster_or_swap_info callers to use
lock_cluster, which is a safe and straightforward conversion since
cluster info is now always allocated, and also removes all cluster_info
related checks.
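
For illustration only, the locking model after this change can be pictured with a small user-space sketch. It is not the kernel code: an atomic_flag stands in for the kernel spinlock, the SWAPFILE_CLUSTER value is an arbitrary example, and init_clusters() is a hypothetical helper. It only shows that, once cluster_info is always allocated, lock_cluster() can index the array unconditionally with no si->lock fallback.

```c
/* User-space sketch of per-cluster locking; types are simplified stand-ins. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SWAPFILE_CLUSTER 512UL	/* slots per cluster, illustrative value */

struct swap_cluster_info {
	atomic_flag lock;	/* stand-in for the kernel spinlock */
	unsigned int count;
};

struct swap_info_struct {
	struct swap_cluster_info *cluster_info;	/* assumed always allocated */
	size_t nr_clusters;
};

/* Hypothetical helper: attach an array of clusters with all locks released. */
static void init_clusters(struct swap_info_struct *si,
			  struct swap_cluster_info *ci, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		atomic_flag_clear(&ci[i].lock);
		ci[i].count = 0;
	}
	si->cluster_info = ci;
	si->nr_clusters = nr;
}

/* No NULL check and no device-lock fallback: every offset has a cluster. */
static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
					      unsigned long offset)
{
	size_t idx = offset / SWAPFILE_CLUSTER;
	struct swap_cluster_info *ci;

	assert(idx < si->nr_clusters);
	ci = &si->cluster_info[idx];
	while (atomic_flag_test_and_set_explicit(&ci->lock, memory_order_acquire))
		;	/* spin until the flag is released */
	return ci;
}

static void unlock_cluster(struct swap_cluster_info *ci)
{
	atomic_flag_clear_explicit(&ci->lock, memory_order_release);
}
```

With this shape, callers no longer need to pass si to the unlock side, which is why unlock_cluster() takes only ci.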

Suggested-by: Chris Li
Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 mm/swapfile.c | 109 ++++++++++++++++----------------------------------
 1 file changed, 35 insertions(+), 74 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index fca58d43b836..83ebc24cc94b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,10 +58,9 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
-static struct swap_cluster_info *lock_cluster_or_swap_info(
-		struct swap_info_struct *si, unsigned long offset);
-static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
-					struct swap_cluster_info *ci);
+static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
+					      unsigned long offset);
+static void unlock_cluster(struct swap_cluster_info *ci);
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -222,9 +221,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * swap_map is HAS_CACHE only, which means the slots have no page table
 	 * reference or pending writeback, and can't be allocated to others.
 	 */
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	need_reclaim = swap_is_has_cache(si, offset, nr_pages);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	if (!need_reclaim)
 		goto out_unlock;
 
@@ -404,45 +403,15 @@ static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si
 {
 	struct swap_cluster_info *ci;
 
-	ci = si->cluster_info;
-	if (ci) {
-		ci += offset / SWAPFILE_CLUSTER;
-		spin_lock(&ci->lock);
-	}
-	return ci;
-}
-
-static inline void unlock_cluster(struct swap_cluster_info *ci)
-{
-	if (ci)
-		spin_unlock(&ci->lock);
-}
-
-/*
- * Determine the locking method in use for this device. Return
- * swap_cluster_info if SSD-style cluster-based locking is in place.
- */
-static inline struct swap_cluster_info *lock_cluster_or_swap_info(
-		struct swap_info_struct *si, unsigned long offset)
-{
-	struct swap_cluster_info *ci;
-
-	/* Try to use fine-grained SSD-style locking if available: */
-	ci = lock_cluster(si, offset);
-	/* Otherwise, fall back to traditional, coarse locking: */
-	if (!ci)
-		spin_lock(&si->lock);
+	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	spin_lock(&ci->lock);
 
 	return ci;
 }
 
-static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
-					       struct swap_cluster_info *ci)
+static inline void unlock_cluster(struct swap_cluster_info *ci)
 {
-	if (ci)
-		unlock_cluster(ci);
-	else
-		spin_unlock(&si->lock);
+	spin_unlock(&ci->lock);
 }
 
 /* Add a cluster to discard list and schedule it to do discard */
@@ -558,9 +527,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 	struct swap_cluster_info *ci;
 
-	if (!cluster_info)
-		return;
-
 	ci = cluster_info + idx;
 	ci->count++;
 
@@ -576,9 +542,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 static void dec_cluster_info_page(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci, int nr_pages)
 {
-	if (!si->cluster_info)
-		return;
-
 	VM_BUG_ON(ci->count < nr_pages);
 	VM_BUG_ON(cluster_is_free(ci));
 	lockdep_assert_held(&si->lock);
@@ -940,7 +903,7 @@ static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
 		si->highest_bit = 0;
 		del_from_avail_list(si);
 
-		if (si->cluster_info && vm_swap_full())
+		if (vm_swap_full())
 			schedule_work(&si->reclaim_work);
 	}
 }
@@ -1007,8 +970,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 {
 	int n_ret = 0;
 
-	VM_BUG_ON(!si->cluster_info);
-
 	si->flags += SWP_SCANNING;
 
 	while (n_ret < nr) {
@@ -1052,10 +1013,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		}
 
 		/*
-		 * Swapfile is not block device or not using clusters so unable
+		 * Swapfile is not block device so unable
 		 * to allocate large entries.
 		 */
-		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
+		if (!(si->flags & SWP_BLKDEV))
 			return 0;
 	}
 
@@ -1295,9 +1256,9 @@ static unsigned char __swap_entry_free(struct swap_info_struct *si,
 	unsigned long offset = swp_offset(entry);
 	unsigned char usage;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	usage = __swap_entry_free_locked(si, offset, 1);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	if (!usage)
 		free_swap_slot(entry);
 
@@ -1320,14 +1281,14 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
 		goto fallback;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	if (!swap_is_last_map(si, offset, nr, &has_cache)) {
-		unlock_cluster_or_swap_info(si, ci);
+		unlock_cluster(ci);
 		goto fallback;
 	}
 	for (i = 0; i < nr; i++)
 		WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 
 	if (!has_cache) {
 		for (i = 0; i < nr; i++)
@@ -1383,7 +1344,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
 	int i, nr;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	while (nr_pages) {
 		nr = min(BITS_PER_LONG, nr_pages);
 		for (i = 0; i < nr; i++) {
@@ -1391,18 +1352,18 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 				bitmap_set(to_free, i, 1);
 		}
 		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
-			unlock_cluster_or_swap_info(si, ci);
+			unlock_cluster(ci);
 			for_each_set_bit(i, to_free, BITS_PER_LONG)
 				free_swap_slot(swp_entry(si->type, offset + i));
 			if (nr == nr_pages)
 				return;
 			bitmap_clear(to_free, 0, BITS_PER_LONG);
-			ci = lock_cluster_or_swap_info(si, offset);
+			ci = lock_cluster(si, offset);
 		}
 		offset += nr;
 		nr_pages -= nr;
 	}
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 }
 
 /*
@@ -1441,9 +1402,9 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	if (!si)
 		return;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	if (size > 1 && swap_is_has_cache(si, offset, size)) {
-		unlock_cluster_or_swap_info(si, ci);
+		unlock_cluster(ci);
 		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, size);
 		spin_unlock(&si->lock);
@@ -1451,14 +1412,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	}
 	for (int i = 0; i < size; i++, entry.val++) {
 		if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
-			unlock_cluster_or_swap_info(si, ci);
+			unlock_cluster(ci);
 			free_swap_slot(entry);
 			if (i == size - 1)
 				return;
-			lock_cluster_or_swap_info(si, offset);
+			lock_cluster(si, offset);
 		}
 	}
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 }
 
 static int swp_entry_cmp(const void *ent1, const void *ent2)
@@ -1522,9 +1483,9 @@ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 	struct swap_cluster_info *ci;
 	int count;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	count = swap_count(si->swap_map[offset]);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return count;
 }
 
@@ -1547,7 +1508,7 @@ int swp_swapcount(swp_entry_t entry)
 
 	offset = swp_offset(entry);
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 
 	count = swap_count(si->swap_map[offset]);
 	if (!(count & COUNT_CONTINUED))
@@ -1570,7 +1531,7 @@ int swp_swapcount(swp_entry_t entry)
 		n *= (SWAP_CONT_MAX + 1);
 	} while (tmp_count & COUNT_CONTINUED);
 out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return count;
 }
 
@@ -1585,8 +1546,8 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 	int i;
 	bool ret = false;
 
-	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || nr_pages == 1) {
+	ci = lock_cluster(si, offset);
+	if (nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
@@ -1598,7 +1559,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 		}
 	}
 unlock_out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return ret;
 }
 
@@ -3428,7 +3389,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	offset = swp_offset(entry);
 	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
 	VM_WARN_ON(usage == 1 && nr > 1);
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 
 	err = 0;
 	for (i = 0; i < nr; i++) {
@@ -3483,7 +3444,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	}
 
 unlock_out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return err;
 }
 
-- 
2.47.1

From nobody Thu Dec 18 10:00:10 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 05/13] mm, swap: clean up device availability check
Date: Tue, 14 Jan 2025 01:57:24 +0800
Message-ID: <20250113175732.48099-6-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

Remove highest_bit and lowest_bit. After the HDD allocation path has been
removed, the only purpose of these two fields is to determine whether the
device is full or not, which can instead be determined by checking the
inuse_pages.

Signed-off-by: Kairui Song
Reviewed-by: Baoquan He
---
 fs/btrfs/inode.c     |  1 -
 fs/f2fs/data.c       |  1 -
 fs/iomap/swapfile.c  |  1 -
 include/linux/swap.h |  2 --
 mm/page_io.c         |  1 -
 mm/swapfile.c        | 38 ++++++++------------------------------
 6 files changed, 8 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 27b2fe7f735d..3b99b1e19371 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10110,7 +10110,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		*span = bsi.highest_ppage - bsi.lowest_ppage + 1;
 	sis->max = bsi.nr_pages;
 	sis->pages = bsi.nr_pages - 1;
-	sis->highest_bit = bsi.nr_pages - 1;
 	return bsi.nr_extents;
 }
 #else
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index a2478c2afb3a..a9eddd782dbc 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -4043,7 +4043,6 @@ static int check_swap_activate(struct swap_info_struct *sis,
 		cur_lblock = 1;	/* force Empty message */
 	sis->max = cur_lblock;
 	sis->pages = cur_lblock - 1;
-	sis->highest_bit = cur_lblock - 1;
 out:
 	if (not_aligned)
 		f2fs_warn(sbi, "Swapfile (%u) is not align to section: 1) creat(), 2) ioctl(F2FS_IOC_SET_PIN_FILE), 3) fallocate(%lu * N)",
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index 5fc0ac36dee3..b90d0eda9e51 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -189,7 +189,6 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 	*pagespan = 1 + isi.highest_ppage - isi.lowest_ppage;
 	sis->max = isi.nr_pages;
 	sis->pages = isi.nr_pages - 1;
-	sis->highest_bit = isi.nr_pages - 1;
 	return isi.nr_extents;
 }
 EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0c681aa5cb98..0c222017b5c6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -306,8 +306,6 @@ struct swap_info_struct {
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
 	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
-	unsigned int lowest_bit;	/* index of first free in swap_map */
-	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
 	unsigned int inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
diff --git a/mm/page_io.c b/mm/page_io.c
index 4b4ea8e49cf6..9b983de351f9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -163,7 +163,6 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 		page_no = 1;	/* force Empty message */
 	sis->max = page_no;
 	sis->pages = page_no - 1;
-	sis->highest_bit = page_no - 1;
 out:
 	return ret;
 bad_bmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 83ebc24cc94b..2686032d3510 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -55,7 +55,7 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 static void free_swap_count_continuations(struct swap_info_struct *);
 static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
 				  unsigned int nr_pages);
-static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
 static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
@@ -650,7 +650,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	}
 
 	memset(si->swap_map + start, usage, nr_pages);
-	swap_range_alloc(si, start, nr_pages);
+	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
 	if (ci->count == SWAPFILE_CLUSTER) {
@@ -888,19 +888,11 @@ static void del_from_avail_list(struct swap_info_struct *si)
 	spin_unlock(&swap_avail_lock);
 }
 
-static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries)
 {
-	unsigned int end = offset + nr_entries - 1;
-
-	if (offset == si->lowest_bit)
-		si->lowest_bit += nr_entries;
-	if (end == si->highest_bit)
-		WRITE_ONCE(si->highest_bit, si->highest_bit - nr_entries);
 	WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
 	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
 		del_from_avail_list(si);
 
 		if (vm_swap_full())
@@ -933,15 +925,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	for (i = 0; i < nr_entries; i++)
 		clear_bit(offset + i, si->zeromap);
 
-	if (offset < si->lowest_bit)
-		si->lowest_bit = offset;
-	if (end > si->highest_bit) {
-		bool was_full = !si->highest_bit;
-
-		WRITE_ONCE(si->highest_bit, end);
-		if (was_full && (si->flags & SWP_WRITEOK))
-			add_to_avail_list(si);
-	}
+	if (si->inuse_pages == si->pages)
+		add_to_avail_list(si);
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
 			si->bdev->bd_disk->fops->swap_slot_free_notify;
@@ -1051,15 +1036,12 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
-		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
+		if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
 			spin_lock(&swap_avail_lock);
 			if (plist_node_empty(&si->avail_lists[node])) {
 				spin_unlock(&si->lock);
 				goto nextsi;
 			}
-			WARN(!si->highest_bit,
-			     "swap_info %d in list but !highest_bit\n",
-			     si->type);
 			WARN(!(si->flags & SWP_WRITEOK),
 			     "swap_info %d in list but !SWP_WRITEOK\n",
 			     si->type);
@@ -2441,8 +2423,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 	 */
 	plist_add(&si->list, &swap_active_head);
 
-	/* add to available list iff swap device is not full */
-	if (si->highest_bit)
+	/* add to available list if swap device is not full */
+	if (si->inuse_pages < si->pages)
 		add_to_avail_list(si);
 }
 
@@ -2606,7 +2588,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	drain_mmlist();
 
 	/* wait for anyone still in scan_swap_map_slots */
-	p->highest_bit = 0;		/* cuts scans short */
 	while (p->flags >= SWP_SCANNING) {
 		spin_unlock(&p->lock);
 		spin_unlock(&swap_lock);
@@ -2941,8 +2922,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 		return 0;
 	}
 
-	si->lowest_bit = 1;
-
 	maxpages = swapfile_maximum_size;
 	last_page = swap_header->info.last_page;
 	if (!last_page) {
@@ -2959,7 +2938,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 		if ((unsigned int)maxpages == 0)
 			maxpages = UINT_MAX;
 	}
-	si->highest_bit = maxpages - 1;
 
 	if (!maxpages)
 		return 0;
-- 
2.47.1

From nobody Thu Dec 18 10:00:10 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 06/13] mm, swap: clean up plist removal and adding
Date: Tue, 14 Jan 2025 01:57:25 +0800
Message-ID: <20250113175732.48099-7-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

When the swap device is full (inuse_pages == pages), it should be
removed from the allocation available plist. If any slot is freed, the
swap device should be added back to the plist. Additionally, during
swapon or swapoff, the swap device is forcefully added or removed.

Currently, the condition (inuse_pages == pages) is checked after every
counter update, and the device is then removed or added accordingly.
This is serialized by si->lock.

This commit decouples this from the protection of si->lock and reworks
plist removal and adding, making it possible to get rid of the hard
dependency on si->lock in the allocation path in later commits.

To achieve this, simply using another lock is not an optimal approach,
as the overhead is observable for a hot counter, and it may cause
complex locking issues. Thus, this commit makes it a lock-free atomic
operation by embedding the plist state into the second highest bit of
the atomic counter.

Simply making the counter atomic will not work: if the update and the
plist status check are not performed atomically, we may miss an addition
or removal. With the embedded info we can update the counter and check
the plist status with a single atomic operation, and avoid any extra
overhead:

If the counter is full (inuse_pages == pages) and the off-list bit is
unset, we attempt to remove it from the plist.

If the counter is not full (inuse_pages != pages) and the off-list bit
is set, we attempt to add it to the plist.

Removing, adding and the bit update are serialized with a lock, which is
a cold path. Ordinary counter updates will be lock-free.
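
As a rough illustration of the scheme described above, here is a user-space sketch using C11 atomics. It is not the kernel code in this patch: struct swap_dev, take_off_list(), put_on_list(), usage_add() and usage_sub() are hypothetical names, the on_plist flag stands in for real plist membership, the bit position is chosen from the width of unsigned long purely for the example, and the lock that serializes list changes is left out because the sketch is single-threaded.

```c
/* User-space sketch; names are hypothetical, not the kernel API. */
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Second highest bit of the counter marks "device is off the list". */
#define USAGE_OFFLIST_BIT  (1UL << (sizeof(unsigned long) * CHAR_BIT - 2))
#define USAGE_COUNTER_MASK (~USAGE_OFFLIST_BIT)

struct swap_dev {
	unsigned long pages;		/* total usable slots */
	atomic_ulong inuse_pages;	/* slots in use, plus the off-list bit */
	bool on_plist;			/* stand-in for plist membership */
};

static unsigned long usage_in_pages(struct swap_dev *d)
{
	return atomic_load(&d->inuse_pages) & USAGE_COUNTER_MASK;
}

/* Cold path: succeeds only if the device is exactly full and still listed. */
static void take_off_list(struct swap_dev *d)
{
	unsigned long expected = d->pages;

	if (atomic_compare_exchange_strong(&d->inuse_pages, &expected,
					   d->pages | USAGE_OFFLIST_BIT))
		d->on_plist = false;
}

/* Cold path: clear the bit, but stay off the list if the device is full. */
static void put_on_list(struct swap_dev *d)
{
	unsigned long old, expected;

	if (!(atomic_load(&d->inuse_pages) & USAGE_OFFLIST_BIT))
		return;
	old = atomic_fetch_and(&d->inuse_pages, USAGE_COUNTER_MASK);
	expected = d->pages;
	if ((old & USAGE_COUNTER_MASK) == d->pages &&
	    atomic_compare_exchange_strong(&d->inuse_pages, &expected,
					   d->pages | USAGE_OFFLIST_BIT))
		return;
	d->on_plist = true;
}

/* Hot path: one atomic add; the result tells us whether we just filled up. */
static void usage_add(struct swap_dev *d, unsigned long n)
{
	unsigned long val = atomic_fetch_add(&d->inuse_pages, n) + n;

	if (val == d->pages)
		take_off_list(d);
}

/* Hot path: one atomic sub; the off-list bit in the result triggers re-add. */
static void usage_sub(struct swap_dev *d, unsigned long n)
{
	unsigned long val;

	assert(usage_in_pages(d) >= n);
	val = atomic_fetch_sub(&d->inuse_pages, n) - n;
	if (val & USAGE_OFFLIST_BIT)
		put_on_list(d);
}
```

Because the list state travels in the same word as the counter, the value returned by a single add or sub is enough to decide whether the cold path has to run at all.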
Signed-off-by: Kairui Song --- include/linux/swap.h | 2 +- mm/swapfile.c | 186 +++++++++++++++++++++++++++++++------------ 2 files changed, 138 insertions(+), 50 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 0c222017b5c6..e1eeea6307cd 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -307,7 +307,7 @@ struct swap_info_struct { /* list of cluster that are fragmented or contented */ unsigned int frag_cluster_nr[SWAP_NR_ORDERS]; unsigned int pages; /* total of usable pages of swap */ - unsigned int inuse_pages; /* number of those currently in use */ + atomic_long_t inuse_pages; /* number of those currently in use */ struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap locatio= n */ struct rb_root swap_extent_root;/* root of the swap extent rbtree */ struct block_device *bdev; /* swap device or bdev of swap file */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 2686032d3510..91faf2073006 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -128,6 +128,26 @@ static inline unsigned char swap_count(unsigned char e= nt) return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ } =20 +/* + * Use the second highest bit of inuse_pages counter as the indicator + * if one swap device is on the available plist, so the atomic can + * still be updated arithmetically while having special data embedded. + * + * inuse_pages counter is the only thing indicating if a device should + * be on avail_lists or not (except swapon / swapoff). By embedding the + * off-list bit in the atomic counter, updates no longer need any lock + * to check the list status. + * + * This bit will be set if the device is not on the plist and not + * usable, will be cleared if the device is on the plist. 
+ */
+#define SWAP_USAGE_OFFLIST_BIT (1UL << (BITS_PER_TYPE(atomic_t) - 2))
+#define SWAP_USAGE_COUNTER_MASK (~SWAP_USAGE_OFFLIST_BIT)
+static long swap_usage_in_pages(struct swap_info_struct *si)
+{
+	return atomic_long_read(&si->inuse_pages) & SWAP_USAGE_COUNTER_MASK;
+}
+
 /* Reclaim the swap entry anyway if possible */
 #define TTRS_ANYWAY		0x1
 /*
@@ -717,7 +737,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 	int nr_reclaim;
 
 	if (force)
-		to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
+		to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
 
 	while (!list_empty(&si->full_clusters)) {
 		ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
@@ -872,42 +892,128 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	return found;
 }
 
-static void __del_from_avail_list(struct swap_info_struct *si)
+/* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
+static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 {
 	int nid;
+	unsigned long pages;
+
+	spin_lock(&swap_avail_lock);
+
+	if (swapoff) {
+		/*
+		 * Forcefully remove it. Clear the SWP_WRITEOK flags for
+		 * swapoff here so it's synchronized by both si->lock and
+		 * swap_avail_lock, to ensure the result can be seen by
+		 * add_to_avail_list.
+		 */
+		lockdep_assert_held(&si->lock);
+		si->flags &= ~SWP_WRITEOK;
+		atomic_long_or(SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
+	} else {
+		/*
+		 * If not called by swapoff, take it off-list only if it's
+		 * full and SWAP_USAGE_OFFLIST_BIT is not set (strictly
+		 * si->inuse_pages == pages), any concurrent slot freeing,
+		 * or device already removed from plist by someone else
+		 * will make this return false.
+		 */
+		pages = si->pages;
+		if (!atomic_long_try_cmpxchg(&si->inuse_pages, &pages,
+					     pages | SWAP_USAGE_OFFLIST_BIT))
+			goto skip;
+	}
 
-	assert_spin_locked(&si->lock);
 	for_each_node(nid)
 		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
+
+skip:
+	spin_unlock(&swap_avail_lock);
 }
 
-static void del_from_avail_list(struct swap_info_struct *si)
+/* SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper. */
+static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 {
+	int nid;
+	long val;
+	unsigned long pages;
+
 	spin_lock(&swap_avail_lock);
-	__del_from_avail_list(si);
+
+	/* Corresponding to SWP_WRITEOK clearing in del_from_avail_list */
+	if (swapon) {
+		lockdep_assert_held(&si->lock);
+		si->flags |= SWP_WRITEOK;
+	} else {
+		if (!(READ_ONCE(si->flags) & SWP_WRITEOK))
+			goto skip;
+	}
+
+	if (!(atomic_long_read(&si->inuse_pages) & SWAP_USAGE_OFFLIST_BIT))
+		goto skip;
+
+	val = atomic_long_fetch_and_relaxed(~SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
+
+	/*
+	 * When device is full and device is on the plist, only one updater will
+	 * see (inuse_pages == si->pages) and will call del_from_avail_list. If
+	 * that updater happen to be here, just skip adding.
+	 */
+	pages = si->pages;
+	if (val == pages) {
+		/* Just like the cmpxchg in del_from_avail_list */
+		if (atomic_long_try_cmpxchg(&si->inuse_pages, &pages,
+					    pages | SWAP_USAGE_OFFLIST_BIT))
+			goto skip;
+	}
+
+	for_each_node(nid)
+		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
+
+skip:
 	spin_unlock(&swap_avail_lock);
 }
 
-static void swap_range_alloc(struct swap_info_struct *si,
-			     unsigned int nr_entries)
+/*
+ * swap_usage_add / swap_usage_sub of each slot are serialized by ci->lock
+ * within each cluster, so the total contribution to the global counter should
+ * always be positive and cannot exceed the total number of usable slots.
+ */
+static bool swap_usage_add(struct swap_info_struct *si, unsigned int nr_entries)
 {
-	WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
-	if (si->inuse_pages == si->pages) {
-		del_from_avail_list(si);
+	long val = atomic_long_add_return_relaxed(nr_entries, &si->inuse_pages);
 
-		if (vm_swap_full())
-			schedule_work(&si->reclaim_work);
+	/*
+	 * If device is full, and SWAP_USAGE_OFFLIST_BIT is not set,
+	 * remove it from the plist.
+	 */
+	if (unlikely(val == si->pages)) {
+		del_from_avail_list(si, false);
+		return true;
 	}
+
+	return false;
 }
 
-static void add_to_avail_list(struct swap_info_struct *si)
+static void swap_usage_sub(struct swap_info_struct *si, unsigned int nr_entries)
 {
-	int nid;
+	long val = atomic_long_sub_return_relaxed(nr_entries, &si->inuse_pages);
 
-	spin_lock(&swap_avail_lock);
-	for_each_node(nid)
-		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
-	spin_unlock(&swap_avail_lock);
+	/*
+	 * If device is not full, and SWAP_USAGE_OFFLIST_BIT is set,
+	 * add it back to the plist.
+	 */
+	if (unlikely(val & SWAP_USAGE_OFFLIST_BIT))
+		add_to_avail_list(si, false);
+}
+
+static void swap_range_alloc(struct swap_info_struct *si,
+			     unsigned int nr_entries)
+{
+	if (swap_usage_add(si, nr_entries)) {
+		if (vm_swap_full())
+			schedule_work(&si->reclaim_work);
+	}
 }
 
 static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
@@ -925,8 +1031,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	for (i = 0; i < nr_entries; i++)
 		clear_bit(offset + i, si->zeromap);
 
-	if (si->inuse_pages == si->pages)
-		add_to_avail_list(si);
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
 			si->bdev->bd_disk->fops->swap_slot_free_notify;
@@ -946,7 +1050,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	 */
 	smp_wmb();
 	atomic_long_add(nr_entries, &nr_swap_pages);
-	WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
+	swap_usage_sub(si, nr_entries);
 }
 
 static int cluster_alloc_swap(struct swap_info_struct *si,
@@ -1036,19 +1140,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
-		if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
-			spin_lock(&swap_avail_lock);
-			if (plist_node_empty(&si->avail_lists[node])) {
-				spin_unlock(&si->lock);
-				goto nextsi;
-			}
-			WARN(!(si->flags & SWP_WRITEOK),
-			     "swap_info %d in list but !SWP_WRITEOK\n",
-			     si->type);
-			__del_from_avail_list(si);
-			spin_unlock(&si->lock);
-			goto nextsi;
-		}
 		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 					    n_goal, swp_entries, order);
 		spin_unlock(&si->lock);
@@ -1057,7 +1148,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		cond_resched();
 
 		spin_lock(&swap_avail_lock);
-nextsi:
 		/*
 		 * if we got here, it's likely that si was almost full before,
 		 * and since scan_swap_map_slots() can drop the si->lock,
@@ -1789,7 +1879,7 @@ unsigned int count_swap_pages(int type, int free)
 		if (sis->flags & SWP_WRITEOK) {
 			n = sis->pages;
 			if (free)
-				n -= sis->inuse_pages;
+				n -= swap_usage_in_pages(sis);
 		}
 		spin_unlock(&sis->lock);
 	}
@@ -2124,7 +2214,7 @@ static int try_to_unuse(unsigned int type)
 	swp_entry_t entry;
 	unsigned int i;
 
-	if (!READ_ONCE(si->inuse_pages))
+	if (!swap_usage_in_pages(si))
 		goto success;
 
 retry:
@@ -2137,7 +2227,7 @@ static int try_to_unuse(unsigned int type)
 
 	spin_lock(&mmlist_lock);
 	p = &init_mm.mmlist;
-	while (READ_ONCE(si->inuse_pages) &&
+	while (swap_usage_in_pages(si) &&
 	       !signal_pending(current) &&
 	       (p = p->next) != &init_mm.mmlist) {
 
@@ -2165,7 +2255,7 @@ static int try_to_unuse(unsigned int type)
 	mmput(prev_mm);
 
 	i = 0;
-	while (READ_ONCE(si->inuse_pages) &&
+	while (swap_usage_in_pages(si) &&
 	       !signal_pending(current) &&
 	       (i = find_next_to_unuse(si, i)) != 0) {
 
@@ -2200,7 +2290,7 @@ static int try_to_unuse(unsigned int type)
 	 * folio_alloc_swap(), temporarily hiding that swap. It's easy
 	 * and robust (though cpu-intensive) just to keep retrying.
 	 */
-	if (READ_ONCE(si->inuse_pages)) {
+	if (swap_usage_in_pages(si)) {
 		if (!signal_pending(current))
 			goto retry;
 		return -EINTR;
@@ -2227,7 +2317,7 @@ static void drain_mmlist(void)
 	unsigned int type;
 
 	for (type = 0; type < nr_swapfiles; type++)
-		if (swap_info[type]->inuse_pages)
+		if (swap_usage_in_pages(swap_info[type]))
 			return;
 	spin_lock(&mmlist_lock);
 	list_for_each_safe(p, next, &init_mm.mmlist)
@@ -2406,7 +2496,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
 
 static void _enable_swap_info(struct swap_info_struct *si)
 {
-	si->flags |= SWP_WRITEOK;
 	atomic_long_add(si->pages, &nr_swap_pages);
 	total_swap_pages += si->pages;
 
@@ -2423,9 +2512,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 	 */
 	plist_add(&si->list, &swap_active_head);
 
-	/* add to available list if swap device is not full */
-	if (si->inuse_pages < si->pages)
-		add_to_avail_list(si);
+	/* Add back to available list */
+	add_to_avail_list(si, true);
 }
 
 static void enable_swap_info(struct swap_info_struct *si, int prio,
@@ -2523,7 +2611,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		goto out_dput;
 	}
 	spin_lock(&p->lock);
-	del_from_avail_list(p);
+	del_from_avail_list(p, true);
 	if (p->prio < 0) {
 		struct swap_info_struct *si = p;
 		int nid;
@@ -2541,7 +2629,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	plist_del(&p->list, &swap_active_head);
 	atomic_long_sub(p->pages, &nr_swap_pages);
 	total_swap_pages -= p->pages;
-	p->flags &= ~SWP_WRITEOK;
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 
@@ -2721,7 +2808,7 @@ static int swap_show(struct seq_file *swap, void *v)
 	}
 
 	bytes = K(si->pages);
-	inuse = K(READ_ONCE(si->inuse_pages));
+	inuse = K(swap_usage_in_pages(si));
 
 	file = si->swap_file;
 	len = seq_file_path(swap, file, " \t\n\\");
@@ -2838,6 +2925,7 @@ static struct swap_info_struct *alloc_swap_info(void)
 	}
 	spin_lock_init(&p->lock);
 	spin_lock_init(&p->cont_lock);
+	atomic_long_set(&p->inuse_pages, SWAP_USAGE_OFFLIST_BIT);
 	init_completion(&p->comp);
 
 	return p;
@@ -3335,7 +3423,7 @@ void si_swapinfo(struct sysinfo *val)
 		struct swap_info_struct *si = swap_info[type];
 
 		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
-			nr_to_be_unused += READ_ONCE(si->inuse_pages);
+			nr_to_be_unused += swap_usage_in_pages(si);
 	}
 	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
 	val->totalswap = total_swap_pages + nr_to_be_unused;
-- 
2.47.1

From nobody Thu Dec 18 10:00:10 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 07/13] mm, swap: hold a reference during scan and cleanup flag usage
Date: Tue, 14 Jan 2025 01:57:26 +0800
Message-ID: <20250113175732.48099-8-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

The flag SWP_SCANNING was used as an indicator of whether a device is
being scanned for allocation, and it prevented swapoff. Combined with
SWP_WRITEOK, the two flags work as a set of barriers for a clean swapoff:

1. Swapoff clears SWP_WRITEOK. Allocation requests will see
   ~SWP_WRITEOK and abort, as this is serialized by si->lock.
2. Swapoff unuses all allocated entries.
3. Swapoff waits for the SWP_SCANNING flag to be cleared, so ongoing
   allocations will stop, preventing UAF.
4. Now swapoff can free everything safely.

This makes the allocation path have a hard dependency on si->lock:
allocation always has to acquire si->lock first to set SWP_SCANNING
and check SWP_WRITEOK.

This commit removes the flag and uses the existing per-CPU refcount
instead to prevent UAF in step 3. The refcount serves this usage well
without depending on si->lock, and it scales very well too. Just hold a
reference during the whole scan and allocation process. Swapoff will
kill and wait for the counter.

To prevent any allocation from happening after step 1, so that the
unuse in step 2 can ensure all slots are free, swapoff will acquire the
ci->lock of each cluster one by one to ensure all allocations see
~SWP_WRITEOK and abort.

This way these dependencies on si->lock are gone. It is worth noting
that we can't kill the refcount as the first step of swapoff, because
the unuse process has to acquire the refcount.

Signed-off-by: Kairui Song
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 90 ++++++++++++++++++++++++++++----------------
 2 files changed, 57 insertions(+), 34 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e1eeea6307cd..02120f1005d5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -219,7 +219,6 @@ enum {
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 91faf2073006..3898576f947a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -658,6 +658,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 {
 	unsigned int nr_pages = 1 << order;
 
+	lockdep_assert_held(&ci->lock);
+
 	if (!(si->flags & SWP_WRITEOK))
 		return false;
 
@@ -1059,8 +1061,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 {
 	int n_ret = 0;
 
-	si->flags += SWP_SCANNING;
-
 	while (n_ret < nr) {
 		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
 
@@ -1069,8 +1069,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 		slots[n_ret++] = swp_entry(si->type, offset);
 	}
 
-	si->flags -= SWP_SCANNING;
-
 	return n_ret;
 }
 
@@ -1112,6 +1110,22 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return cluster_alloc_swap(si, usage, nr, slots, order);
 }
 
+static bool get_swap_device_info(struct swap_info_struct *si)
+{
+	if (!percpu_ref_tryget_live(&si->users))
+		return false;
+	/*
+	 * Guarantee the si->users are checked before accessing other
+	 * fields of swap_info_struct, and si->flags (SWP_WRITEOK) is
+	 * up to date.
+	 *
+	 * Paired with the spin_unlock() after setup_swap_info() in
+	 * enable_swap_info(), and smp_wmb() in swapoff.
+	 */
+	smp_rmb();
+	return true;
+}
+
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
 	int order = swap_entry_order(entry_order);
@@ -1139,13 +1153,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		/* requeue si to after same-priority siblings */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
-		spin_lock(&si->lock);
-		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					    n_goal, swp_entries, order);
-		spin_unlock(&si->lock);
-		if (n_ret || size > 1)
-			goto check_out;
-		cond_resched();
+		if (get_swap_device_info(si)) {
+			spin_lock(&si->lock);
+			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+						    n_goal, swp_entries, order);
+			spin_unlock(&si->lock);
+			put_swap_device(si);
+			if (n_ret || size > 1)
+				goto check_out;
+			cond_resched();
+		}
 
 		spin_lock(&swap_avail_lock);
 		/*
@@ -1296,16 +1313,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	si = swp_swap_info(entry);
 	if (!si)
 		goto bad_nofile;
-	if (!percpu_ref_tryget_live(&si->users))
+	if (!get_swap_device_info(si))
 		goto out;
-	/*
-	 * Guarantee the si->users are checked before accessing other
-	 * fields of swap_info_struct.
-	 *
-	 * Paired with the spin_unlock() after setup_swap_info() in
-	 * enable_swap_info().
-	 */
-	smp_rmb();
 	offset = swp_offset(entry);
 	if (offset >= si->max)
 		goto put_out;
@@ -1785,10 +1794,13 @@ swp_entry_t get_swap_page_of_type(int type)
 		goto fail;
 
 	/* This is called for allocating swap entry, not cache */
-	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-		atomic_long_dec(&nr_swap_pages);
-	spin_unlock(&si->lock);
+	if (get_swap_device_info(si)) {
+		spin_lock(&si->lock);
+		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
+			atomic_long_dec(&nr_swap_pages);
+		spin_unlock(&si->lock);
+		put_swap_device(si);
+	}
 fail:
 	return entry;
 }
@@ -2562,6 +2574,25 @@ bool has_usable_swap(void)
 	return ret;
 }
 
+/*
+ * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
+ * see the updated flags, so there will be no more allocations.
+ */
+static void wait_for_allocation(struct swap_info_struct *si)
+{
+	unsigned long offset;
+	unsigned long end = ALIGN(si->max, SWAPFILE_CLUSTER);
+	struct swap_cluster_info *ci;
+
+	BUG_ON(si->flags & SWP_WRITEOK);
+
+	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
+		ci = lock_cluster(si, offset);
+		unlock_cluster(ci);
+		offset += SWAPFILE_CLUSTER;
+	}
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
@@ -2632,6 +2663,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 
+	wait_for_allocation(p);
+
 	disable_swap_slots_cache_lock();
 
 	set_current_oom_origin();
@@ -2674,15 +2707,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&p->lock);
 	drain_mmlist();
 
-	/* wait for anyone still in scan_swap_map_slots */
-	while (p->flags >= SWP_SCANNING) {
-		spin_unlock(&p->lock);
-		spin_unlock(&swap_lock);
-		schedule_timeout_uninterruptible(1);
-		spin_lock(&swap_lock);
-		spin_lock(&p->lock);
-	}
-
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
 	p->max = 0;
-- 
2.47.1

From nobody Thu Dec 18 10:00:10 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes
Date: Tue, 14 Jan 2025 01:57:27 +0800
Message-ID: <20250113175732.48099-9-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

Currently, we are only using flags to indicate which list the cluster
is on. Using one bit for each list type might be a waste: as the number
of list types grows, we will consume too many bits. Additionally, the
current mixed usage of '&' and '==' is a bit confusing.

Make it clean by using an enum to define all possible cluster
statuses. Only an off-list cluster will have the NONE (0) flag. Also
use a wrapper to annotate and sanitize all flag settings and list
movements.
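For illustration only (this is not part of the patch): a small user-space sketch of the "enum state plus a single move wrapper" pattern described above. The names (enum cluster_state, struct pool, cluster_move, cluster_is_usable) are hypothetical, and per-state counters stand in for the real list heads, so this only shows how one wrapper can keep the state and the list accounting in step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the enum idea; values are states, not bit flags. */
enum cluster_state {
	CLUSTER_NONE = 0,		/* temporarily off-list */
	CLUSTER_FREE,
	CLUSTER_NONFULL,
	CLUSTER_FRAG,
	CLUSTER_USABLE = CLUSTER_FRAG,	/* states up to here are allocatable */
	CLUSTER_FULL,
	CLUSTER_MAX,
};

struct cluster {
	enum cluster_state state;
};

/* Per-state membership counts stand in for the real list heads. */
struct pool {
	unsigned int nr[CLUSTER_MAX];
};

/* The only place that changes a cluster's state and list membership. */
static void cluster_move(struct pool *p, struct cluster *c,
			 enum cluster_state new_state)
{
	assert(new_state > CLUSTER_NONE && new_state < CLUSTER_MAX);
	assert(c->state != new_state);	/* moving to the same list is a bug */

	if (c->state != CLUSTER_NONE)	/* leave the old list, if any */
		p->nr[c->state]--;
	p->nr[new_state]++;
	c->state = new_state;
}

/* With ordered enum values, "allocatable" is a simple range check. */
static bool cluster_is_usable(const struct cluster *c)
{
	return c->state != CLUSTER_NONE && c->state <= CLUSTER_USABLE;
}
```

Because every transition goes through cluster_move(), comparisons elsewhere can use plain equality or ordering on the state, which is the readability point the commit message makes about mixing '&' and '=='.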
Suggested-by: Chris Li
Signed-off-by: Kairui Song
---
 include/linux/swap.h | 17 +++++++---
 mm/swapfile.c        | 76 +++++++++++++++++++++++---------------------
 2 files changed, 53 insertions(+), 40 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 02120f1005d5..339d7f0192ff 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -257,10 +257,19 @@ struct swap_cluster_info {
 	u8 order;
 	struct list_head list;
 };
-#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
-#define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */
-#define CLUSTER_FLAG_FULL 8 /* This cluster is on full list */
+
+/* All on-list cluster must have a non-zero flag. */
+enum swap_cluster_flags {
+	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
+	CLUSTER_FLAG_FREE,
+	CLUSTER_FLAG_NONFULL,
+	CLUSTER_FLAG_FRAG,
+	/* Clusters with flags above are allocatable */
+	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
+	CLUSTER_FLAG_FULL,
+	CLUSTER_FLAG_DISCARD,
+	CLUSTER_FLAG_MAX,
+};
 
 /*
  * The first page in the swap file is the swap header, which is always marked
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3898576f947a..b754c9e16c3b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -403,7 +403,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_FREE;
+	return info->flags == CLUSTER_FLAG_FREE;
 }
 
 static inline unsigned int cluster_index(struct swap_info_struct *si,
@@ -434,6 +434,28 @@ static inline void unlock_cluster(struct swap_cluster_info *ci)
 	spin_unlock(&ci->lock);
 }
 
+static void move_cluster(struct swap_info_struct *si,
+			 struct swap_cluster_info *ci, struct list_head *list,
+			 enum swap_cluster_flags new_flags)
+{
+	VM_WARN_ON(ci->flags == new_flags);
+
+	BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
+
+	if (ci->flags == CLUSTER_FLAG_NONE) {
+		list_add_tail(&ci->list, list);
+	} else {
+		if (ci->flags == CLUSTER_FLAG_FRAG) {
+			VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
+			si->frag_cluster_nr[ci->order]--;
+		}
+		list_move_tail(&ci->list, list);
+	}
+	ci->flags = new_flags;
+	if (new_flags == CLUSTER_FLAG_FRAG)
+		si->frag_cluster_nr[ci->order]++;
+}
+
 /* Add a cluster to discard list and schedule it to do discard */
 static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 		struct swap_cluster_info *ci)
@@ -447,10 +469,8 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	 */
 	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
-
-	VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
-	list_move_tail(&ci->list, &si->discard_clusters);
-	ci->flags = 0;
+	VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE);
+	move_cluster(si, ci, &si->discard_clusters, CLUSTER_FLAG_DISCARD);
 	schedule_work(&si->discard_work);
 }
 
@@ -458,12 +478,7 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
 {
 	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
-
-	if (ci->flags)
-		list_move_tail(&ci->list, &si->free_clusters);
-	else
-		list_add_tail(&ci->list, &si->free_clusters);
-	ci->flags = CLUSTER_FLAG_FREE;
+	move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
 }
 
@@ -479,6 +494,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 	while (!list_empty(&si->discard_clusters)) {
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
 		list_del(&ci->list);
+		/* Must clear flag when taking a cluster off-list */
+		ci->flags = CLUSTER_FLAG_NONE;
 		idx = cluster_index(si, ci);
 		spin_unlock(&si->lock);
 
@@ -519,9 +536,6 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
 	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
 
-	if (ci->flags & CLUSTER_FLAG_FRAG)
-		si->frag_cluster_nr[ci->order]--;
-
 	/*
 	 * If the swap is discardable, prepare discard the cluster
 	 * instead of free it immediately. The cluster will be freed
@@ -573,13 +587,9 @@ static void dec_cluster_info_page(struct swap_info_struct *si,
 		return;
 	}
 
-	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
-		VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
-		if (ci->flags & CLUSTER_FLAG_FRAG)
-			si->frag_cluster_nr[ci->order]--;
-		list_move_tail(&ci->list, &si->nonfull_clusters[ci->order]);
-		ci->flags = CLUSTER_FLAG_NONFULL;
-	}
+	if (ci->flags != CLUSTER_FLAG_NONFULL)
+		move_cluster(si, ci, &si->nonfull_clusters[ci->order],
+			     CLUSTER_FLAG_NONFULL);
 }
 
 static bool cluster_reclaim_range(struct swap_info_struct *si,
@@ -663,11 +673,13 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	if (!(si->flags & SWP_WRITEOK))
 		return false;
 
+	VM_BUG_ON(ci->flags == CLUSTER_FLAG_NONE);
+	VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE);
+
 	if (cluster_is_free(ci)) {
-		if (nr_pages < SWAPFILE_CLUSTER) {
-			list_move_tail(&ci->list, &si->nonfull_clusters[order]);
-			ci->flags = CLUSTER_FLAG_NONFULL;
-		}
+		if (nr_pages < SWAPFILE_CLUSTER)
+			move_cluster(si, ci, &si->nonfull_clusters[order],
+				     CLUSTER_FLAG_NONFULL);
 		ci->order = order;
 	}
 
@@ -675,14 +687,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
-	if (ci->count == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!(ci->flags &
-			  (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
-		if (ci->flags & CLUSTER_FLAG_FRAG)
-			si->frag_cluster_nr[ci->order]--;
-		list_move_tail(&ci->list, &si->full_clusters);
-		ci->flags = CLUSTER_FLAG_FULL;
-	}
+	if (ci->count == SWAPFILE_CLUSTER)
+		move_cluster(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL);
 
 	return true;
 }
@@ -821,9 +827,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		while (!list_empty(&si->nonfull_clusters[order])) {
 			ci = list_first_entry(&si->nonfull_clusters[order],
 					      struct swap_cluster_info, list);
-			list_move_tail(&ci->list, &si->frag_clusters[order]);
-			ci->flags = CLUSTER_FLAG_FRAG;
-			si->frag_cluster_nr[order]++;
+			move_cluster(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
 							 &found, order, usage);
 			frags++;
-- 
2.47.1

From nobody Thu Dec 18 10:00:10 2025
eYE3lR4VOGL8JsZ/q6JySe54kGA9F+kETQOnaKMIAZKvAJu25e1Ai4ZQULjJcJU9Ev3DbLXbkkt vl+MYIQxB+m6pmdPklLf0GEeNwwGbqY/AUvvmYlNxHSrYcvjRypZi39NaFA+pZE3DFo9xXrdTfi ie5ko6UwQ6F/+YQEAkASUz3+Mmt/5/PpkVK7XX0/n+l/RDmykrKk6QQ1tav/Rp6CVCSFjMzIUQM /w== X-Google-Smtp-Source: AGHT+IFmnF/a9IwBa/i3UBvbehPRpoxJTxM4S0nHx+6HfhdZT9vedpg6yyvBXC+d6d5PBeQ2RciM/Q== X-Received: by 2002:a17:902:ecc5:b0:215:bf1b:a894 with SMTP id d9443c01a7336-21a83f76704mr333067675ad.24.1736791225664; Mon, 13 Jan 2025 10:00:25 -0800 (PST) Received: from KASONG-MC4.tencent.com ([115.171.41.132]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-21a9f21aba7sm57023635ad.113.2025.01.13.10.00.22 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Mon, 13 Jan 2025 10:00:25 -0800 (PST) From: Kairui Song To: linux-mm@kvack.org Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Baoquan He , Nhat Pham , Johannes Weiner , Kalesh Singh , linux-kernel@vger.kernel.org, Kairui Song Subject: [PATCH v4 09/13] mm, swap: reduce contention on device lock Date: Tue, 14 Jan 2025 01:57:28 +0800 Message-ID: <20250113175732.48099-10-ryncsn@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com> References: <20250113175732.48099-1-ryncsn@gmail.com> Reply-To: Kairui Song Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Kairui Song Currently, swap locking is mainly composed of two locks: the cluster lock (ci->lock) and the device lock (si->lock). The cluster lock is much more fine-grained, so it is best to use ci->lock instead of si->lock as much as possible. We have cleaned up other hard dependencies on si->lock. Following the new cluster allocator design, most operations don't need to touch si->lock at all. 
In practice, we only need to take si->lock when moving clusters between lists.

To achieve this, this commit reworks the locking pattern of all si->lock and ci->lock users, eliminates all usage of ci->lock inside si->lock, and introduces a new design to avoid touching si->lock unless needed.

For minimal contention and easier understanding of the system, two ideas are introduced with the corresponding helpers: isolation and relocation.

- Clusters will be `isolated` from the list when iterating the list to search for an allocatable cluster.

  This ensures other CPUs won't easily walk into the same cluster, and it releases si->lock after acquiring ci->lock, providing the only place that handles the inversion of the two locks, and avoids contention.

  Iterating the cluster list almost always moves the cluster (free -> nonfull, nonfull -> frag, frag -> frag tail), but it doesn't know where the cluster should be moved to until scanning is done. So keeping the cluster off-list is a good option with low overhead.

  The off-list time window of a cluster is also minimal. In the worst case, one CPU will return the cluster after scanning the 512 entries on it, a period during which other CPUs previously had to busy-wait on a spin lock.

  This is done with the new helper `isolate_lock_cluster`.

- Clusters will be `relocated` after allocation or freeing, according to their usage count and status.

  Allocations no longer hold si->lock, and may drop ci->lock for reclaim, so the cluster could be moved to any location while no lock is held. Besides, isolation clears all flags when it takes the cluster off the list (the flags must be in sync with the list status, so cluster users don't need to touch si->lock to check its list status). So the cluster has to be relocated to the right list according to its usage after allocation or freeing.

  Relocation is optional: if the cluster flags indicate it's already on the right list, it will skip touching the list or si->lock.
  This is done with `relocate_cluster` after allocation or with `[partial_]free_cluster` after freeing.

This handles the usage of all kinds of clusters in a clean way.

Scanning and allocation by iterating the cluster list is handled by "isolate -> scan/allocate -> relocate".

Scanning and allocation of per-CPU clusters will only involve "scan/allocate -> relocate", as it knows which cluster to lock and use.

Freeing will only involve "relocate".

Each CPU will keep using its per-CPU cluster until the 512 entries are all consumed. Freeing also has to free 512 entries to trigger cluster movement in the best case, so si->lock is rarely touched.

Testing with building the Linux kernel with defconfig showed huge improvement:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C:
Before:
Sys time: 73578.30, Real time: 864.05
After: (-50.7% sys time, -44.8% real time)
Sys time: 36227.49, Real time: 476.66

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C:
(avg of 4 test runs)
Before:
Sys time: 74044.85, Real time: 846.51
hugepages-64kB/stats/swpout: 1735216
hugepages-64kB/stats/swpout_fallback: 430333
After: (-40.4% sys time, -37.1% real time)
Sys time: 44160.56, Real time: 532.07
hugepages-64kB/stats/swpout: 1786288
hugepages-64kB/stats/swpout_fallback: 243384

time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62:
Before:
Sys time: 8098.21, Real time: 401.3
After: (-22.6% sys time, -12.8% real time)
Sys time: 6265.02, Real time: 349.83

The allocation success rate also improved slightly, as we sanitized the usage of clusters with the newly defined helpers; previously, dropping si->lock or ci->lock during a scan would shuffle the cluster order.
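For readers who want to see the isolation and relocation ideas outside the kernel, here is a minimal userspace sketch. It is not code from this patch: every `toy_*` name is hypothetical, the spin lock is a plain C11 `atomic_flag`, the lists are singly linked, and the frag and discard lists are left out, so a partly used cluster simply goes to a "nonfull" list. It only tries to show the lock ordering described above: isolation takes the device lock and then try-locks a cluster, while relocation runs with the cluster lock held and takes the device lock only when the cluster is on the wrong list.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; every toy_* name is hypothetical, not a kernel API. */
enum toy_flag { TOY_NONE, TOY_FREE, TOY_NONFULL, TOY_FULL, TOY_NR_FLAGS };
#define TOY_SLOTS 4 /* stands in for the 512 entries of a real cluster */

struct toy_lock { atomic_flag held; };

struct toy_cluster {
    struct toy_lock lock;     /* plays the role of ci->lock */
    struct toy_cluster *next; /* list linkage, protected by dev->lock */
    enum toy_flag flags;      /* list the cluster is on; TOY_NONE = off-list */
    unsigned int count;       /* slots in use, protected by lock */
};

struct toy_device {
    struct toy_lock lock;                    /* plays the role of si->lock */
    struct toy_cluster *lists[TOY_NR_FLAGS]; /* one list per flag value */
};

static bool toy_trylock(struct toy_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_lock_acquire(struct toy_lock *l)
{
    while (!toy_trylock(l))
        ; /* busy wait, like a spin lock */
}

static void toy_unlock(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Unlink ci from the list named by ci->flags. Caller holds dev->lock. */
static void toy_list_del(struct toy_device *dev, struct toy_cluster *ci)
{
    struct toy_cluster **pp = &dev->lists[ci->flags];

    while (*pp && *pp != ci)
        pp = &(*pp)->next;
    if (*pp)
        *pp = ci->next;
    ci->next = NULL;
}

/* Append ci to the tail of list `to`. Caller holds dev->lock. */
static void toy_list_add_tail(struct toy_device *dev, struct toy_cluster *ci,
                              enum toy_flag to)
{
    struct toy_cluster **pp = &dev->lists[to];

    while (*pp)
        pp = &(*pp)->next;
    ci->next = NULL;
    *pp = ci;
}

/* Put n empty clusters on the free list. */
static void toy_init(struct toy_device *dev, struct toy_cluster *cl, size_t n)
{
    atomic_flag_clear(&dev->lock.held);
    for (int f = 0; f < TOY_NR_FLAGS; f++)
        dev->lists[f] = NULL;
    for (size_t i = 0; i < n; i++) {
        atomic_flag_clear(&cl[i].lock.held);
        cl[i].count = 0;
        cl[i].flags = TOY_FREE;
        toy_list_add_tail(dev, &cl[i], TOY_FREE);
    }
}

/*
 * Isolation: take the first cluster on `from` whose lock is not contended,
 * remove it from the list and clear its flag. Returns it locked, or NULL.
 */
static struct toy_cluster *toy_isolate_lock(struct toy_device *dev,
                                            enum toy_flag from)
{
    struct toy_cluster *ci;

    toy_lock_acquire(&dev->lock);
    for (ci = dev->lists[from]; ci; ci = ci->next) {
        if (!toy_trylock(&ci->lock))
            continue; /* someone else is using it, skip */
        toy_list_del(dev, ci);
        ci->flags = TOY_NONE;
        break;
    }
    toy_unlock(&dev->lock);
    return ci;
}

/*
 * Relocation: caller holds ci->lock. The device lock is only taken when
 * the flag says the cluster is not already on the right list.
 */
static void toy_relocate(struct toy_device *dev, struct toy_cluster *ci)
{
    enum toy_flag want = ci->count == 0 ? TOY_FREE :
                         ci->count == TOY_SLOTS ? TOY_FULL : TOY_NONFULL;

    if (ci->flags == want)
        return;
    toy_lock_acquire(&dev->lock);
    if (ci->flags != TOY_NONE)
        toy_list_del(dev, ci);
    toy_list_add_tail(dev, ci, want);
    toy_unlock(&dev->lock);
    ci->flags = want;
}
```

A caller in this toy model would isolate a cluster, update `count` while holding only the cluster lock, call `toy_relocate`, and then release the cluster lock with `toy_unlock`. The device lock is held only for the short list operations, which is the property the commit message argues for.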
Suggested-by: Chris Li Signed-off-by: Kairui Song --- include/linux/swap.h | 3 +- mm/swapfile.c | 432 ++++++++++++++++++++++++------------------- 2 files changed, 247 insertions(+), 188 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 339d7f0192ff..c4ff31cb6bde 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -291,6 +291,7 @@ enum swap_cluster_flags { * throughput. */ struct percpu_cluster { + local_lock_t lock; /* Protect the percpu_cluster above */ unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ }; =20 @@ -313,7 +314,7 @@ struct swap_info_struct { /* list of cluster that contains at least one free slot */ struct list_head frag_clusters[SWAP_NR_ORDERS]; /* list of cluster that are fragmented or contented */ - unsigned int frag_cluster_nr[SWAP_NR_ORDERS]; + atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS]; unsigned int pages; /* total of usable pages of swap */ atomic_long_t inuse_pages; /* number of those currently in use */ struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap locatio= n */ diff --git a/mm/swapfile.c b/mm/swapfile.c index b754c9e16c3b..489ac6997a0c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -261,12 +261,10 @@ static int __try_to_reclaim_swap(struct swap_info_str= uct *si, folio_ref_sub(folio, nr_pages); folio_set_dirty(folio); =20 - spin_lock(&si->lock); /* Only sinple page folio can be backed by zswap */ if (nr_pages =3D=3D 1) zswap_invalidate(entry); swap_entry_range_free(si, entry, nr_pages); - spin_unlock(&si->lock); ret =3D nr_pages; out_unlock: folio_unlock(folio); @@ -401,9 +399,23 @@ static void discard_swap_cluster(struct swap_info_stru= ct *si, #endif #define LATENCY_LIMIT 256 =20 -static inline bool cluster_is_free(struct swap_cluster_info *info) +static inline bool cluster_is_empty(struct swap_cluster_info *info) +{ + return info->count =3D=3D 0; +} + +static inline bool cluster_is_discard(struct swap_cluster_info *info) +{ + return info->flags =3D=3D 
CLUSTER_FLAG_DISCARD; +} + +static inline bool cluster_is_usable(struct swap_cluster_info *ci, int ord= er) { - return info->flags =3D=3D CLUSTER_FLAG_FREE; + if (unlikely(ci->flags > CLUSTER_FLAG_USABLE)) + return false; + if (!order) + return true; + return cluster_is_empty(ci) || order =3D=3D ci->order; } =20 static inline unsigned int cluster_index(struct swap_info_struct *si, @@ -441,19 +453,20 @@ static void move_cluster(struct swap_info_struct *si, VM_WARN_ON(ci->flags =3D=3D new_flags); =20 BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX); + lockdep_assert_held(&ci->lock); =20 - if (ci->flags =3D=3D CLUSTER_FLAG_NONE) { + spin_lock(&si->lock); + if (ci->flags =3D=3D CLUSTER_FLAG_NONE) list_add_tail(&ci->list, list); - } else { - if (ci->flags =3D=3D CLUSTER_FLAG_FRAG) { - VM_WARN_ON(!si->frag_cluster_nr[ci->order]); - si->frag_cluster_nr[ci->order]--; - } + else list_move_tail(&ci->list, list); - } + spin_unlock(&si->lock); + + if (ci->flags =3D=3D CLUSTER_FLAG_FRAG) + atomic_long_dec(&si->frag_cluster_nr[ci->order]); + else if (new_flags =3D=3D CLUSTER_FLAG_FRAG) + atomic_long_inc(&si->frag_cluster_nr[ci->order]); ci->flags =3D new_flags; - if (new_flags =3D=3D CLUSTER_FLAG_FRAG) - si->frag_cluster_nr[ci->order]++; } =20 /* Add a cluster to discard list and schedule it to do discard */ @@ -476,39 +489,91 @@ static void swap_cluster_schedule_discard(struct swap= _info_struct *si, =20 static void __free_cluster(struct swap_info_struct *si, struct swap_cluste= r_info *ci) { - lockdep_assert_held(&si->lock); lockdep_assert_held(&ci->lock); move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE); ci->order =3D 0; } =20 +/* + * Isolate and lock the first cluster that is not contented on a list, + * clean its flag before taken off-list. Cluster flag must be in sync + * with list status, so cluster updaters can always know the cluster + * list status without touching si lock. 
+ * + * Note it's possible that all clusters on a list are contented so + * this returns NULL for an non-empty list. + */ +static struct swap_cluster_info *isolate_lock_cluster( + struct swap_info_struct *si, struct list_head *list) +{ + struct swap_cluster_info *ci, *ret =3D NULL; + + spin_lock(&si->lock); + + if (unlikely(!(si->flags & SWP_WRITEOK))) + goto out; + + list_for_each_entry(ci, list, list) { + if (!spin_trylock(&ci->lock)) + continue; + + /* We may only isolate and clear flags of following lists */ + VM_BUG_ON(!ci->flags); + VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE && + ci->flags !=3D CLUSTER_FLAG_FULL); + + list_del(&ci->list); + ci->flags =3D CLUSTER_FLAG_NONE; + ret =3D ci; + break; + } +out: + spin_unlock(&si->lock); + + return ret; +} + /* * Doing discard actually. After a cluster discard is finished, the cluster - * will be added to free cluster list. caller should hold si->lock. -*/ -static void swap_do_scheduled_discard(struct swap_info_struct *si) + * will be added to free cluster list. Discard cluster is a bit special as + * they don't participate in allocation or reclaim, so clusters marked as + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list. + */ +static bool swap_do_scheduled_discard(struct swap_info_struct *si) { struct swap_cluster_info *ci; + bool ret =3D false; unsigned int idx; =20 + spin_lock(&si->lock); while (!list_empty(&si->discard_clusters)) { ci =3D list_first_entry(&si->discard_clusters, struct swap_cluster_info,= list); + /* + * Delete the cluster from list to prepare for discard, but keep + * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster + * pointing to it, or ran into by relocate_cluster. 
+ */ list_del(&ci->list); - /* Must clear flag when taking a cluster off-list */ - ci->flags =3D CLUSTER_FLAG_NONE; idx =3D cluster_index(si, ci); spin_unlock(&si->lock); - discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER); =20 - spin_lock(&si->lock); spin_lock(&ci->lock); - __free_cluster(si, ci); + /* + * Discard is done, clear its flags as it's off-list, then + * return the cluster to allocation list. + */ + ci->flags =3D CLUSTER_FLAG_NONE; memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER); + __free_cluster(si, ci); spin_unlock(&ci->lock); + ret =3D true; + spin_lock(&si->lock); } + spin_unlock(&si->lock); + return ret; } =20 static void swap_discard_work(struct work_struct *work) @@ -517,9 +582,7 @@ static void swap_discard_work(struct work_struct *work) =20 si =3D container_of(work, struct swap_info_struct, discard_work); =20 - spin_lock(&si->lock); swap_do_scheduled_discard(si); - spin_unlock(&si->lock); } =20 static void swap_users_ref_free(struct percpu_ref *ref) @@ -530,10 +593,14 @@ static void swap_users_ref_free(struct percpu_ref *re= f) complete(&si->comp); } =20 +/* + * Must be called after freeing if ci->count =3D=3D 0, moves the cluster t= o free + * or discard list. + */ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_= info *ci) { VM_BUG_ON(ci->count !=3D 0); - lockdep_assert_held(&si->lock); + VM_BUG_ON(ci->flags =3D=3D CLUSTER_FLAG_FREE); lockdep_assert_held(&ci->lock); =20 /* @@ -550,6 +617,48 @@ static void free_cluster(struct swap_info_struct *si, = struct swap_cluster_info * __free_cluster(si, ci); } =20 +/* + * Must be called after freeing if ci->count !=3D 0, moves the cluster to + * nonfull list. 
+ */ +static void partial_free_cluster(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + VM_BUG_ON(!ci->count || ci->count =3D=3D SWAPFILE_CLUSTER); + lockdep_assert_held(&ci->lock); + + if (ci->flags !=3D CLUSTER_FLAG_NONFULL) + move_cluster(si, ci, &si->nonfull_clusters[ci->order], + CLUSTER_FLAG_NONFULL); +} + +/* + * Must be called after allocation, moves the cluster to full or frag list. + * Note: allocation doesn't acquire si lock, and may drop the ci lock for + * reclaim, so the cluster could be any where when called. + */ +static void relocate_cluster(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + lockdep_assert_held(&ci->lock); + + /* Discard cluster must remain off-list or on discard list */ + if (cluster_is_discard(ci)) + return; + + if (!ci->count) { + free_cluster(si, ci); + } else if (ci->count !=3D SWAPFILE_CLUSTER) { + if (ci->flags !=3D CLUSTER_FLAG_FRAG) + move_cluster(si, ci, &si->frag_clusters[ci->order], + CLUSTER_FLAG_FRAG); + } else { + if (ci->flags !=3D CLUSTER_FLAG_FULL) + move_cluster(si, ci, &si->full_clusters, + CLUSTER_FLAG_FULL); + } +} + /* * The cluster corresponding to page_nr will be used. The cluster will not= be * added to free cluster list and its usage counter will be increased by 1. @@ -568,30 +677,6 @@ static void inc_cluster_info_page(struct swap_info_str= uct *si, VM_BUG_ON(ci->flags); } =20 -/* - * The cluster ci decreases @nr_pages usage. If the usage counter becomes = 0, - * which means no page in the cluster is in use, we can optionally discard - * the cluster and add it to free cluster list. 
- */ -static void dec_cluster_info_page(struct swap_info_struct *si, - struct swap_cluster_info *ci, int nr_pages) -{ - VM_BUG_ON(ci->count < nr_pages); - VM_BUG_ON(cluster_is_free(ci)); - lockdep_assert_held(&si->lock); - lockdep_assert_held(&ci->lock); - ci->count -=3D nr_pages; - - if (!ci->count) { - free_cluster(si, ci); - return; - } - - if (ci->flags !=3D CLUSTER_FLAG_NONFULL) - move_cluster(si, ci, &si->nonfull_clusters[ci->order], - CLUSTER_FLAG_NONFULL); -} - static bool cluster_reclaim_range(struct swap_info_struct *si, struct swap_cluster_info *ci, unsigned long start, unsigned long end) @@ -601,8 +686,6 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, int nr_reclaim; =20 spin_unlock(&ci->lock); - spin_unlock(&si->lock); - do { switch (READ_ONCE(map[offset])) { case 0: @@ -620,9 +703,7 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, } } while (offset < end); out: - spin_lock(&si->lock); spin_lock(&ci->lock); - /* * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. 
@@ -636,11 +717,11 @@ static bool cluster_reclaim_range(struct swap_info_st= ruct *si, =20 static bool cluster_scan_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages) + unsigned long start, unsigned int nr_pages, + bool *need_reclaim) { unsigned long offset, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - bool need_reclaim =3D false; =20 for (offset =3D start; offset < end; offset++) { switch (READ_ONCE(map[offset])) { @@ -649,16 +730,13 @@ static bool cluster_scan_range(struct swap_info_struc= t *si, case SWAP_HAS_CACHE: if (!vm_swap_full()) return false; - need_reclaim =3D true; + *need_reclaim =3D true; continue; default: return false; } } =20 - if (need_reclaim) - return cluster_reclaim_range(si, ci, start, end); - return true; } =20 @@ -673,23 +751,17 @@ static bool cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster if (!(si->flags & SWP_WRITEOK)) return false; =20 - VM_BUG_ON(ci->flags =3D=3D CLUSTER_FLAG_NONE); - VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE); - - if (cluster_is_free(ci)) { - if (nr_pages < SWAPFILE_CLUSTER) - move_cluster(si, ci, &si->nonfull_clusters[order], - CLUSTER_FLAG_NONFULL); + /* + * The first allocation in a cluster makes the + * cluster exclusive to this order + */ + if (cluster_is_empty(ci)) ci->order =3D order; - } =20 memset(si->swap_map + start, usage, nr_pages); swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; =20 - if (ci->count =3D=3D SWAPFILE_CLUSTER) - move_cluster(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL); - return true; } =20 @@ -700,37 +772,55 @@ static unsigned int alloc_swap_scan_cluster(struct sw= ap_info_struct *si, unsigne unsigned long start =3D offset & ~(SWAPFILE_CLUSTER - 1); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages =3D 1 << order; + bool need_reclaim, ret; struct swap_cluster_info *ci; =20 - if (end < nr_pages) - return SWAP_NEXT_INVALID; - end -=3D nr_pages; + 
ci =3D &si->cluster_info[offset / SWAPFILE_CLUSTER]; + lockdep_assert_held(&ci->lock); =20 - ci =3D lock_cluster(si, offset); - if (ci->count + nr_pages > SWAPFILE_CLUSTER) { + if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) { offset =3D SWAP_NEXT_INVALID; - goto done; + goto out; } =20 - while (offset <=3D end) { - if (cluster_scan_range(si, ci, offset, nr_pages)) { - if (!cluster_alloc_range(si, ci, offset, usage, order)) { - offset =3D SWAP_NEXT_INVALID; - goto done; - } - *foundp =3D offset; - if (ci->count =3D=3D SWAPFILE_CLUSTER) { + for (end -=3D nr_pages; offset <=3D end; offset +=3D nr_pages) { + need_reclaim =3D false; + if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim)) + continue; + if (need_reclaim) { + ret =3D cluster_reclaim_range(si, ci, start, end); + /* + * Reclaim drops ci->lock and cluster could be used + * by another order. Not checking flag as off-list + * cluster has no flag set, and change of list + * won't cause fragmentation. + */ + if (!cluster_is_usable(ci, order)) { offset =3D SWAP_NEXT_INVALID; - goto done; + goto out; } - offset +=3D nr_pages; - break; + if (cluster_is_empty(ci)) + offset =3D start; + /* Reclaim failed but cluster is usable, try next */ + if (!ret) + continue; + } + if (!cluster_alloc_range(si, ci, offset, usage, order)) { + offset =3D SWAP_NEXT_INVALID; + goto out; + } + *foundp =3D offset; + if (ci->count =3D=3D SWAPFILE_CLUSTER) { + offset =3D SWAP_NEXT_INVALID; + goto out; } offset +=3D nr_pages; + break; } if (offset > end) offset =3D SWAP_NEXT_INVALID; -done: +out: + relocate_cluster(si, ci); unlock_cluster(ci); return offset; } @@ -747,18 +837,17 @@ static void swap_reclaim_full_clusters(struct swap_in= fo_struct *si, bool force) if (force) to_scan =3D swap_usage_in_pages(si) / SWAPFILE_CLUSTER; =20 - while (!list_empty(&si->full_clusters)) { - ci =3D list_first_entry(&si->full_clusters, struct swap_cluster_info, li= st); - list_move_tail(&ci->list, &si->full_clusters); + while 
((ci =3D isolate_lock_cluster(si, &si->full_clusters))) { offset =3D cluster_offset(si, ci); end =3D min(si->max, offset + SWAPFILE_CLUSTER); to_scan--; =20 - spin_unlock(&si->lock); while (offset < end) { if (READ_ONCE(map[offset]) =3D=3D SWAP_HAS_CACHE) { + spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); + spin_lock(&ci->lock); if (nr_reclaim) { offset +=3D abs(nr_reclaim); continue; @@ -766,8 +855,8 @@ static void swap_reclaim_full_clusters(struct swap_info= _struct *si, bool force) } offset++; } - spin_lock(&si->lock); =20 + unlock_cluster(ci); if (to_scan <=3D 0) break; } @@ -779,9 +868,7 @@ static void swap_reclaim_work(struct work_struct *work) =20 si =3D container_of(work, struct swap_info_struct, reclaim_work); =20 - spin_lock(&si->lock); swap_reclaim_full_clusters(si, true); - spin_unlock(&si->lock); } =20 /* @@ -792,29 +879,34 @@ static void swap_reclaim_work(struct work_struct *wor= k) static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,= int order, unsigned char usage) { - struct percpu_cluster *cluster; struct swap_cluster_info *ci; unsigned int offset, found =3D 0; =20 -new_cluster: - lockdep_assert_held(&si->lock); - cluster =3D this_cpu_ptr(si->percpu_cluster); - offset =3D cluster->next[order]; + /* Fast path using per CPU cluster */ + local_lock(&si->percpu_cluster->lock); + offset =3D __this_cpu_read(si->percpu_cluster->next[order]); if (offset) { - offset =3D alloc_swap_scan_cluster(si, offset, &found, order, usage); + ci =3D lock_cluster(si, offset); + /* Cluster could have been used by another order */ + if (cluster_is_usable(ci, order)) { + if (cluster_is_empty(ci)) + offset =3D cluster_offset(si, ci); + offset =3D alloc_swap_scan_cluster(si, offset, &found, + order, usage); + } else { + unlock_cluster(ci); + } if (found) goto done; } =20 - if (!list_empty(&si->free_clusters)) { - ci =3D list_first_entry(&si->free_clusters, struct swap_cluster_info, li= st); - offset 
=3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, o= rder, usage); - /* - * Either we didn't touch the cluster due to swapoff, - * or the allocation must success. - */ - VM_BUG_ON((si->flags & SWP_WRITEOK) && !found); - goto done; +new_cluster: + ci =3D isolate_lock_cluster(si, &si->free_clusters); + if (ci) { + offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, order, usage); + if (found) + goto done; } =20 /* Try reclaim from full clusters if free clusters list is drained */ @@ -822,49 +914,42 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o swap_reclaim_full_clusters(si, false); =20 if (order < PMD_ORDER) { - unsigned int frags =3D 0; + unsigned int frags =3D 0, frags_existing; =20 - while (!list_empty(&si->nonfull_clusters[order])) { - ci =3D list_first_entry(&si->nonfull_clusters[order], - struct swap_cluster_info, list); - move_cluster(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG); + while ((ci =3D isolate_lock_cluster(si, &si->nonfull_clusters[order]))) { offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); - frags++; if (found) goto done; + /* Clusters failed to allocate are moved to frag_clusters */ + frags++; } =20 - /* - * Nonfull clusters are moved to frag tail if we reached - * here, count them too, don't over scan the frag list. - */ - while (frags < si->frag_cluster_nr[order]) { - ci =3D list_first_entry(&si->frag_clusters[order], - struct swap_cluster_info, list); + frags_existing =3D atomic_long_read(&si->frag_cluster_nr[order]); + while (frags < frags_existing && + (ci =3D isolate_lock_cluster(si, &si->frag_clusters[order]))) { + atomic_long_dec(&si->frag_cluster_nr[order]); /* - * Rotate the frag list to iterate, they were all failing - * high order allocation or moved here due to per-CPU usage, - * this help keeping usable cluster ahead. 
+ * Rotate the frag list to iterate, they were all + * failing high order allocation or moved here due to + * per-CPU usage, but they could contain newly released + * reclaimable (eg. lazy-freed swap cache) slots. */ - list_move_tail(&ci->list, &si->frag_clusters[order]); offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); - frags++; if (found) goto done; + frags++; } } =20 - if (!list_empty(&si->discard_clusters)) { - /* - * we don't have free cluster but have some clusters in - * discarding, do discard now and reclaim them, then - * reread cluster_next_cpu since we dropped si->lock - */ - swap_do_scheduled_discard(si); + /* + * We don't have free cluster but have some clusters in + * discarding, do discard now and reclaim them, then + * reread cluster_next_cpu since we dropped si->lock + */ + if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si)) goto new_cluster; - } =20 if (order) goto done; @@ -875,26 +960,25 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o * Clusters here have at least one usable slots and can't fail order 0 * allocation, but reclaim may drop si->lock and race with another user. 
*/ - while (!list_empty(&si->frag_clusters[o])) { - ci =3D list_first_entry(&si->frag_clusters[o], - struct swap_cluster_info, list); + while ((ci =3D isolate_lock_cluster(si, &si->frag_clusters[o]))) { + atomic_long_dec(&si->frag_cluster_nr[o]); offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, 0, usage); + &found, order, usage); if (found) goto done; } =20 - while (!list_empty(&si->nonfull_clusters[o])) { - ci =3D list_first_entry(&si->nonfull_clusters[o], - struct swap_cluster_info, list); + while ((ci =3D isolate_lock_cluster(si, &si->nonfull_clusters[o]))) { offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, 0, usage); + &found, order, usage); if (found) goto done; } } done: - cluster->next[order] =3D offset; + __this_cpu_write(si->percpu_cluster->next[order], offset); + local_unlock(&si->percpu_cluster->lock); + return found; } =20 @@ -1158,14 +1242,11 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entr= ies[], int entry_order) plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - spin_lock(&si->lock); n_ret =3D scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal, swp_entries, order); - spin_unlock(&si->lock); put_swap_device(si); if (n_ret || size > 1) goto check_out; - cond_resched(); } =20 spin_lock(&swap_avail_lock); @@ -1378,9 +1459,7 @@ static bool __swap_entries_free(struct swap_info_stru= ct *si, if (!has_cache) { for (i =3D 0; i < nr; i++) zswap_invalidate(swp_entry(si->type, offset + i)); - spin_lock(&si->lock); swap_entry_range_free(si, entry, nr); - spin_unlock(&si->lock); } return has_cache; =20 @@ -1409,16 +1488,27 @@ static void swap_entry_range_free(struct swap_info_= struct *si, swp_entry_t entry unsigned char *map_end =3D map + nr_pages; struct swap_cluster_info *ci; =20 + /* It should never free entries across different clusters */ + VM_BUG_ON((offset / SWAPFILE_CLUSTER) !=3D ((offset + nr_pages - 1) / SWA= PFILE_CLUSTER)); 
+ ci =3D lock_cluster(si, offset); + VM_BUG_ON(cluster_is_empty(ci)); + VM_BUG_ON(ci->count < nr_pages); + + ci->count -=3D nr_pages; do { VM_BUG_ON(*map !=3D SWAP_HAS_CACHE); *map =3D 0; } while (++map < map_end); - dec_cluster_info_page(si, ci, nr_pages); - unlock_cluster(ci); =20 mem_cgroup_uncharge_swap(entry, nr_pages); swap_range_free(si, offset, nr_pages); + + if (!ci->count) + free_cluster(si, ci); + else + partial_free_cluster(si, ci); + unlock_cluster(ci); } =20 static void cluster_swap_free_nr(struct swap_info_struct *si, @@ -1490,9 +1580,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t = entry) ci =3D lock_cluster(si, offset); if (size > 1 && swap_is_has_cache(si, offset, size)) { unlock_cluster(ci); - spin_lock(&si->lock); swap_entry_range_free(si, entry, size); - spin_unlock(&si->lock); return; } for (int i =3D 0; i < size; i++, entry.val++) { @@ -1507,46 +1595,19 @@ void put_swap_folio(struct folio *folio, swp_entry_= t entry) unlock_cluster(ci); } =20 -static int swp_entry_cmp(const void *ent1, const void *ent2) -{ - const swp_entry_t *e1 =3D ent1, *e2 =3D ent2; - - return (int)swp_type(*e1) - (int)swp_type(*e2); -} - void swapcache_free_entries(swp_entry_t *entries, int n) { - struct swap_info_struct *si, *prev; int i; + struct swap_info_struct *si =3D NULL; =20 if (n <=3D 0) return; =20 - prev =3D NULL; - si =3D NULL; - - /* - * Sort swap entries by swap device, so each lock is only taken once. - * nr_swapfiles isn't absolutely correct, but the overhead of sort() is - * so low that it isn't necessary to optimize further. 
- */ - if (nr_swapfiles > 1) - sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL); for (i =3D 0; i < n; ++i) { si =3D _swap_info_get(entries[i]); - - if (si !=3D prev) { - if (prev !=3D NULL) - spin_unlock(&prev->lock); - if (si !=3D NULL) - spin_lock(&si->lock); - } if (si) swap_entry_range_free(si, entries[i], 1); - prev =3D si; } - if (si) - spin_unlock(&si->lock); } =20 int __swap_count(swp_entry_t entry) @@ -1799,10 +1860,8 @@ swp_entry_t get_swap_page_of_type(int type) =20 /* This is called for allocating swap entry, not cache */ if (get_swap_device_info(si)) { - spin_lock(&si->lock); if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0= )) atomic_long_dec(&nr_swap_pages); - spin_unlock(&si->lock); put_swap_device(si); } fail: @@ -3142,6 +3201,7 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, cluster =3D per_cpu_ptr(si->percpu_cluster, cpu); for (i =3D 0; i < SWAP_NR_ORDERS; i++) cluster->next[i] =3D SWAP_NEXT_INVALID; + local_lock_init(&cluster->lock); } =20 /* @@ -3165,7 +3225,7 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, for (i =3D 0; i < SWAP_NR_ORDERS; i++) { INIT_LIST_HEAD(&si->nonfull_clusters[i]); INIT_LIST_HEAD(&si->frag_clusters[i]); - si->frag_cluster_nr[i] =3D 0; + atomic_long_set(&si->frag_cluster_nr[i], 0); } =20 /* @@ -3647,7 +3707,6 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) */ goto outer; } - spin_lock(&si->lock); =20 offset =3D swp_offset(entry); =20 @@ -3712,7 +3771,6 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) spin_unlock(&si->cont_lock); out: unlock_cluster(ci); - spin_unlock(&si->lock); put_swap_device(si); outer: if (page) --=20 2.47.1 From nobody Thu Dec 18 10:00:10 2025 Received: from mail-pl1-f180.google.com (mail-pl1-f180.google.com [209.85.214.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by 
smtp.subspace.kernel.org (Postfix) with ESMTPS id 681A51C4617 for ; Mon, 13 Jan 2025 18:00:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.180 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736791232; cv=none; b=seiSF0cxgrNbPSLanmmFrFa+b5ad3sh+yBxi+N5tE0U2b1DVVizLRsTWAQ7nWIsxXPw6hZw8ZNzA3Vi9GT0lD5GQ7X9vbvoXvuRzh7wDugWNwzJgtHCW7sCAX5wLIPFqFQbZgL64LfCSzQ4N2k6Jhpx1rOxUJnfMDBAFVPrFz58= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736791232; c=relaxed/simple; bh=3izO00mkjWU4GzS6p74TzLvk4ZT0MF/QoBFCEC3IZ1s=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=ES+i9BGExk2D3Qp91wLxF6QBABJ5B3kCUO+nVPcouD20jADJ15VjpFqqV850eva195Nce3gnV5eLLJNyyTvQbj5lRZkK0Phgep32T+gKhRYFkZhYeJhWVINV33a9OFWtjIJbj5cXXAB3juwAOpBtWgmehywnY92XJvfOqEErkMk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=LDlQJm0r; arc=none smtp.client-ip=209.85.214.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="LDlQJm0r" Received: by mail-pl1-f180.google.com with SMTP id d9443c01a7336-21661be2c2dso78046645ad.1 for ; Mon, 13 Jan 2025 10:00:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1736791230; x=1737396030; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:reply-to:references :in-reply-to:message-id:date:subject:cc:to:from:from:to:cc:subject :date:message-id:reply-to; bh=w3/A4K7Bt53NzbRkeVq3PRJRV4Gi+Nq+TM9VfYB4Gnc=; b=LDlQJm0rkVquabjEEkSglCJtrVtcrHY61BStBQUt7Dh+6cs2Hoy08wfWaZzSPennSw 
XZ/As0D9ICu3UeccwemdxwDvBY/gqSGXIJvVAqp2SRqfSsiF8oQuE6oPaLA+ZvL3Q0+Y sa36zieuk6JvEbwzG7ntPUrL2ub9BwUgmdHhBqZqEDjBfPrqsce1tYdEIFCcD036jd2e INCzx7hH/rYPHIfGj/vF0jvLCRYB2wtC2MOWFtIv7Rz/edFCNZ5wLvjBePO/mudf8Q04 9w6km8MSoF5e+jT4qOuzx4+9QIlbYWR5S9ochQd8jFu6H8BU0xfFr4J0ohzfUDv275a1 nh/A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736791230; x=1737396030; h=content-transfer-encoding:mime-version:reply-to:references :in-reply-to:message-id:date:subject:cc:to:from:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=w3/A4K7Bt53NzbRkeVq3PRJRV4Gi+Nq+TM9VfYB4Gnc=; b=XACTjWoBhkCTXqDEe5XELJFOG1O9C8YGhTJlq1E+cqBg+d9ZJAGPJsawaPf5Zz/RrM HWul65DdNFX321V3RA2bGcv2FaRsH3Gnc3K/k3dDDgHPMpSDtR9gakeM+38XHMrS3Esx 3zqEXaaTq8Yz0PTpt/AF0RPfwPZ1AI93RN34uLWNSo/Z0eSKs63kf05uhfaB6JfLn4Du et2vAgphMHSy8iRiAkVfMxzghTbhc7V+9EYAaPR6Zv6a0Jw4SqG9ctQAnwuKNxMt/QLd MjBJ7GFkO98zV0wPY6QUNOskr0hc1YO8/OkIlYQMBh/WmJlfw7hE2VEpAqU6VE90o+I2 +/WA== X-Forwarded-Encrypted: i=1; AJvYcCUFwQrKZaQFVncNiWTc5hAyKHSKmDoPipzvfPua4FhZgbDBxgHSsw1hHa5Epu19jl2giNwXardRlKgNQn0=@vger.kernel.org X-Gm-Message-State: AOJu0YyQF/Y0FxEFF0XGJfaKsyLTOouk4e8V1g3kiF5V6YiufP4wlPkJ 9NYU2uNAan9i0Fw6U6fdmFEnDD8z+nMkCyvyWiBLstIhm0AKhM4AJthiuxOROJI= X-Gm-Gg: ASbGnctMAsJSzarSY6470peRDBEjMZyinbNoLVfyJ9xLSRXQtIrSehZg6qUN3NSDpe1 Vi2Q1LFsUkbjQL8/rAU8B74siaXdlx6fuY0GA0/n+zADfOc/VvZ7oc3Tsf12vrSO+98KeJQibmN 4TTewH38PVos7yxcSmLnD/RLSLadG1NDaO9GXKM5ykk2BlHv2w3htidgffueNq89mgErssLSfxK Gn3JwcWDkKVZsVUfG5Xxn3ND9N8TYKV+UQGYoRVdlZ8h+EnXuyaO3+asnpsOBDwvBZ3PAYlGpDx 8g== X-Google-Smtp-Source: AGHT+IEb+lacgc6xktF1aaBt7hTsj1nPaNZqgDFnny/A4HmF9V2aouf4o4ZmssdJsfpBSua/NdU+2Q== X-Received: by 2002:a17:902:ea09:b0:216:2a36:5b2e with SMTP id d9443c01a7336-21a83f76879mr320846885ad.32.1736791229397; Mon, 13 Jan 2025 10:00:29 -0800 (PST) Received: from KASONG-MC4.tencent.com ([115.171.41.132]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-21a9f21aba7sm57023635ad.113.2025.01.13.10.00.26 (version=TLS1_3 
cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Mon, 13 Jan 2025 10:00:28 -0800 (PST) From: Kairui Song To: linux-mm@kvack.org Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Baoquan He , Nhat Pham , Johannes Weiner , Kalesh Singh , linux-kernel@vger.kernel.org, Kairui Song Subject: [PATCH v4 10/13] mm, swap: simplify percpu cluster updating Date: Tue, 14 Jan 2025 01:57:29 +0800 Message-ID: <20250113175732.48099-11-ryncsn@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com> References: <20250113175732.48099-1-ryncsn@gmail.com> Reply-To: Kairui Song Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Kairui Song

Instead of returning it through an output argument, we can simply store the next cluster offset in the fixed percpu location, which reduces the stack usage and simplifies the function:

Object size:
./scripts/bloat-o-meter mm/swapfile.o mm/swapfile.o.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-271 (-271)
Function                     old     new   delta
get_swap_pages              2847    2733    -114
alloc_swap_scan_cluster      894     737    -157
Total: Before=3D30833, After=3D30562, chg -0.88%

Stack usage:
Before:
swapfile.c:1190:5:get_swap_pages 240 static

After:
swapfile.c:1185:5:get_swap_pages 216 static

Signed-off-by: Kairui Song
--- include/linux/swap.h | 4 +-- mm/swapfile.c | 66 +++++++++++++++++++------------------------- 2 files changed, 31 insertions(+), 39 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index c4ff31cb6bde..4c1d2e69689f 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -275,9 +275,9 @@ enum swap_cluster_flags { * The first page in the swap file is the swap header, which is always mar= ked * bad to prevent it from being allocated as an entry. 
This also prevents = the * cluster to which it belongs being marked free. Therefore 0 is safe to u= se as - * a sentinel to indicate next is not valid in percpu_cluster. + * a sentinel to indicate an entry is not valid. */ -#define SWAP_NEXT_INVALID 0 +#define SWAP_ENTRY_INVALID 0 =20 #ifdef CONFIG_THP_SWAP #define SWAP_NR_ORDERS (PMD_ORDER + 1) diff --git a/mm/swapfile.c b/mm/swapfile.c index 489ac6997a0c..6da2f3aa55fb 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -765,23 +765,23 @@ static bool cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster return true; } =20 -static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, u= nsigned long offset, - unsigned int *foundp, unsigned int order, +/* Try use a new cluster for current CPU and allocate from it. */ +static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long offset, + unsigned int order, unsigned char usage) { - unsigned long start =3D offset & ~(SWAPFILE_CLUSTER - 1); + unsigned int next =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; + unsigned long start =3D ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages =3D 1 << order; bool need_reclaim, ret; - struct swap_cluster_info *ci; =20 - ci =3D &si->cluster_info[offset / SWAPFILE_CLUSTER]; lockdep_assert_held(&ci->lock); =20 - if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) { - offset =3D SWAP_NEXT_INVALID; + if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) goto out; - } =20 for (end -=3D nr_pages; offset <=3D end; offset +=3D nr_pages) { need_reclaim =3D false; @@ -795,34 +795,27 @@ static unsigned int alloc_swap_scan_cluster(struct sw= ap_info_struct *si, unsigne * cluster has no flag set, and change of list * won't cause fragmentation. 
*/ - if (!cluster_is_usable(ci, order)) { - offset =3D SWAP_NEXT_INVALID; + if (!cluster_is_usable(ci, order)) goto out; - } if (cluster_is_empty(ci)) offset =3D start; /* Reclaim failed but cluster is usable, try next */ if (!ret) continue; } - if (!cluster_alloc_range(si, ci, offset, usage, order)) { - offset =3D SWAP_NEXT_INVALID; - goto out; - } - *foundp =3D offset; - if (ci->count =3D=3D SWAPFILE_CLUSTER) { - offset =3D SWAP_NEXT_INVALID; - goto out; - } + if (!cluster_alloc_range(si, ci, offset, usage, order)) + break; + found =3D offset; offset +=3D nr_pages; + if (ci->count < SWAPFILE_CLUSTER && offset <=3D end) + next =3D offset; break; } - if (offset > end) - offset =3D SWAP_NEXT_INVALID; out: relocate_cluster(si, ci); unlock_cluster(ci); - return offset; + __this_cpu_write(si->percpu_cluster->next[order], next); + return found; } =20 /* Return true if reclaimed a whole cluster */ @@ -891,8 +884,8 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o if (cluster_is_usable(ci, order)) { if (cluster_is_empty(ci)) offset =3D cluster_offset(si, ci); - offset =3D alloc_swap_scan_cluster(si, offset, &found, - order, usage); + found =3D alloc_swap_scan_cluster(si, ci, offset, + order, usage); } else { unlock_cluster(ci); } @@ -903,8 +896,8 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o new_cluster: ci =3D isolate_lock_cluster(si, &si->free_clusters); if (ci) { - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci), + order, usage); if (found) goto done; } @@ -917,8 +910,8 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o unsigned int frags =3D 0, frags_existing; =20 while ((ci =3D isolate_lock_cluster(si, &si->nonfull_clusters[order]))) { - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D 
alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci), + order, usage); if (found) goto done; /* Clusters failed to allocate are moved to frag_clusters */ @@ -935,8 +928,8 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o * per-CPU usage, but they could contain newly released * reclaimable (eg. lazy-freed swap cache) slots. */ - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci), + order, usage); if (found) goto done; frags++; @@ -962,21 +955,20 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o */ while ((ci =3D isolate_lock_cluster(si, &si->frag_clusters[o]))) { atomic_long_dec(&si->frag_cluster_nr[o]); - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci), + 0, usage); if (found) goto done; } =20 while ((ci =3D isolate_lock_cluster(si, &si->nonfull_clusters[o]))) { - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci), + 0, usage); if (found) goto done; } } done: - __this_cpu_write(si->percpu_cluster->next[order], offset); local_unlock(&si->percpu_cluster->lock); =20 return found; @@ -3200,7 +3192,7 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, =20 cluster =3D per_cpu_ptr(si->percpu_cluster, cpu); for (i =3D 0; i < SWAP_NR_ORDERS; i++) - cluster->next[i] =3D SWAP_NEXT_INVALID; + cluster->next[i] =3D SWAP_ENTRY_INVALID; local_lock_init(&cluster->lock); } =20 --=20 2.47.1 From nobody Thu Dec 18 10:00:10 2025 Received: from mail-pl1-f171.google.com (mail-pl1-f171.google.com [209.85.214.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 
DF3EA1C5F1C for ; Mon, 13 Jan 2025 18:00:33 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736791235; cv=none; b=oAf707o3tHW9abxncHYpKCtGj6k8Oepv6ia0INcVgBifF8etwg7S+KVQRrB1b30+RvP8oH4ywI2dddqlmEmSCTb1VQKJeRcNSLAhTWw34OBsL7kVftfSTibIthmYH/Z3bw2kG7hH/Ot6dkp04eTFtgR63ClDhIoGfiHObNkIdPY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736791235; c=relaxed/simple; bh=Btpk8Q4K9TruqKsylbnCNcfi6xBpUcgdCeqlnyeFJbk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=JbdEX5Bb/qkcMQlq9mJRk05mW0kg/l2KjBJdqxPG0aDeWHZs6P4YFfgh1Sd0pxxruehdVKdkAFL5vBCNmyGi5J5EP02P4Yo6Sp3tP4NV3N2SNdrv04H+a1LklERGhF+hg/K0zTcMgHA925l9J0eou0+g4akRqsTZN77WFLvbjJE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=EmW7idgz; arc=none smtp.client-ip=209.85.214.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="EmW7idgz" Received: by mail-pl1-f171.google.com with SMTP id d9443c01a7336-216395e151bso56011225ad.0 for ; Mon, 13 Jan 2025 10:00:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1736791233; x=1737396033; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:reply-to:references :in-reply-to:message-id:date:subject:cc:to:from:from:to:cc:subject :date:message-id:reply-to; bh=84ptMZMYZj3rVArRcmuRvTI8mbqV19hN4DkCoI7h/g0=; b=EmW7idgzu6jAHcUNDyVg8Ikt+bra3UKpeZfY+Eyy9LQsdZnUgPDauf/sE3Di3OIcaf 
lhWXjbh+tCN45W2DWanEv2M+CpJC9HVQ3da7/8emcPE98t/BFoJ5cFMGijSt9mtjxxiI hqNUe8aEfUQzinnCwr8eSv3agwsGuDC1m1UXNoDX7m5Q//zmSUiGjX0pSRy3SWsdmB+j kY3d77AtRLBY+98sI2cb3+LPssJy+62/SmRkATnLUVFpyjuEkfNTFKNISt/uls5i4Sg2 SLexRuf5hdmFW8PlgYMdyMF297QoA2VHmWnMfpAWXnfYjg5vaHUSxlUBObzIx7epSUhq XkEA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736791233; x=1737396033; h=content-transfer-encoding:mime-version:reply-to:references :in-reply-to:message-id:date:subject:cc:to:from:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=84ptMZMYZj3rVArRcmuRvTI8mbqV19hN4DkCoI7h/g0=; b=qGZ1ZNH0CHFPou92e4PNL2s6DaXD2N2yYUfqkYNkmste3nldiJYNdicyzku47vqxPK 1pXf5FzhePbCYHqk5CrD8D2I8WfqUdK11b1j4RWWaMEtqi5tIaIL6Y3vW7i3P2jama+5 0mSjnVb8RADsj+m/++JNeydRcWFdPJrvoS3FnDBIY0cxXPfIOUcHMvP6FvtGGsSJdZa4 5Gr9ch9dV8uKopHQISvzw3bg5ckYKcyPxvSViRujq8FrNtKWwbp5TGMxuHa/fI6VkNLt DAcPmVO5td5R47nWpjuwhMVCXPCJjp5HFKW3JuAzyEfgXI++4tH2ACzku9+/oDXiXuVG m0QQ== X-Forwarded-Encrypted: i=1; AJvYcCUDbzrSeWRWDaCYTgZrcWlPJQRiDhJ/e9WsteUScZoyCaUY8MEEGrsqLjF3Dt2bFAmNeuYGHMEDnpW4Mds=@vger.kernel.org X-Gm-Message-State: AOJu0Yz6Rjl0XHPDnUchSNuOB9azlr7JDOjUHJev8Gn2ANbKNKWqFDMY htlXUlIlgod1Nqut87jhKcYrIcjiLzN1wFa5quwhpq56cH+qghNY X-Gm-Gg: ASbGncv/VkOsgdMwkJBxz75AkcNcsPdgk99t8tzPkNeSFrXSdfQQbbi2P3V7Bp3HVmo C5SXR0XkvspthBywSwtvW80Ci/WbW/aSc6KeUiYMzINS4o/V5TqgFWY1t8N8xNM99SMcaCdV5+w NGSeLgJ8YD50R1ysus7McAiu34snGz1hzfe/aTugiLXoL2z818CJ29WzoNr3OhNXVU2HcjZOJ5G GHXd1F+Pq2AIjQRtWzWrDFDUSMI3oKZMrFNn9CllqFUKMM+ypZECIHFTOe8+UxoSVrhiVRJb/us wA== X-Google-Smtp-Source: AGHT+IHndLsitkjnbmdqmvdITPRrE8bpKXtfyrzCQgnWcbfVYdpyJsxfkIpMOC/VykeL5wA03UubJA== X-Received: by 2002:a17:902:fc47:b0:215:7e49:8202 with SMTP id d9443c01a7336-21ad9f3c0a0mr170609865ad.13.1736791233083; Mon, 13 Jan 2025 10:00:33 -0800 (PST) Received: from KASONG-MC4.tencent.com ([115.171.41.132]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-21a9f21aba7sm57023635ad.113.2025.01.13.10.00.29 (version=TLS1_3 
cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Mon, 13 Jan 2025 10:00:32 -0800 (PST) From: Kairui Song To: linux-mm@kvack.org Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Baoquan He , Nhat Pham , Johannes Weiner , Kalesh Singh , linux-kernel@vger.kernel.org, Kairui Song Subject: [PATCH v4 11/13] mm, swap: introduce a helper for retrieving cluster from offset Date: Tue, 14 Jan 2025 01:57:30 +0800 Message-ID: <20250113175732.48099-12-ryncsn@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com> References: <20250113175732.48099-1-ryncsn@gmail.com> Reply-To: Kairui Song Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Kairui Song

Retrieving the cluster info from an offset is a common operation, so introduce a helper for it.
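For illustration only, here is a userspace sketch of the lookup that the helper wraps. The struct layouts and the SWAPFILE_CLUSTER value of 512 are simplified stand-ins assumed for this example, and same_cluster() is a hypothetical name that mirrors the range check in swap_entry_range_free(); none of this is the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for this sketch; the kernel derives it from its config. */
#define SWAPFILE_CLUSTER 512UL

/* Simplified stand-ins for the kernel structures. */
struct swap_cluster_info {
	unsigned int count;
};

struct swap_info_struct {
	struct swap_cluster_info *cluster_info;	/* one entry per cluster */
};

/* Map a swap offset to the cluster that contains it. */
static inline struct swap_cluster_info *
offset_to_cluster(struct swap_info_struct *si, unsigned long offset)
{
	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
}

/* Hypothetical helper: true if [offset, offset + nr_pages) stays in one cluster. */
static inline bool same_cluster(struct swap_info_struct *si,
				unsigned long offset, unsigned long nr_pages)
{
	assert(nr_pages > 0);
	return offset_to_cluster(si, offset) ==
	       offset_to_cluster(si, offset + nr_pages - 1);
}
```

With 512 slots per cluster, offsets 0 through 511 map to cluster 0 and offset 512 maps to cluster 1, so a 4-page range starting at offset 510 would straddle two clusters.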
Suggested-by: Chris Li Signed-off-by: Kairui Song --- mm/swapfile.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 6da2f3aa55fb..37d540fa0310 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -424,6 +424,12 @@ static inline unsigned int cluster_index(struct swap_i= nfo_struct *si, return ci - si->cluster_info; } =20 +static inline struct swap_cluster_info *offset_to_cluster(struct swap_info= _struct *si, + unsigned long offset) +{ + return &si->cluster_info[offset / SWAPFILE_CLUSTER]; +} + static inline unsigned int cluster_offset(struct swap_info_struct *si, struct swap_cluster_info *ci) { @@ -435,7 +441,7 @@ static inline struct swap_cluster_info *lock_cluster(st= ruct swap_info_struct *si { struct swap_cluster_info *ci; =20 - ci =3D &si->cluster_info[offset / SWAPFILE_CLUSTER]; + ci =3D offset_to_cluster(si, offset); spin_lock(&ci->lock); =20 return ci; @@ -1480,10 +1486,10 @@ static void swap_entry_range_free(struct swap_info_= struct *si, swp_entry_t entry unsigned char *map_end =3D map + nr_pages; struct swap_cluster_info *ci; =20 - /* It should never free entries across different clusters */ - VM_BUG_ON((offset / SWAPFILE_CLUSTER) !=3D ((offset + nr_pages - 1) / SWA= PFILE_CLUSTER)); - ci =3D lock_cluster(si, offset); + + /* It should never free entries across different clusters */ + VM_BUG_ON(ci !=3D offset_to_cluster(si, offset + nr_pages - 1)); VM_BUG_ON(cluster_is_empty(ci)); VM_BUG_ON(ci->count < nr_pages); =20 --=20 2.47.1 From nobody Thu Dec 18 10:00:10 2025 Received: from mail-pl1-f177.google.com (mail-pl1-f177.google.com [209.85.214.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A80401CD214 for ; Mon, 13 Jan 2025 18:00:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.177 ARC-Seal: i=1; a=rsa-sha256; 
d=subspace.kernel.org; s=arc-20240116; t=1736791239; cv=none; b=d6damN9k46En2yVMw66YdvlBPhi4aQYutTOXeh9ZNJlfEGGYN+GAS2j2T4KX6nBXvPFwH+ePGJx+4RDAwu5M1CvcnQmmEmg2nC6xD7OvfvlcxCmLD0wIgNxshvxokh9izSjs/4KVqGtSvV4s1qAJAKp3waQxPW61uE7Xo+YXdk4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736791239; c=relaxed/simple; bh=TWJeHP5g0DrHX9uDpXjIGxPX2jm6qRBY6opzZAOIqVM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=K8Pg8tbYaRzxJ4Xnj1sGhTRB/a4Vf5UxwPFh/cb3yp0QK2VMkI8DaGpg1qQakIqUuG5gMj4a8CyFpNk6C5MVp3yY0WHNFLxNwJA5YkbFYbC/JGmUbapePZTJ528fp0XsbDC3DFhGDjMIq/Yr1/oerI+OLIcOcrcH+nWzfe1+hOw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=gsh2ttjl; arc=none smtp.client-ip=209.85.214.177 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="gsh2ttjl" Received: by mail-pl1-f177.google.com with SMTP id d9443c01a7336-216395e151bso56012075ad.0 for ; Mon, 13 Jan 2025 10:00:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1736791237; x=1737396037; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:reply-to:references :in-reply-to:message-id:date:subject:cc:to:from:from:to:cc:subject :date:message-id:reply-to; bh=At1biLHm6e57SEHEw6uwq9JjLDEXAlo5gNBhgwUICrU=; b=gsh2ttjlf0FFKGjchpp/aTJoLdcQRAIoO9gzoB2xuDbRg18LG7YNfU8qBCeZkuOvuD SK8RRKhJCAT0RthDd20ArDb/lfRun85vLcbwB/jE/W3ZaKeXE5uXrX1pOBhjmh4jb/LL 8xDGEhYh6gzvWb5egYssK4+rF+ydz+WHq80KK2zHyQC1itOfpNg28CeffYmEG2F2vyy3 wSwBLY4O1gfJBXCKy9BV2bLPQKQ7k3BS/RB51AQ/J61wjsH2NzRZQrOMIQoduGgKyRpX 
E1kEqIfkFy5CSKKODLfz4KA6zQZ+cHAZF1PjADhq2E/X0rVfPcMll6NDcPhLcoGjxkO6 NN5A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736791237; x=1737396037; h=content-transfer-encoding:mime-version:reply-to:references :in-reply-to:message-id:date:subject:cc:to:from:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=At1biLHm6e57SEHEw6uwq9JjLDEXAlo5gNBhgwUICrU=; b=e6jyZJZDNRGoNV4OW/ucy5iMlKPBao3ASvk30uBHmaC5y0/ybM6u4uHF7z2op4bG56 0vc5TCfd5Oq9Uo9geWVQUE9TkVvh6uEBVFmemsPHvwduSyTrGTx2Gi0ks+qLSMBcwGmF AaEuMIzoSyWKjRrmOu3yx579M4+SkayM8Axmaosc50v2KKJjwBBULaMnTGkmMH5mrrHT DpxzDOTX7F3WMAq9Qk8L4L3kryTJI96slzMm3sTT1+3gTWV0E+nCWPjhsv93m8JWoY80 qS5h0Y1nhoCxUiwUnP1qDnSoHgGmDyBGvBrALBxkUEVag+kKqxl9d/IYdPlTuGj34ax/ TczA== X-Forwarded-Encrypted: i=1; AJvYcCU6NmzRYszJLVlnJ0rsYMEfEZ5OyFWd/4LxioM3Adtxcykm1BiQjKNRH+fQqtdKVSZfgM3bF6kRympIYV0=@vger.kernel.org X-Gm-Message-State: AOJu0YxbEQCfOeciyjEPuzKhPX2weR6/Jx8v88Od2ybWX/V3qHnKZMAe 6YSs65jed24Cx7x2OA3PC7pVqBakYywg8boAh0xZm5r7GoOu3SWe X-Gm-Gg: ASbGncuRlDmDBXQy1cpGUrbn2xJ6oE5A3Ka/UHWNjMlemjRaL+rnhU3FBvi7TcuGib9 5LSENhwvvSbO4TTPALAXrMhHc5e3IO1sguREbupE3U5/lUTP95+Th5ofEc3HZrJ3r9SJKczPOSG 6J8HgirlRGpZZSVpwluILkTLpACRPYScUZakPTYDwS0oy5TQuyRv4ICfkQSg56LDxbeFyMBnojz PYSPyDf+pZ5m3gG7fQgGe0BKfTbVlnYR1Kwt75lqS3Mr1R7edfubrLYh5PrgDi+wkU/2GLVHFo7 iw== X-Google-Smtp-Source: AGHT+IHxGwhG9J2j0xHRNOesp9JYTQhHyHxaHzGPoN76JvXqmpGF+xTYQiMe5d834HF94l/UImFywQ== X-Received: by 2002:a17:902:d4c9:b0:216:6ef9:60d with SMTP id d9443c01a7336-21a8d6e9625mr285446795ad.23.1736791236839; Mon, 13 Jan 2025 10:00:36 -0800 (PST) Received: from KASONG-MC4.tencent.com ([115.171.41.132]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-21a9f21aba7sm57023635ad.113.2025.01.13.10.00.33 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Mon, 13 Jan 2025 10:00:36 -0800 (PST) From: Kairui Song To: linux-mm@kvack.org Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, 
Ying" , Baoquan He , Nhat Pham , Johannes Weiner , Kalesh Singh , linux-kernel@vger.kernel.org, Kairui Song Subject: [PATCH v4 12/13] mm, swap: use a global swap cluster for non-rotation devices Date: Tue, 14 Jan 2025 01:57:31 +0800 Message-ID: <20250113175732.48099-13-ryncsn@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com> References: <20250113175732.48099-1-ryncsn@gmail.com> Reply-To: Kairui Song Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Kairui Song

Non-rotational devices (SSD / ZRAM) can tolerate fragmentation, so for them the goal of the SWAP allocator is to avoid contention for clusters. It uses a per-CPU cluster design, and each CPU will use a different cluster as much as possible.

However, HDDs are very sensitive to fragmentation, and contention is trivial in comparison. Therefore, we use one global cluster for them instead. This ensures that each order will be written to the same cluster as much as possible, which helps make the I/O more continuous.

This keeps the performance of the cluster allocator as good as that of the old allocator.
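As a rough userspace sketch of the choice described above (not the kernel code): the allocation hint for an order lives in a per-CPU slot on solid-state devices and in one shared slot on rotational ones. All names and sizes below (hint_table, hint_slot(), the SKETCH_* constants) are hypothetical, and locking is left out; the patch itself takes a local lock for the per-CPU case and a spinlock for the global case.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS   4	/* assumed CPU count for the sketch */
#define SKETCH_NR_ORDERS 3	/* assumed number of allocation orders */
#define SKETCH_INVALID   0u	/* 0 means "no hint", like SWAP_ENTRY_INVALID */

/* Hypothetical model of where the "next offset" hints are kept. */
struct hint_table {
	bool solid_state;	/* true: SSD / ZRAM, false: HDD */
	unsigned int percpu_next[SKETCH_NR_CPUS][SKETCH_NR_ORDERS];
	unsigned int global_next[SKETCH_NR_ORDERS];
};

/* Pick the slot that holds the hint for this CPU and order. */
static unsigned int *hint_slot(struct hint_table *t, int cpu, int order)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	assert(order >= 0 && order < SKETCH_NR_ORDERS);
	return t->solid_state ? &t->percpu_next[cpu][order]
			      : &t->global_next[order];
}

static unsigned int read_hint(struct hint_table *t, int cpu, int order)
{
	return *hint_slot(t, cpu, order);
}

static void write_hint(struct hint_table *t, int cpu, int order,
		       unsigned int next)
{
	*hint_slot(t, cpu, order) = next;
}
```

In this model, a hint written by one CPU on an HDD-like table is seen by every other CPU, so allocations of the same order keep landing next to each other. On an SSD-like table each CPU only sees its own hint.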
Test results after this commit, compared to those before this series, using 'make -j32' with tinyconfig, a 1G memcg limit, and HDD swap:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU (0avgtext+0avgdata 157284maxre= sident)k
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU (0avgtext+0avgdata 157260maxre= sident)k
2548728inputs+0outputs (235471major+4238110minor)pagefaults

Suggested-by: Chris Li
Signed-off-by: Kairui Song
--- include/linux/swap.h | 2 ++ mm/swapfile.c | 51 ++++++++++++++++++++++++++++++++------------ 2 files changed, 39 insertions(+), 14 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 4c1d2e69689f..b13b72645db3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -318,6 +318,8 @@ struct swap_info_struct { unsigned int pages; /* total of usable pages of swap */ atomic_long_t inuse_pages; /* number of those currently in use */ struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap locatio= n */ + struct percpu_cluster *global_cluster; /* Use one global cluster for rota= ting device */ + spinlock_t global_cluster_lock; /* Serialize usage of global cluster */ struct rb_root swap_extent_root;/* root of the swap extent rbtree */ struct block_device *bdev; /* swap device or bdev of swap file */ struct file *swap_file; /* seldom referenced */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 37d540fa0310..793b2fd1a2a8 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -820,7 +820,10 @@ static unsigned int alloc_swap_scan_cluster(struct swa= p_info_struct *si, out: relocate_cluster(si, ci); unlock_cluster(ci); - __this_cpu_write(si->percpu_cluster->next[order], next); + if (si->flags & SWP_SOLIDSTATE) + __this_cpu_write(si->percpu_cluster->next[order], next); + else + si->global_cluster->next[order] =3D next; return found; } =20 @@ -881,9 +884,16 @@ static 
unsigned long cluster_alloc_swap_entry(struct s= wap_info_struct *si, int o struct swap_cluster_info *ci; unsigned int offset, found =3D 0; =20 - /* Fast path using per CPU cluster */ - local_lock(&si->percpu_cluster->lock); - offset =3D __this_cpu_read(si->percpu_cluster->next[order]); + if (si->flags & SWP_SOLIDSTATE) { + /* Fast path using per CPU cluster */ + local_lock(&si->percpu_cluster->lock); + offset =3D __this_cpu_read(si->percpu_cluster->next[order]); + } else { + /* Serialize HDD SWAP allocation for each device. */ + spin_lock(&si->global_cluster_lock); + offset =3D si->global_cluster->next[order]; + } + if (offset) { ci =3D lock_cluster(si, offset); /* Cluster could have been used by another order */ @@ -975,8 +985,10 @@ static unsigned long cluster_alloc_swap_entry(struct s= wap_info_struct *si, int o } } done: - local_unlock(&si->percpu_cluster->lock); - + if (si->flags & SWP_SOLIDSTATE) + local_unlock(&si->percpu_cluster->lock); + else + spin_unlock(&si->global_cluster_lock); return found; } =20 @@ -2784,6 +2796,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special= file) mutex_unlock(&swapon_mutex); free_percpu(p->percpu_cluster); p->percpu_cluster =3D NULL; + kfree(p->global_cluster); + p->global_cluster =3D NULL; vfree(swap_map); kvfree(zeromap); kvfree(cluster_info); @@ -3189,17 +3203,24 @@ static struct swap_cluster_info *setup_clusters(str= uct swap_info_struct *si, for (i =3D 0; i < nr_clusters; i++) spin_lock_init(&cluster_info[i].lock); =20 - si->percpu_cluster =3D alloc_percpu(struct percpu_cluster); - if (!si->percpu_cluster) - goto err_free; + if (si->flags & SWP_SOLIDSTATE) { + si->percpu_cluster =3D alloc_percpu(struct percpu_cluster); + if (!si->percpu_cluster) + goto err_free; =20 - for_each_possible_cpu(cpu) { - struct percpu_cluster *cluster; + for_each_possible_cpu(cpu) { + struct percpu_cluster *cluster; =20 - cluster =3D per_cpu_ptr(si->percpu_cluster, cpu); + cluster =3D per_cpu_ptr(si->percpu_cluster, cpu); + for (i =3D 
0; i < SWAP_NR_ORDERS; i++) + cluster->next[i] =3D SWAP_ENTRY_INVALID; + local_lock_init(&cluster->lock); + } + } else { + si->global_cluster =3D kmalloc(sizeof(*si->global_cluster), GFP_KERNEL); for (i =3D 0; i < SWAP_NR_ORDERS; i++) - cluster->next[i] =3D SWAP_ENTRY_INVALID; - local_lock_init(&cluster->lock); + si->global_cluster->next[i] =3D SWAP_ENTRY_INVALID; + spin_lock_init(&si->global_cluster_lock); } =20 /* @@ -3473,6 +3494,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialf= ile, int, swap_flags) bad_swap: free_percpu(si->percpu_cluster); si->percpu_cluster =3D NULL; + kfree(si->global_cluster); + si->global_cluster =3D NULL; inode =3D NULL; destroy_swap_extents(si); swap_cgroup_swapoff(si->type); --=20 2.47.1
From nobody Thu Dec 18 10:00:10 2025
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 13/13] mm, swap_slots: remove slot cache for freeing path
Date: Tue, 14 Jan 2025 01:57:32 +0800
Message-ID: <20250113175732.48099-14-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

The slot cache for freeing path is mostly for reducing the overhead of si->lock.
As we have basically eliminated the si->lock usage for the freeing path,
it can be removed.

This helps simplify the code, and avoids swap entries being held in the
cache upon freeing. The delayed freeing of entries has been causing
trouble for further optimizations for zswap [1], and in theory will also
cause more fragmentation and extra overhead.

Tests building the Linux kernel showed that both performance and
fragmentation are better without the cache:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, avg of 4 test runs:
Before:
Sys time: 36047.78, Real time: 472.43
After: (-7.6% sys time, -7.3% real time)
Sys time: 33314.76, Real time: 437.67

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, avg of 4 test runs:
Before:
Sys time: 46859.04, Real time: 562.63
hugepages-64kB/stats/swpout: 1783392
hugepages-64kB/stats/swpout_fallback: 240875
After: (-23.3% sys time, -21.3% real time)
Sys time: 35958.87, Real time: 442.69
hugepages-64kB/stats/swpout: 1866267
hugepages-64kB/stats/swpout_fallback: 158330

Sequential SWAP should also be slightly faster. Tests didn't show a
measurable difference, but at least there is no regression:

Swapin 4G zero page on ZRAM (time in us):
Before (avg. 1923756):
1912391 1927023 1927957 1916527 1918263 1914284 1934753 1940813 1921791
After (avg.
1922290): 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913 Link: https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdme= sW_59W1BWw@mail.gmail.com/[1] Suggested-by: Chris Li Signed-off-by: Kairui Song --- include/linux/swap_slots.h | 3 -- mm/swap_slots.c | 78 +++++---------------------------- mm/swapfile.c | 89 +++++++++++++++----------------------- 3 files changed, 44 insertions(+), 126 deletions(-) diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h index 15adfb8c813a..840aec3523b2 100644 --- a/include/linux/swap_slots.h +++ b/include/linux/swap_slots.h @@ -16,15 +16,12 @@ struct swap_slots_cache { swp_entry_t *slots; int nr; int cur; - spinlock_t free_lock; /* protects slots_ret, n_ret */ - swp_entry_t *slots_ret; int n_ret; }; =20 void disable_swap_slots_cache_lock(void); void reenable_swap_slots_cache_unlock(void); void enable_swap_slots_cache(void); -void free_swap_slot(swp_entry_t entry); =20 extern bool swap_slot_cache_enabled; =20 diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 13ab3b771409..9c7c171df7ba 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -43,17 +43,15 @@ static DEFINE_MUTEX(swap_slots_cache_mutex); /* Serialize swap slots cache enable/disable operations */ static DEFINE_MUTEX(swap_slots_cache_enable_mutex); =20 -static void __drain_swap_slots_cache(unsigned int type); +static void __drain_swap_slots_cache(void); =20 #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_ena= bled) -#define SLOTS_CACHE 0x1 -#define SLOTS_CACHE_RET 0x2 =20 static void deactivate_swap_slots_cache(void) { mutex_lock(&swap_slots_cache_mutex); swap_slot_cache_active =3D false; - __drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET); + __drain_swap_slots_cache(); mutex_unlock(&swap_slots_cache_mutex); } =20 @@ -72,7 +70,7 @@ void disable_swap_slots_cache_lock(void) if (swap_slot_cache_initialized) { /* serialize with cpu hotplug operations */ cpus_read_lock(); - 
__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET); + __drain_swap_slots_cache(); cpus_read_unlock(); } } @@ -113,7 +111,7 @@ static bool check_cache_active(void) static int alloc_swap_slot_cache(unsigned int cpu) { struct swap_slots_cache *cache; - swp_entry_t *slots, *slots_ret; + swp_entry_t *slots; =20 /* * Do allocation outside swap_slots_cache_mutex @@ -125,28 +123,19 @@ static int alloc_swap_slot_cache(unsigned int cpu) if (!slots) return -ENOMEM; =20 - slots_ret =3D kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_entry_t), - GFP_KERNEL); - if (!slots_ret) { - kvfree(slots); - return -ENOMEM; - } - mutex_lock(&swap_slots_cache_mutex); cache =3D &per_cpu(swp_slots, cpu); - if (cache->slots || cache->slots_ret) { + if (cache->slots) { /* cache already allocated */ mutex_unlock(&swap_slots_cache_mutex); =20 kvfree(slots); - kvfree(slots_ret); =20 return 0; } =20 if (!cache->lock_initialized) { mutex_init(&cache->alloc_lock); - spin_lock_init(&cache->free_lock); cache->lock_initialized =3D true; } cache->nr =3D 0; @@ -160,19 +149,16 @@ static int alloc_swap_slot_cache(unsigned int cpu) */ mb(); cache->slots =3D slots; - cache->slots_ret =3D slots_ret; mutex_unlock(&swap_slots_cache_mutex); return 0; } =20 -static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type, - bool free_slots) +static void drain_slots_cache_cpu(unsigned int cpu, bool free_slots) { struct swap_slots_cache *cache; - swp_entry_t *slots =3D NULL; =20 cache =3D &per_cpu(swp_slots, cpu); - if ((type & SLOTS_CACHE) && cache->slots) { + if (cache->slots) { mutex_lock(&cache->alloc_lock); swapcache_free_entries(cache->slots + cache->cur, cache->nr); cache->cur =3D 0; @@ -183,20 +169,9 @@ static void drain_slots_cache_cpu(unsigned int cpu, un= signed int type, } mutex_unlock(&cache->alloc_lock); } - if ((type & SLOTS_CACHE_RET) && cache->slots_ret) { - spin_lock_irq(&cache->free_lock); - swapcache_free_entries(cache->slots_ret, cache->n_ret); - cache->n_ret =3D 0; - if (free_slots && 
cache->slots_ret) { - slots =3D cache->slots_ret; - cache->slots_ret =3D NULL; - } - spin_unlock_irq(&cache->free_lock); - kvfree(slots); - } } =20 -static void __drain_swap_slots_cache(unsigned int type) +static void __drain_swap_slots_cache(void) { unsigned int cpu; =20 @@ -224,13 +199,13 @@ static void __drain_swap_slots_cache(unsigned int typ= e) * There are no slots on such cpu that need to be drained. */ for_each_online_cpu(cpu) - drain_slots_cache_cpu(cpu, type, false); + drain_slots_cache_cpu(cpu, false); } =20 static int free_slot_cache(unsigned int cpu) { mutex_lock(&swap_slots_cache_mutex); - drain_slots_cache_cpu(cpu, SLOTS_CACHE | SLOTS_CACHE_RET, true); + drain_slots_cache_cpu(cpu, true); mutex_unlock(&swap_slots_cache_mutex); return 0; } @@ -269,39 +244,6 @@ static int refill_swap_slots_cache(struct swap_slots_c= ache *cache) return cache->nr; } =20 -void free_swap_slot(swp_entry_t entry) -{ - struct swap_slots_cache *cache; - - /* Large folio swap slot is not covered. */ - zswap_invalidate(entry); - - cache =3D raw_cpu_ptr(&swp_slots); - if (likely(use_swap_slot_cache && cache->slots_ret)) { - spin_lock_irq(&cache->free_lock); - /* Swap slots cache may be deactivated before acquiring lock */ - if (!use_swap_slot_cache || !cache->slots_ret) { - spin_unlock_irq(&cache->free_lock); - goto direct_free; - } - if (cache->n_ret >=3D SWAP_SLOTS_CACHE_SIZE) { - /* - * Return slots to global pool. - * The current swap_map value is SWAP_HAS_CACHE. 
- * Set it to 0 to indicate it is available for - * allocation in global pool - */ - swapcache_free_entries(cache->slots_ret, cache->n_ret); - cache->n_ret =3D 0; - } - cache->slots_ret[cache->n_ret++] =3D entry; - spin_unlock_irq(&cache->free_lock); - } else { -direct_free: - swapcache_free_entries(&entry, 1); - } -} - swp_entry_t folio_alloc_swap(struct folio *folio) { swp_entry_t entry; diff --git a/mm/swapfile.c b/mm/swapfile.c index 793b2fd1a2a8..b3154e52cb45 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -53,14 +53,15 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); -static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t= entry, - unsigned int nr_pages); +static void swap_entry_range_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static bool folio_swapcache_freeable(struct folio *folio); static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, unsigned long offset); -static void unlock_cluster(struct swap_cluster_info *ci); +static inline void unlock_cluster(struct swap_cluster_info *ci); =20 static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; @@ -261,10 +262,9 @@ static int __try_to_reclaim_swap(struct swap_info_stru= ct *si, folio_ref_sub(folio, nr_pages); folio_set_dirty(folio); =20 - /* Only sinple page folio can be backed by zswap */ - if (nr_pages =3D=3D 1) - zswap_invalidate(entry); - swap_entry_range_free(si, entry, nr_pages); + ci =3D lock_cluster(si, offset); + swap_entry_range_free(si, ci, entry, nr_pages); + unlock_cluster(ci); ret =3D nr_pages; out_unlock: folio_unlock(folio); @@ -1128,8 +1128,10 @@ static void swap_range_free(struct swap_info_struct = *si, unsigned long offset, * Use atomic clear_bit operations only on zeromap instead 
of non-atomic * bitmap_clear to prevent adjacent bits corruption due to simultaneous w= rites. */ - for (i =3D 0; i < nr_entries; i++) + for (i =3D 0; i < nr_entries; i++) { clear_bit(offset + i, si->zeromap); + zswap_invalidate(swp_entry(si->type, offset + i)); + } =20 if (si->flags & SWP_BLKDEV) swap_slot_free_notify =3D @@ -1434,9 +1436,9 @@ static unsigned char __swap_entry_free(struct swap_in= fo_struct *si, =20 ci =3D lock_cluster(si, offset); usage =3D __swap_entry_free_locked(si, offset, 1); - unlock_cluster(ci); if (!usage) - free_swap_slot(entry); + swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + unlock_cluster(ci); =20 return usage; } @@ -1464,13 +1466,10 @@ static bool __swap_entries_free(struct swap_info_st= ruct *si, } for (i =3D 0; i < nr; i++) WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); + if (!has_cache) + swap_entry_range_free(si, ci, entry, nr); unlock_cluster(ci); =20 - if (!has_cache) { - for (i =3D 0; i < nr; i++) - zswap_invalidate(swp_entry(si->type, offset + i)); - swap_entry_range_free(si, entry, nr); - } return has_cache; =20 fallback: @@ -1490,15 +1489,13 @@ static bool __swap_entries_free(struct swap_info_st= ruct *si, * Drop the last HAS_CACHE flag of swap entries, caller have to * ensure all entries belong to the same cgroup. 
*/ -static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t= entry, - unsigned int nr_pages) +static void swap_entry_range_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr_pages) { unsigned long offset =3D swp_offset(entry); unsigned char *map =3D si->swap_map + offset; unsigned char *map_end =3D map + nr_pages; - struct swap_cluster_info *ci; - - ci =3D lock_cluster(si, offset); =20 /* It should never free entries across different clusters */ VM_BUG_ON(ci !=3D offset_to_cluster(si, offset + nr_pages - 1)); @@ -1518,7 +1515,6 @@ static void swap_entry_range_free(struct swap_info_st= ruct *si, swp_entry_t entry free_cluster(si, ci); else partial_free_cluster(si, ci); - unlock_cluster(ci); } =20 static void cluster_swap_free_nr(struct swap_info_struct *si, @@ -1526,28 +1522,13 @@ static void cluster_swap_free_nr(struct swap_info_s= truct *si, unsigned char usage) { struct swap_cluster_info *ci; - DECLARE_BITMAP(to_free, BITS_PER_LONG) =3D { 0 }; - int i, nr; + unsigned long end =3D offset + nr_pages; =20 ci =3D lock_cluster(si, offset); - while (nr_pages) { - nr =3D min(BITS_PER_LONG, nr_pages); - for (i =3D 0; i < nr; i++) { - if (!__swap_entry_free_locked(si, offset + i, usage)) - bitmap_set(to_free, i, 1); - } - if (!bitmap_empty(to_free, BITS_PER_LONG)) { - unlock_cluster(ci); - for_each_set_bit(i, to_free, BITS_PER_LONG) - free_swap_slot(swp_entry(si->type, offset + i)); - if (nr =3D=3D nr_pages) - return; - bitmap_clear(to_free, 0, BITS_PER_LONG); - ci =3D lock_cluster(si, offset); - } - offset +=3D nr; - nr_pages -=3D nr; - } + do { + if (!__swap_entry_free_locked(si, offset, usage)) + swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + } while (++offset < end); unlock_cluster(ci); } =20 @@ -1588,18 +1569,12 @@ void put_swap_folio(struct folio *folio, swp_entry_= t entry) return; =20 ci =3D lock_cluster(si, offset); - if (size > 1 && swap_is_has_cache(si, offset, size)) { - 
unlock_cluster(ci); - swap_entry_range_free(si, entry, size); - return; - } - for (int i =3D 0; i < size; i++, entry.val++) { - if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) { - unlock_cluster(ci); - free_swap_slot(entry); - if (i =3D=3D size - 1) - return; - lock_cluster(si, offset); + if (swap_is_has_cache(si, offset, size)) + swap_entry_range_free(si, ci, entry, size); + else { + for (int i =3D 0; i < size; i++, entry.val++) { + if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) + swap_entry_range_free(si, ci, entry, 1); } } unlock_cluster(ci); @@ -1608,6 +1583,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t = entry) void swapcache_free_entries(swp_entry_t *entries, int n) { int i; + struct swap_cluster_info *ci; struct swap_info_struct *si =3D NULL; =20 if (n <=3D 0) @@ -1615,8 +1591,11 @@ void swapcache_free_entries(swp_entry_t *entries, in= t n) =20 for (i =3D 0; i < n; ++i) { si =3D _swap_info_get(entries[i]); - if (si) - swap_entry_range_free(si, entries[i], 1); + if (si) { + ci =3D lock_cluster(si, swp_offset(entries[i])); + swap_entry_range_free(si, ci, entries[i], 1); + unlock_cluster(ci); + } } } =20 --=20 2.47.1
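
Note on the calling convention changed by this patch: the old
swap_entry_range_free() took the cluster lock itself, so callers such as
put_swap_folio() had to drop the lock before calling it. The new version
receives ci from a caller that already holds the lock. The snippet below
is only a userspace sketch of that difference, not kernel code. The type
struct cluster, its fields, and every function name in it are
hypothetical stand-ins, and the "lock" is a plain flag with a counter so
the number of acquisitions can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct swap_cluster_info; not kernel code. */
struct cluster {
	bool locked;           /* models the cluster lock being held */
	unsigned int lock_ops; /* lock acquisitions so far, for illustration */
	unsigned int in_use;   /* entries still allocated in this cluster */
};

static void cluster_lock(struct cluster *ci)
{
	assert(!ci->locked);
	ci->locked = true;
	ci->lock_ops++;
}

static void cluster_unlock(struct cluster *ci)
{
	assert(ci->locked);
	ci->locked = false;
}

/* Old convention: the callee takes and drops the lock itself. */
static void range_free_self_locking(struct cluster *ci, unsigned int nr)
{
	cluster_lock(ci);
	ci->in_use -= nr;
	cluster_unlock(ci);
}

/* New convention: the caller must already hold the lock. */
static void range_free_locked(struct cluster *ci, unsigned int nr)
{
	assert(ci->locked);
	ci->in_use -= nr;
}

/* Free nr entries one at a time, old style: one lock round trip each. */
static unsigned int free_each_old(struct cluster *ci, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++)
		range_free_self_locking(ci, 1);
	return ci->lock_ops;
}

/* Same work, new style: one lock round trip covers the whole loop. */
static unsigned int free_each_new(struct cluster *ci, unsigned int nr)
{
	cluster_lock(ci);
	for (unsigned int i = 0; i < nr; i++)
		range_free_locked(ci, 1);
	cluster_unlock(ci);
	return ci->lock_ops;
}
```

In this model, freeing eight entries costs eight acquisitions with the
self-locking helper and one with the caller-locked helper. The real
benefit in the kernel depends on contention and on which caller is
involved, so treat the counts as an illustration of the pattern only.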