From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, "Huang, Ying", Tim Chen, Nhat Pham,
	linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 01/13] mm, swap: minor clean up for swap entry allocation
Date: Wed, 23 Oct 2024 03:24:39 +0800
Message-ID: <20241022192451.38138-2-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>

From: Kairui Song

Direct reclaim can now skip the whole folio after reclaiming a set of
folio-based slots, instead of re-checking the reclaimed range slot by
slot. Also simplify the allocation code and reduce indentation.
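To illustrate, here is a minimal stand-alone sketch of the new scan
flow, with hypothetical slot values and a stub reclaim() in place of
__try_to_reclaim_swap(); it is not the kernel code, only the
control-flow idea:

#include <stddef.h>

#define SLOT_FREE   0
#define SLOT_CACHED 1	/* stands in for SWAP_HAS_CACHE */

/* Stub: pretend a 4-slot folio backed this range and was freed. */
static int reclaim(unsigned char *map, size_t offset)
{
	for (int i = 0; i < 4; i++)
		map[offset + i] = SLOT_FREE;
	return 4;	/* slots freed; 0 would mean failure */
}

/* Scan [start, end); on reclaim success, skip the reclaimed range. */
static size_t scan_range(unsigned char *map, size_t start, size_t end)
{
	size_t offset = start;
	int nr_reclaim;

	do {
		switch (map[offset]) {
		case SLOT_FREE:
			offset++;
			break;
		case SLOT_CACHED:
			nr_reclaim = reclaim(map, offset);
			if (nr_reclaim > 0)
				offset += nr_reclaim;	/* skip whole folio */
			else
				return offset;		/* reclaim failed, stop */
			break;
		default:
			return offset;			/* slot busy, stop */
		}
	} while (offset < end);
	return offset;
}

On a successful reclaim the loop jumps past the whole reclaimed folio
instead of re-checking each of its slots.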
Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/swapfile.c | 59 +++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46bd4b1a3c07..1128cea95c47 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -604,23 +604,28 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 				  unsigned long start, unsigned long end)
 {
 	unsigned char *map = si->swap_map;
-	unsigned long offset;
+	unsigned long offset = start;
+	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
 	spin_unlock(&si->lock);
 
-	for (offset = start; offset < end; offset++) {
+	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
-			continue;
+			offset++;
+			break;
 		case SWAP_HAS_CACHE:
-			if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0)
-				continue;
-			goto out;
+			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
+			if (nr_reclaim > 0)
+				offset += nr_reclaim;
+			else
+				goto out;
+			break;
 		default:
 			goto out;
 		}
-	}
+	} while (offset < end);
 out:
 	spin_lock(&si->lock);
 	spin_lock(&ci->lock);
@@ -826,35 +831,30 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 						  &found, order, usage);
 			frags++;
 			if (found)
-				break;
+				goto done;
 		}
 
-		if (!found) {
+		/*
+		 * Nonfull clusters are moved to frag tail if we reached
+		 * here, count them too, don't over scan the frag list.
+		 */
+		while (frags < si->frag_cluster_nr[order]) {
+			ci = list_first_entry(&si->frag_clusters[order],
+					      struct swap_cluster_info, list);
 			/*
-			 * Nonfull clusters are moved to frag tail if we reached
-			 * here, count them too, don't over scan the frag list.
+			 * Rotate the frag list to iterate, they were all failing
+			 * high order allocation or moved here due to per-CPU usage,
+			 * this help keeping usable cluster ahead.
 			 */
-			while (frags < si->frag_cluster_nr[order]) {
-				ci = list_first_entry(&si->frag_clusters[order],
-						      struct swap_cluster_info, list);
-				/*
-				 * Rotate the frag list to iterate, they were all failing
-				 * high order allocation or moved here due to per-CPU usage,
-				 * this help keeping usable cluster ahead.
-				 */
-				list_move_tail(&ci->list, &si->frag_clusters[order]);
-				offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-								 &found, order, usage);
-				frags++;
-				if (found)
-					break;
-			}
+			list_move_tail(&ci->list, &si->frag_clusters[order]);
+			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+							 &found, order, usage);
+			frags++;
+			if (found)
+				goto done;
 		}
 	}
 
-	if (found)
-		goto done;
-
 	if (!list_empty(&si->discard_clusters)) {
 		/*
 		 * we don't have free cluster but have some clusters in
@@ -892,7 +892,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			goto done;
 		}
 	}
-
 done:
 	cluster->next[order] = offset;
 	return found;
-- 
2.47.0

From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, "Huang, Ying", Tim Chen, Nhat Pham,
	linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 02/13] mm, swap: fold swap_info_get_cont in the only caller
Date: Wed, 23 Oct 2024 03:24:40 +0800
Message-ID: <20241022192451.38138-3-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>

From: Kairui Song

The name of this function is confusing, and the code is much easier to
follow after folding it into its only caller. Also rename the unhelpful
variable name "p" to the more meaningful "si".
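The locking pattern that remains after the fold can be pictured with a
small user-space sketch (hypothetical device type, pthread mutexes
standing in for si->lock; not the kernel code):

#include <pthread.h>
#include <stddef.h>

struct device {
	pthread_mutex_t lock;
};

/*
 * dev_of_entry[] is sorted so entries sharing a device are adjacent;
 * the lock is then taken and dropped once per device, not per entry.
 */
static void free_entries(struct device **dev_of_entry, size_t n)
{
	struct device *prev = NULL;

	for (size_t i = 0; i < n; i++) {
		struct device *d = dev_of_entry[i];

		if (d != prev) {
			if (prev)
				pthread_mutex_unlock(&prev->lock);
			if (d)
				pthread_mutex_lock(&d->lock);
		}
		/* ... free entry i here, under d->lock ... */
		prev = d;
	}
	if (prev)
		pthread_mutex_unlock(&prev->lock);
}

Because the entries are pre-sorted by device, each device's lock is
taken and released at most once.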
Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 mm/swapfile.c | 39 +++++++++++++++------------------------
 1 file changed, 15 insertions(+), 24 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1128cea95c47..e1e4a1ba4fc5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1359,22 +1359,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
-static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
-					struct swap_info_struct *q)
-{
-	struct swap_info_struct *p;
-
-	p = _swap_info_get(entry);
-
-	if (p != q) {
-		if (q != NULL)
-			spin_unlock(&q->lock);
-		if (p != NULL)
-			spin_lock(&p->lock);
-	}
-	return p;
-}
-
 static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
 					      unsigned long offset,
 					      unsigned char usage)
@@ -1671,14 +1655,14 @@ static int swp_entry_cmp(const void *ent1, const void *ent2)
 
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
-	struct swap_info_struct *p, *prev;
+	struct swap_info_struct *si, *prev;
 	int i;
 
 	if (n <= 0)
 		return;
 
 	prev = NULL;
-	p = NULL;
+	si = NULL;
 
 	/*
 	 * Sort swap entries by swap device, so each lock is only taken once.
@@ -1688,13 +1672,20 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
 	if (nr_swapfiles > 1)
 		sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
 	for (i = 0; i < n; ++i) {
-		p = swap_info_get_cont(entries[i], prev);
-		if (p)
-			swap_entry_range_free(p, entries[i], 1);
-		prev = p;
+		si = _swap_info_get(entries[i]);
+
+		if (si != prev) {
+			if (prev != NULL)
+				spin_unlock(&prev->lock);
+			if (si != NULL)
+				spin_lock(&si->lock);
+		}
+		if (si)
+			swap_entry_range_free(si, entries[i], 1);
+		prev = si;
 	}
-	if (p)
-		spin_unlock(&p->lock);
+	if (si)
+		spin_unlock(&si->lock);
 }
 
 int __swap_count(swp_entry_t entry)
-- 
2.47.0

From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, "Huang, Ying", Tim Chen, Nhat Pham,
	linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 03/13] mm, swap: remove old allocation path for HDD
Date: Wed, 23 Oct 2024 03:24:41 +0800
Message-ID: <20241022192451.38138-4-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>

From: Kairui Song

We are currently using different swap allocation algorithms for HDD and
non-HDD devices. This leads to two different sets of locking, and a
heavily bloated code path, causing trouble for further optimization and
maintenance.

This commit removes the HDD swap allocation path and related dead code,
and uses the cluster allocation algorithm instead. Performance may drop
a little temporarily, but the impact should be negligible: the main
advantage of the legacy HDD allocation algorithm is that it tends to
use contiguous slots, but a swap device gets fragmented quickly anyway,
so the attempt to use contiguous slots fails easily.

This commit also enables mTHP swap on HDD, which should be beneficial,
and following commits will adapt and optimize the cluster allocator
for HDD.
Suggested-by: Chris Li
Suggested-by: "Huang, Ying"
Signed-off-by: Kairui Song
---
 include/linux/swap.h |   3 -
 mm/swapfile.c        | 235 ++-----------------------------------------
 2 files changed, 9 insertions(+), 229 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f3e0ac20c2e8..3a71198a6957 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -309,9 +309,6 @@ struct swap_info_struct {
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
 	unsigned int inuse_pages;	/* number of those currently in use */
-	unsigned int cluster_next;	/* likely index for next allocation */
-	unsigned int cluster_nr;	/* countdown to next cluster search */
-	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e1e4a1ba4fc5..ffdf7eedecb5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -989,49 +989,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
 }
 
-static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
-{
-	unsigned long prev;
-
-	if (!(si->flags & SWP_SOLIDSTATE)) {
-		si->cluster_next = next;
-		return;
-	}
-
-	prev = this_cpu_read(*si->cluster_next_cpu);
-	/*
-	 * Cross the swap address space size aligned trunk, choose
-	 * another trunk randomly to avoid lock contention on swap
-	 * address space if possible.
-	 */
-	if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
-	    (next >> SWAP_ADDRESS_SPACE_SHIFT)) {
-		/* No free swap slots available */
-		if (si->highest_bit <= si->lowest_bit)
-			return;
-		next = get_random_u32_inclusive(si->lowest_bit, si->highest_bit);
-		next = ALIGN_DOWN(next, SWAP_ADDRESS_SPACE_PAGES);
-		next = max_t(unsigned int, next, si->lowest_bit);
-	}
-	this_cpu_write(*si->cluster_next_cpu, next);
-}
-
-static bool swap_offset_available_and_locked(struct swap_info_struct *si,
-					     unsigned long offset)
-{
-	if (data_race(!si->swap_map[offset])) {
-		spin_lock(&si->lock);
-		return true;
-	}
-
-	if (vm_swap_full() && READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
-		spin_lock(&si->lock);
-		return true;
-	}
-
-	return false;
-}
-
 static int cluster_alloc_swap(struct swap_info_struct *si,
 			     unsigned char usage, int nr,
 			     swp_entry_t slots[], int order)
@@ -1055,13 +1012,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
 			       swp_entry_t slots[], int order)
 {
-	unsigned long offset;
-	unsigned long scan_base;
-	unsigned long last_in_cluster = 0;
-	int latency_ration = LATENCY_LIMIT;
 	unsigned int nr_pages = 1 << order;
-	int n_ret = 0;
-	bool scanned_many = false;
 
 	/*
 	 * We try to cluster swap pages by allocating them sequentially
@@ -1073,7 +1024,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	 * But we do now try to find an empty cluster.  -Andrea
 	 * And we let swap pages go all over an SSD partition.  Hugh
 	 */
-
 	if (order > 0) {
 		/*
 		 * Should not even be attempting large allocations when huge
@@ -1093,158 +1043,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		return 0;
 	}
 
-	if (si->cluster_info)
-		return cluster_alloc_swap(si, usage, nr, slots, order);
-
-	si->flags += SWP_SCANNING;
-
-	/* For HDD, sequential access is more important. */
-	scan_base = si->cluster_next;
-	offset = scan_base;
-
-	if (unlikely(!si->cluster_nr--)) {
-		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
-			si->cluster_nr = SWAPFILE_CLUSTER - 1;
-			goto checks;
-		}
-
-		spin_unlock(&si->lock);
-
-		/*
-		 * If seek is expensive, start searching for new cluster from
-		 * start of partition, to minimize the span of allocated swap.
-		 */
-		scan_base = offset = si->lowest_bit;
-		last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
-
-		/* Locate the first empty (unaligned) cluster */
-		for (; last_in_cluster <= READ_ONCE(si->highest_bit); offset++) {
-			if (si->swap_map[offset])
-				last_in_cluster = offset + SWAPFILE_CLUSTER;
-			else if (offset == last_in_cluster) {
-				spin_lock(&si->lock);
-				offset -= SWAPFILE_CLUSTER - 1;
-				si->cluster_next = offset;
-				si->cluster_nr = SWAPFILE_CLUSTER - 1;
-				goto checks;
-			}
-			if (unlikely(--latency_ration < 0)) {
-				cond_resched();
-				latency_ration = LATENCY_LIMIT;
-			}
-		}
-
-		offset = scan_base;
-		spin_lock(&si->lock);
-		si->cluster_nr = SWAPFILE_CLUSTER - 1;
-	}
-
-checks:
-	if (!(si->flags & SWP_WRITEOK))
-		goto no_page;
-	if (!si->highest_bit)
-		goto no_page;
-	if (offset > si->highest_bit)
-		scan_base = offset = si->lowest_bit;
-
-	/* reuse swap entry of cache-only swap if not busy. */
-	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
-		int swap_was_freed;
-		spin_unlock(&si->lock);
-		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
-		spin_lock(&si->lock);
-		/* entry was freed successfully, try to use this again */
-		if (swap_was_freed > 0)
-			goto checks;
-		goto scan; /* check next one */
-	}
-
-	if (si->swap_map[offset]) {
-		if (!n_ret)
-			goto scan;
-		else
-			goto done;
-	}
-	memset(si->swap_map + offset, usage, nr_pages);
-
-	swap_range_alloc(si, offset, nr_pages);
-	slots[n_ret++] = swp_entry(si->type, offset);
-
-	/* got enough slots or reach max slots? */
-	if ((n_ret == nr) || (offset >= si->highest_bit))
-		goto done;
-
-	/* search for next available slot */
-
-	/* time to take a break? */
-	if (unlikely(--latency_ration < 0)) {
-		if (n_ret)
-			goto done;
-		spin_unlock(&si->lock);
-		cond_resched();
-		spin_lock(&si->lock);
-		latency_ration = LATENCY_LIMIT;
-	}
-
-	if (si->cluster_nr && !si->swap_map[++offset]) {
-		/* non-ssd case, still more slots in cluster? */
-		--si->cluster_nr;
-		goto checks;
-	}
-
-	/*
-	 * Even if there's no free clusters available (fragmented),
-	 * try to scan a little more quickly with lock held unless we
-	 * have scanned too many slots already.
-	 */
-	if (!scanned_many) {
-		unsigned long scan_limit;
-
-		if (offset < scan_base)
-			scan_limit = scan_base;
-		else
-			scan_limit = si->highest_bit;
-		for (; offset <= scan_limit && --latency_ration > 0;
-		     offset++) {
-			if (!si->swap_map[offset])
-				goto checks;
-		}
-	}
-
-done:
-	if (order == 0)
-		set_cluster_next(si, offset + 1);
-	si->flags -= SWP_SCANNING;
-	return n_ret;
-
-scan:
-	VM_WARN_ON(order > 0);
-	spin_unlock(&si->lock);
-	while (++offset <= READ_ONCE(si->highest_bit)) {
-		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
-			latency_ration = LATENCY_LIMIT;
-			scanned_many = true;
-		}
-		if (swap_offset_available_and_locked(si, offset))
-			goto checks;
-	}
-	offset = si->lowest_bit;
-	while (offset < scan_base) {
-		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
-			latency_ration = LATENCY_LIMIT;
-			scanned_many = true;
-		}
-		if (swap_offset_available_and_locked(si, offset))
-			goto checks;
-		offset++;
-	}
-	spin_lock(&si->lock);
-
-no_page:
-	si->flags -= SWP_SCANNING;
-	return n_ret;
+	return cluster_alloc_swap(si, usage, nr, slots, order);
 }
 
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
@@ -2855,8 +2654,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
-	free_percpu(p->cluster_next_cpu);
-	p->cluster_next_cpu = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
 	kvfree(cluster_info);
@@ -3168,8 +2965,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 	}
 
 	si->lowest_bit	 = 1;
-	si->cluster_next = 1;
-	si->cluster_nr = 0;
 
 	maxpages = swapfile_maximum_size;
 	last_page = swap_header->info.last_page;
@@ -3255,7 +3050,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 						unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
 	struct swap_cluster_info *cluster_info;
 	unsigned long i, j, k, idx;
 	int cpu, err = -ENOMEM;
@@ -3267,15 +3061,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	si->cluster_next_cpu = alloc_percpu(unsigned int);
-	if (!si->cluster_next_cpu)
-		goto err_free;
-
-	/* Random start position to help with wear leveling */
-	for_each_possible_cpu(cpu)
-		per_cpu(*si->cluster_next_cpu, cpu) =
-			get_random_u32_inclusive(1, si->highest_bit);
-
 	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
 	if (!si->percpu_cluster)
 		goto err_free;
@@ -3317,7 +3102,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	 * sharing same address space.
 	 */
 	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
-		j = (k + col) % SWAP_CLUSTER_COLS;
+		j = k % SWAP_CLUSTER_COLS;
 		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
 			struct swap_cluster_info *ci;
 			idx = i * SWAP_CLUSTER_COLS + j;
@@ -3467,18 +3252,18 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	if (si->bdev && bdev_nonrot(si->bdev)) {
 		si->flags |= SWP_SOLIDSTATE;
-
-		cluster_info = setup_clusters(si, swap_header, maxpages);
-		if (IS_ERR(cluster_info)) {
-			error = PTR_ERR(cluster_info);
-			cluster_info = NULL;
-			goto bad_swap_unlock_inode;
-		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
 	}
 
+	cluster_info = setup_clusters(si, swap_header, maxpages);
+	if (IS_ERR(cluster_info)) {
+		error = PTR_ERR(cluster_info);
+		cluster_info = NULL;
+		goto bad_swap_unlock_inode;
+	}
+
 	if ((swap_flags & SWAP_FLAG_DISCARD) && si->bdev &&
 	    bdev_max_discard_sectors(si->bdev)) {
 		/*
@@ -3559,8 +3344,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap:
 	free_percpu(si->percpu_cluster);
 	si->percpu_cluster = NULL;
-	free_percpu(si->cluster_next_cpu);
-	si->cluster_next_cpu = NULL;
 	inode = NULL;
 	destroy_swap_extents(si);
 	swap_cgroup_swapoff(si->type);
-- 
2.47.0

From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, "Huang, Ying", Tim Chen, Nhat Pham,
	linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 04/13] mm, swap: use cluster lock for HDD
Date: Wed, 23 Oct 2024 03:24:42 +0800
Message-ID: <20241022192451.38138-5-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>

From: Kairui Song

The cluster lock (ci->lock) was introduced to reduce contention for
certain operations. Using the cluster lock for HDD is not helpful
because HDD performance is poor, so locking isn't the bottleneck there.
But having different sets of locks for HDD and non-HDD devices prevents
further rework of the device lock (si->lock).

This commit just changes all lock_cluster_or_swap_info calls to
lock_cluster, which is a safe and straightforward conversion since
cluster info is always allocated now; it also removes all cluster_info
related checks.
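The conversion works because every slot now maps to exactly one cluster
with its own lock, so there is no coarse-lock fallback left. A small
user-space sketch of that mapping (hypothetical constants, a pthread
mutex standing in for the kernel spinlock):

#include <pthread.h>

#define CLUSTER_SHIFT 9
#define CLUSTER_SIZE  (1 << CLUSTER_SHIFT)	/* 512 slots, like SWAPFILE_CLUSTER */

struct cluster {
	pthread_mutex_t lock;
	unsigned int count;
};

/* One fine-grained lock per 512-slot cluster; no device-lock fallback. */
static struct cluster *lock_cluster(struct cluster *table, unsigned long offset)
{
	struct cluster *ci = &table[offset >> CLUSTER_SHIFT];

	pthread_mutex_lock(&ci->lock);
	return ci;
}

static void unlock_cluster(struct cluster *ci)
{
	pthread_mutex_unlock(&ci->lock);
}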
Suggested-by: Chris Li
Signed-off-by: Kairui Song
---
 mm/swapfile.c | 107 ++++++++++++++++----------------------------------
 1 file changed, 34 insertions(+), 73 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index ffdf7eedecb5..f8e70bb5f1d7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,10 +58,9 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
-static struct swap_cluster_info *lock_cluster_or_swap_info(
-		struct swap_info_struct *si, unsigned long offset);
-static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
-					struct swap_cluster_info *ci);
+static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
+					      unsigned long offset);
+static void unlock_cluster(struct swap_cluster_info *ci);
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -222,9 +221,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * swap_map is HAS_CACHE only, which means the slots have no page table
 	 * reference or pending writeback, and can't be allocated to others.
 	 */
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	need_reclaim = swap_is_has_cache(si, offset, nr_pages);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	if (!need_reclaim)
 		goto out_unlock;
 
@@ -404,45 +403,15 @@ static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si
 {
 	struct swap_cluster_info *ci;
 
-	ci = si->cluster_info;
-	if (ci) {
-		ci += offset / SWAPFILE_CLUSTER;
-		spin_lock(&ci->lock);
-	}
-	return ci;
-}
-
-static inline void unlock_cluster(struct swap_cluster_info *ci)
-{
-	if (ci)
-		spin_unlock(&ci->lock);
-}
-
-/*
- * Determine the locking method in use for this device.  Return
- * swap_cluster_info if SSD-style cluster-based locking is in place.
- */
-static inline struct swap_cluster_info *lock_cluster_or_swap_info(
-		struct swap_info_struct *si, unsigned long offset)
-{
-	struct swap_cluster_info *ci;
-
-	/* Try to use fine-grained SSD-style locking if available: */
-	ci = lock_cluster(si, offset);
-	/* Otherwise, fall back to traditional, coarse locking: */
-	if (!ci)
-		spin_lock(&si->lock);
+	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	spin_lock(&ci->lock);
 
 	return ci;
 }
 
-static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
-					       struct swap_cluster_info *ci)
+static inline void unlock_cluster(struct swap_cluster_info *ci)
 {
-	if (ci)
-		unlock_cluster(ci);
-	else
-		spin_unlock(&si->lock);
+	spin_unlock(&ci->lock);
 }
 
 /* Add a cluster to discard list and schedule it to do discard */
@@ -558,9 +527,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 	struct swap_cluster_info *ci;
 
-	if (!cluster_info)
-		return;
-
 	ci = cluster_info + idx;
 	ci->count++;
 
@@ -576,9 +542,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 static void dec_cluster_info_page(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci, int nr_pages)
 {
-	if (!si->cluster_info)
-		return;
-
 	VM_BUG_ON(ci->count < nr_pages);
 	VM_BUG_ON(cluster_is_free(ci));
 	lockdep_assert_held(&si->lock);
@@ -995,8 +958,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 {
 	int n_ret = 0;
 
-	VM_BUG_ON(!si->cluster_info);
-
 	while (n_ret < nr) {
 		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
 
@@ -1036,10 +997,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	}
 
 	/*
-	 * Swapfile is not block device or not using clusters so unable
+	 * Swapfile is not block device so unable
 	 * to allocate large entries.
 	 */
-	if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
+	if (!(si->flags & SWP_BLKDEV))
 		return 0;
 	}
 
@@ -1279,9 +1240,9 @@ static unsigned char __swap_entry_free(struct swap_info_struct *si,
 	unsigned long offset = swp_offset(entry);
 	unsigned char usage;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	usage = __swap_entry_free_locked(si, offset, 1);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	if (!usage)
 		free_swap_slot(entry);
 
@@ -1304,14 +1265,14 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
 		goto fallback;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	if (!swap_is_last_map(si, offset, nr, &has_cache)) {
-		unlock_cluster_or_swap_info(si, ci);
+		unlock_cluster(ci);
 		goto fallback;
 	}
 	for (i = 0; i < nr; i++)
 		WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 
 	if (!has_cache) {
 		for (i = 0; i < nr; i++)
@@ -1367,7 +1328,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
 	int i, nr;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	while (nr_pages) {
 		nr = min(BITS_PER_LONG, nr_pages);
 		for (i = 0; i < nr; i++) {
@@ -1375,18 +1336,18 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 			bitmap_set(to_free, i, 1);
 		}
 		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
-			unlock_cluster_or_swap_info(si, ci);
+			unlock_cluster(ci);
 			for_each_set_bit(i, to_free, BITS_PER_LONG)
 				free_swap_slot(swp_entry(si->type, offset + i));
 			if (nr == nr_pages)
 				return;
 			bitmap_clear(to_free, 0, BITS_PER_LONG);
-			ci = lock_cluster_or_swap_info(si, offset);
+			ci = lock_cluster(si, offset);
 		}
 		offset += nr;
 		nr_pages -= nr;
 	}
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 }
 
 /*
@@ -1425,9 +1386,9 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	if (!si)
 		return;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	if (size > 1 && swap_is_has_cache(si, offset, size)) {
-		unlock_cluster_or_swap_info(si, ci);
+		unlock_cluster(ci);
 		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, size);
 		spin_unlock(&si->lock);
@@ -1435,14 +1396,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	}
 	for (int i = 0; i < size; i++, entry.val++) {
 		if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
-			unlock_cluster_or_swap_info(si, ci);
+			unlock_cluster(ci);
 			free_swap_slot(entry);
 			if (i == size - 1)
 				return;
-			lock_cluster_or_swap_info(si, offset);
+			lock_cluster(si, offset);
 		}
 	}
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 }
 
 static int swp_entry_cmp(const void *ent1, const void *ent2)
@@ -1506,9 +1467,9 @@ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 	struct swap_cluster_info *ci;
 	int count;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	count = swap_count(si->swap_map[offset]);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return count;
 }
 
@@ -1531,7 +1492,7 @@ int swp_swapcount(swp_entry_t entry)
 
 	offset = swp_offset(entry);
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 
 	count = swap_count(si->swap_map[offset]);
 	if (!(count & COUNT_CONTINUED))
@@ -1554,7 +1515,7 @@ int swp_swapcount(swp_entry_t entry)
 		n *= (SWAP_CONT_MAX + 1);
 	} while (tmp_count & COUNT_CONTINUED);
out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return count;
 }
 
@@ -1569,8 +1530,8 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 	int i;
 	bool ret = false;
 
-	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || nr_pages == 1) {
+	ci = lock_cluster(si, offset);
+	if (nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
@@ -1582,7 +1543,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 		}
 	}
unlock_out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return ret;
 }
 
@@ -3412,7 +3373,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	offset = swp_offset(entry);
 	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
 	VM_WARN_ON(usage == 1 && nr > 1);
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 
 	err = 0;
 	for (i = 0; i < nr; i++) {
@@ -3467,7 +3428,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	}
 
unlock_out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return err;
 }
 
-- 
2.47.0

From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, "Huang, Ying", Tim Chen, Nhat Pham,
	linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 05/13] mm, swap: clean up device availability check
Date: Wed, 23 Oct 2024 03:24:43 +0800
Message-ID: <20241022192451.38138-6-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>

From: Kairui Song

Remove highest_bit and lowest_bit. Now that the HDD allocation path is
gone, the only remaining purpose of these two fields is to judge
whether the device is full, which can be done by checking inuse_pages
instead.
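The fullness test then needs nothing but the two counters; a trivial
sketch with a hypothetical struct (not the kernel swap_info_struct):

struct swapdev {
	unsigned int pages;		/* total usable pages */
	unsigned int inuse_pages;	/* currently allocated */
};

/* Full exactly when every usable page is in use; no bit bookkeeping. */
static inline int swapdev_full(const struct swapdev *si)
{
	return si->inuse_pages == si->pages;
}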
Signed-off-by: Kairui Song
Suggested-by: Chris Li
---
 fs/btrfs/inode.c     |  1 -
 fs/iomap/swapfile.c  |  1 -
 include/linux/swap.h |  2 --
 mm/page_io.c         |  1 -
 mm/swapfile.c        | 38 ++++++++------------------------------
 5 files changed, 8 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5618ca02934a..aba9c0d58998 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10023,7 +10023,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	*span = bsi.highest_ppage - bsi.lowest_ppage + 1;
 	sis->max = bsi.nr_pages;
 	sis->pages = bsi.nr_pages - 1;
-	sis->highest_bit = bsi.nr_pages - 1;
 	return bsi.nr_extents;
 }
 #else
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index 5fc0ac36dee3..b90d0eda9e51 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -189,7 +189,6 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 	*pagespan = 1 + isi.highest_ppage - isi.lowest_ppage;
 	sis->max = isi.nr_pages;
 	sis->pages = isi.nr_pages - 1;
-	sis->highest_bit = isi.nr_pages - 1;
 	return isi.nr_extents;
 }
 EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3a71198a6957..c0d49dad7a4b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -305,8 +305,6 @@ struct swap_info_struct {
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
 	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
-	unsigned int lowest_bit;	/* index of first free in swap_map */
-	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
 	unsigned int inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
diff --git a/mm/page_io.c b/mm/page_io.c
index a28d28b6b3ce..c8a25203bcf4 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -163,7 +163,6 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 	page_no = 1;	/* force Empty message */
 	sis->max = page_no;
 	sis->pages = page_no - 1;
-	sis->highest_bit = page_no - 1;
out:
 	return ret;
bad_bmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f8e70bb5f1d7..e620b41c3120 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -55,7 +55,7 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 static void free_swap_count_continuations(struct swap_info_struct *);
 static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
 				  unsigned int nr_pages);
-static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
 static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
@@ -647,7 +647,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	}
 
 	memset(si->swap_map + start, usage, nr_pages);
-	swap_range_alloc(si, start, nr_pages);
+	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
 	if (ci->count == SWAPFILE_CLUSTER) {
@@ -876,19 +876,11 @@ static void del_from_avail_list(struct swap_info_struct *si)
 	spin_unlock(&swap_avail_lock);
 }
 
-static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries)
 {
-	unsigned int end = offset + nr_entries - 1;
-
-	if (offset == si->lowest_bit)
-		si->lowest_bit += nr_entries;
-	if (end == si->highest_bit)
-		WRITE_ONCE(si->highest_bit, si->highest_bit - nr_entries);
 	WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
 	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
 		del_from_avail_list(si);
 
 		if (vm_swap_full())
@@ -921,15 +913,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	for (i = 0; i < nr_entries; i++)
 		clear_bit(offset + i, si->zeromap);
 
-	if (offset < si->lowest_bit)
-		si->lowest_bit = offset;
-	if (end > si->highest_bit) {
-		bool was_full = !si->highest_bit;
-
-		WRITE_ONCE(si->highest_bit, end);
-		if (was_full && (si->flags & SWP_WRITEOK))
-			add_to_avail_list(si);
-	}
+	if (si->inuse_pages == si->pages)
+		add_to_avail_list(si);
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify = si->bdev->bd_disk->fops->swap_slot_free_notify;
@@ -1035,15 +1020,12 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
-		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
+		if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
 			spin_lock(&swap_avail_lock);
 			if (plist_node_empty(&si->avail_lists[node])) {
 				spin_unlock(&si->lock);
 				goto nextsi;
 			}
-			WARN(!si->highest_bit,
-			     "swap_info %d in list but !highest_bit\n",
-			     si->type);
 			WARN(!(si->flags & SWP_WRITEOK),
 			     "swap_info %d in list but !SWP_WRITEOK\n",
 			     si->type);
@@ -2425,8 +2407,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 	 */
 	plist_add(&si->list, &swap_active_head);
 
-	/* add to available list iff swap device is not full */
-	if (si->highest_bit)
+	/* add to available list if swap device is not full */
+	if (si->inuse_pages < si->pages)
 		add_to_avail_list(si);
 }
 
@@ -2590,7 +2572,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	drain_mmlist();
 
 	/* wait for anyone still in scan_swap_map_slots */
-	p->highest_bit = 0;		/* cuts scans short */
 	while (p->flags >= SWP_SCANNING) {
 		spin_unlock(&p->lock);
 		spin_unlock(&swap_lock);
@@ -2925,8 +2906,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 		return 0;
 	}
 
-	si->lowest_bit	 = 1;
-
 	maxpages = swapfile_maximum_size;
 	last_page = swap_header->info.last_page;
 	if (!last_page) {
@@ -2943,7 +2922,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 		if ((unsigned int)maxpages == 0)
 			maxpages = UINT_MAX;
 	}
-	si->highest_bit = maxpages - 1;
 
 	if (!maxpages)
 		return 0;
-- 
2.47.0

From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, "Huang, Ying", Tim Chen, Nhat Pham,
	linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 06/13] mm, swap: clean up plist removal and adding
Date: Wed, 23 Oct 2024 03:24:44 +0800
Message-ID: <20241022192451.38138-7-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>
<20241022192451.38138-1-ryncsn@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Kairui Song When a swap device is full (inuse_pages =3D=3D pages), it should be removed from the plist, and when any slot is freed, it should be added back. On swapoff / swapon, the device is also force removed / added. This is currently serialized by si->lock, and some historical sanity-check code is still here. This commit decouples the plist handling from the protection of si->lock and cleans it up, to prepare for the si->lock rework. The inuse_pages counter is the only thing that decides whether a device should be removed from or added to the plist (except swapon / swapoff as a special case), and it is a very hot counter. So, to avoid extra overhead on the counter update hot path, and to make it possible to check and update the plist whenever the counter value changes, embed the plist state into the inuse_pages counter and turn the counter into an atomic. This way the counter can be checked and updated with one CAS, avoiding any extra synchronization: if the counter is full (inuse_pages =3D=3D pages) with the off-list bit unset, try to remove the device from the plist; if the counter is not full (inuse_pages !=3D pages) with the off-list bit set, try to add it back. Removal and addition, like setting the bit, are serialized with a lock; ordinary counter updates stay lockless. Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 2 +- mm/swapfile.c | 182 +++++++++++++++++++++++++++++++------------ 2 files changed, 132 insertions(+), 52 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index c0d49dad7a4b..16dcf8bd1a4e 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -306,7 +306,7 @@ struct swap_info_struct { /* list of cluster that are fragmented or contented */ unsigned int frag_cluster_nr[SWAP_NR_ORDERS]; unsigned int pages; /* total of usable pages of swap */ - unsigned int inuse_pages; /* number of those currently in use */ + atomic_long_t inuse_pages; /* number of those currently in use */ struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap locatio= n */ struct rb_root swap_extent_root;/* root of the swap extent rbtree */ struct block_device *bdev; /* swap device or bdev of swap file */ diff --git a/mm/swapfile.c b/mm/swapfile.c index e620b41c3120..4e629536a07c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -128,6 +128,25 @@ static inline unsigned char swap_count(unsigned char e= nt) return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ } =20 +/* + * Use the second highest bit of inuse_pages as the indicator + * of if one swap device is on the allocation plist. + * + * inuse_pages is the only thing decides of a device should be on + * list or not (except swapoff as a special case). By embedding the + * on-list bit into it, updaters don't need any lock to check the + * device list status. + * + * This bit will be set to 1 if the device is not on the plist and not + * usable, will be cleared if the device is on the plist.
+ */ +#define SWAP_USAGE_OFFLIST_BIT (1UL << (BITS_PER_TYPE(atomic_t) - 2)) +#define SWAP_USAGE_COUNTER_MASK (~SWAP_USAGE_OFFLIST_BIT) +static long swap_usage_in_pages(struct swap_info_struct *si) +{ + return atomic_long_read(&si->inuse_pages) & SWAP_USAGE_COUNTER_MASK; +} + /* Reclaim the swap entry anyway if possible */ #define TTRS_ANYWAY 0x1 /* @@ -709,7 +728,7 @@ static void swap_reclaim_full_clusters(struct swap_info= _struct *si, bool force) int nr_reclaim; =20 if (force) - to_scan =3D si->inuse_pages / SWAPFILE_CLUSTER; + to_scan =3D swap_usage_in_pages(si) / SWAPFILE_CLUSTER; =20 while (!list_empty(&si->full_clusters)) { ci =3D list_first_entry(&si->full_clusters, struct swap_cluster_info, li= st); @@ -860,42 +879,121 @@ static unsigned long cluster_alloc_swap_entry(struct= swap_info_struct *si, int o return found; } =20 -static void __del_from_avail_list(struct swap_info_struct *si) +/* + * SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper and synced wi= th + * counter updaters with atomic. + */ +static void del_from_avail_list(struct swap_info_struct *si, bool swapoff) { int nid; =20 - assert_spin_locked(&si->lock); + spin_lock(&swap_avail_lock); + + if (swapoff) { + /* Clear SWP_WRITEOK so add_to_avail_list won't add it back */ + si->flags &=3D ~SWP_WRITEOK; + + /* Force take it off. */ + atomic_long_or(SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages); + } else { + /* + * If not swapoff, take it off-list only when it's full and + * SWAP_USAGE_OFFLIST_BIT is not set (inuse_pages =3D=3D pages). + * The cmpxchg below will fail and skip the removal if there + * are slots freed or device is off-listed by someone else. + */ + if (atomic_long_cmpxchg(&si->inuse_pages, si->pages, + si->pages | SWAP_USAGE_OFFLIST_BIT) !=3D si->pages) + goto skip; + } + for_each_node(nid) plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]); + +skip: + spin_unlock(&swap_avail_lock); } =20 -static void del_from_avail_list(struct swap_info_struct *si) +/* + * SWAP_USAGE_OFFLIST_BIT can only be set by this helper and synced with + * counter updaters with atomic. + */ +static void add_to_avail_list(struct swap_info_struct *si, bool swapon) { + int nid; + long val; + bool swapoff; + spin_lock(&swap_avail_lock); - __del_from_avail_list(si); + + /* Special handling for swapon / swapoff */ + if (swapon) { + si->flags |=3D SWP_WRITEOK; + swapoff =3D false; + } else { + swapoff =3D !(READ_ONCE(si->flags) & SWP_WRITEOK); + } + + if (swapoff) + goto skip; + + if (!(atomic_long_read(&si->inuse_pages) & SWAP_USAGE_OFFLIST_BIT)) + goto skip; + + val =3D atomic_long_fetch_and_relaxed(~SWAP_USAGE_OFFLIST_BIT, &si->inuse= _pages); + + /* + * When device is full and device is on the plist, only one updater will + * see (inuse_pages =3D=3D si->pages) and will call del_from_avail_list. = If + * that updater happen to be here, just skip adding. + */ + if (val =3D=3D si->pages) { + /* Just like the cmpxchg in del_from_avail_list */ + if (atomic_long_cmpxchg(&si->inuse_pages, si->pages, + si->pages | SWAP_USAGE_OFFLIST_BIT) =3D=3D si->pages) + goto skip; + } + + for_each_node(nid) + plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]); + +skip: spin_unlock(&swap_avail_lock); } =20 -static void swap_range_alloc(struct swap_info_struct *si, - unsigned int nr_entries) +/* + * swap_usage_add / swap_usage_sub are serialized by ci->lock in each clus= ter + * so the total contribution to the global counter should always be positi= ve. 
+ */ +static bool swap_usage_add(struct swap_info_struct *si, unsigned int nr_en= tries) { - WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries); - if (si->inuse_pages =3D=3D si->pages) { - del_from_avail_list(si); + long val =3D atomic_long_add_return_relaxed(nr_entries, &si->inuse_pages); =20 - if (vm_swap_full()) - schedule_work(&si->reclaim_work); + /* If device is full, SWAP_USAGE_OFFLIST_BIT not set, try off list it */ + if (val =3D=3D si->pages) { + del_from_avail_list(si, false); + return true; } + + return false; } =20 -static void add_to_avail_list(struct swap_info_struct *si) +static void swap_usage_sub(struct swap_info_struct *si, unsigned int nr_en= tries) { - int nid; + long val =3D atomic_long_sub_return_relaxed(nr_entries, &si->inuse_pages); =20 - spin_lock(&swap_avail_lock); - for_each_node(nid) - plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]); - spin_unlock(&swap_avail_lock); + /* If device is off list, try add it back */ + if (val & SWAP_USAGE_OFFLIST_BIT) + add_to_avail_list(si, false); +} + +static void swap_range_alloc(struct swap_info_struct *si, + unsigned int nr_entries) +{ + if (swap_usage_add(si, nr_entries)) { + if (vm_swap_full()) + schedule_work(&si->reclaim_work); + } } =20 static void swap_range_free(struct swap_info_struct *si, unsigned long off= set, @@ -913,8 +1011,6 @@ static void swap_range_free(struct swap_info_struct *s= i, unsigned long offset, for (i =3D 0; i < nr_entries; i++) clear_bit(offset + i, si->zeromap); =20 - if (si->inuse_pages =3D=3D si->pages) - add_to_avail_list(si); if (si->flags & SWP_BLKDEV) swap_slot_free_notify =3D si->bdev->bd_disk->fops->swap_slot_free_notify; @@ -928,13 +1024,13 @@ static void swap_range_free(struct swap_info_struct = *si, unsigned long offset, } clear_shadow_from_swap_cache(si->type, begin, end); =20 + atomic_long_add(nr_entries, &nr_swap_pages); /* * Make sure that try_to_unuse() observes si->inuse_pages reaching 0 * only after the above cleanups are done. 
*/ smp_wmb(); - atomic_long_add(nr_entries, &nr_swap_pages); - WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); + swap_usage_sub(si, nr_entries); } =20 static int cluster_alloc_swap(struct swap_info_struct *si, @@ -1020,19 +1116,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entri= es[], int entry_order) plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]); spin_unlock(&swap_avail_lock); spin_lock(&si->lock); - if ((si->inuse_pages =3D=3D si->pages) || !(si->flags & SWP_WRITEOK)) { - spin_lock(&swap_avail_lock); - if (plist_node_empty(&si->avail_lists[node])) { - spin_unlock(&si->lock); - goto nextsi; - } - WARN(!(si->flags & SWP_WRITEOK), - "swap_info %d in list but !SWP_WRITEOK\n", - si->type); - __del_from_avail_list(si); - spin_unlock(&si->lock); - goto nextsi; - } n_ret =3D scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal, swp_entries, order); spin_unlock(&si->lock); @@ -1041,7 +1124,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entrie= s[], int entry_order) cond_resched(); =20 spin_lock(&swap_avail_lock); -nextsi: /* * if we got here, it's likely that si was almost full before, * and since scan_swap_map_slots() can drop the si->lock, @@ -1773,7 +1855,7 @@ unsigned int count_swap_pages(int type, int free) if (sis->flags & SWP_WRITEOK) { n =3D sis->pages; if (free) - n -=3D sis->inuse_pages; + n -=3D swap_usage_in_pages(sis); } spin_unlock(&sis->lock); } @@ -2108,7 +2190,7 @@ static int try_to_unuse(unsigned int type) swp_entry_t entry; unsigned int i; =20 - if (!READ_ONCE(si->inuse_pages)) + if (!swap_usage_in_pages(si)) goto success; =20 retry: @@ -2121,7 +2203,7 @@ static int try_to_unuse(unsigned int type) =20 spin_lock(&mmlist_lock); p =3D &init_mm.mmlist; - while (READ_ONCE(si->inuse_pages) && + while (swap_usage_in_pages(si) && !signal_pending(current) && (p =3D p->next) !=3D &init_mm.mmlist) { =20 @@ -2149,7 +2231,7 @@ static int try_to_unuse(unsigned int type) mmput(prev_mm); =20 i =3D 0; - while (READ_ONCE(si->inuse_pages) && + while (swap_usage_in_pages(si) && !signal_pending(current) && (i =3D find_next_to_unuse(si, i)) !=3D 0) { =20 @@ -2184,7 +2266,7 @@ static int try_to_unuse(unsigned int type) * folio_alloc_swap(), temporarily hiding that swap. It's easy * and robust (though cpu-intensive) just to keep retrying. */ - if (READ_ONCE(si->inuse_pages)) { + if (swap_usage_in_pages(si)) { if (!signal_pending(current)) goto retry; return -EINTR; @@ -2193,7 +2275,7 @@ static int try_to_unuse(unsigned int type) success: /* * Make sure that further cleanups after try_to_unuse() returns happen - * after swap_range_free() reduces si->inuse_pages to 0. + * after swap_range_free() reduces inuse_pages to 0. 
*/ smp_mb(); return 0; @@ -2211,7 +2293,7 @@ static void drain_mmlist(void) unsigned int type; =20 for (type =3D 0; type < nr_swapfiles; type++) - if (swap_info[type]->inuse_pages) + if (swap_usage_in_pages(swap_info[type])) return; spin_lock(&mmlist_lock); list_for_each_safe(p, next, &init_mm.mmlist) @@ -2390,7 +2472,6 @@ static void setup_swap_info(struct swap_info_struct *= si, int prio, =20 static void _enable_swap_info(struct swap_info_struct *si) { - si->flags |=3D SWP_WRITEOK; atomic_long_add(si->pages, &nr_swap_pages); total_swap_pages +=3D si->pages; =20 @@ -2407,9 +2488,8 @@ static void _enable_swap_info(struct swap_info_struct= *si) */ plist_add(&si->list, &swap_active_head); =20 - /* add to available list if swap device is not full */ - if (si->inuse_pages < si->pages) - add_to_avail_list(si); + /* Add back to available list */ + add_to_avail_list(si, true); } =20 static void enable_swap_info(struct swap_info_struct *si, int prio, @@ -2507,7 +2587,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special= file) goto out_dput; } spin_lock(&p->lock); - del_from_avail_list(p); + del_from_avail_list(p, true); if (p->prio < 0) { struct swap_info_struct *si =3D p; int nid; @@ -2525,7 +2605,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special= file) plist_del(&p->list, &swap_active_head); atomic_long_sub(p->pages, &nr_swap_pages); total_swap_pages -=3D p->pages; - p->flags &=3D ~SWP_WRITEOK; spin_unlock(&p->lock); spin_unlock(&swap_lock); =20 @@ -2705,7 +2784,7 @@ static int swap_show(struct seq_file *swap, void *v) } =20 bytes =3D K(si->pages); - inuse =3D K(READ_ONCE(si->inuse_pages)); + inuse =3D K(swap_usage_in_pages(si)); =20 file =3D si->swap_file; len =3D seq_file_path(swap, file, " \t\n\\"); @@ -2822,6 +2901,7 @@ static struct swap_info_struct *alloc_swap_info(void) } spin_lock_init(&p->lock); spin_lock_init(&p->cont_lock); + atomic_long_set(&p->inuse_pages, SWAP_USAGE_OFFLIST_BIT); init_completion(&p->comp); =20 return p; @@ -3319,7 +3399,7 @@ void si_swapinfo(struct sysinfo *val) struct swap_info_struct *si =3D swap_info[type]; =20 if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK)) - nr_to_be_unused +=3D READ_ONCE(si->inuse_pages); + nr_to_be_unused +=3D swap_usage_in_pages(si); } val->freeswap =3D atomic_long_read(&nr_swap_pages) + nr_to_be_unused; val->totalswap =3D total_swap_pages + nr_to_be_unused; --=20 2.47.0
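Patch 06's counter trick can be modeled in a few lines of portable C11. The sketch below is a simplified single-flag-bit model with illustrative names (dev, dev_alloc, dev_free are ours; the kernel additionally serializes the actual plist add/del under swap_avail_lock and uses a different bit position):

#include <stdatomic.h>
#include <stdbool.h>

#define OFFLIST_BIT  (1L << 62)	/* set: device not on the avail list */
#define COUNTER_MASK (~OFFLIST_BIT)

struct dev {
	long pages;		/* capacity, fixed after setup */
	atomic_long inuse;	/* usage counter + embedded list state */
};

/* Hot path: one relaxed add. Only the updater that observes the exact
 * "just became full" value attempts the off-list transition via CAS. */
static bool dev_alloc(struct dev *d, long nr)
{
	long v = atomic_fetch_add_explicit(&d->inuse, nr,
					   memory_order_relaxed) + nr;
	long full = d->pages;

	if (v != full)
		return false;
	/* Fails harmlessly if slots were freed meanwhile or the device
	 * was already taken off-list by someone else. */
	return atomic_compare_exchange_strong(&d->inuse, &full,
					      full | OFFLIST_BIT);
}

/* Free path: if the off-list bit is visible after subtracting, the
 * device has free slots but is off the list, so put it back. */
static bool dev_free(struct dev *d, long nr)
{
	long v = atomic_fetch_sub_explicit(&d->inuse, nr,
					   memory_order_relaxed) - nr;

	if (!(v & OFFLIST_BIT))
		return false;
	atomic_fetch_and(&d->inuse, COUNTER_MASK);
	return true;	/* caller re-links the device */
}

Returning true from either helper marks the only point where the real code takes a lock; every other counter update stays lockless.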
From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Tim Chen , Nhat Pham , linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 07/13] mm, swap: hold a reference of si during scan and clean up flags
Date: Wed, 23 Oct 2024 03:24:45 +0800
Message-ID: <20241022192451.38138-8-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type:
text/plain; charset="utf-8"

From: Kairui Song The flag SWP_SCANNING was used as an indicator that a device is being scanned, to prevent swapoff, but it is no longer used. The only thing protecting the scanning now is the si lock; however, the allocation path may drop the si lock, which in theory could lead to a UAF. So clean this up: just hold a reference for the whole allocation path, so the per-CPU counter killing will wait for any existing scan and other users. The flag SWP_SCANNING can then be dropped as well. Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 1 - mm/swapfile.c | 62 +++++++++++++++++++++++--------------------- 2 files changed, 33 insertions(+), 30 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 16dcf8bd1a4e..1651174959c8 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -219,7 +219,6 @@ enum { SWP_STABLE_WRITES =3D (1 << 11), /* no overwrite PG_writeback pages */ SWP_SYNCHRONOUS_IO =3D (1 << 12), /* synchronous IO is efficient */ /* add others here before... */ - SWP_SCANNING =3D (1 << 14), /* refcount in scan_swap_map */ }; =20 #define SWAP_CLUSTER_MAX 32UL diff --git a/mm/swapfile.c b/mm/swapfile.c index 4e629536a07c..d6b6e71ccc19 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1088,6 +1088,21 @@ static int scan_swap_map_slots(struct swap_info_stru= ct *si, return cluster_alloc_swap(si, usage, nr, slots, order); } =20 +static bool get_swap_device_info(struct swap_info_struct *si) +{ + if (!percpu_ref_tryget_live(&si->users)) + return false; + /* + * Guarantee the si->users are checked before accessing other + * fields of swap_info_struct. + * + * Paired with the spin_unlock() after setup_swap_info() in + * enable_swap_info(). + */ + smp_rmb(); + return true; +} + int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) { int order =3D swap_entry_order(entry_order); @@ -1115,13 +1130,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entr= ies[], int entry_order) /* requeue si to after same-priority siblings */ plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]); spin_unlock(&swap_avail_lock); - spin_lock(&si->lock); - n_ret =3D scan_swap_map_slots(si, SWAP_HAS_CACHE, - n_goal, swp_entries, order); - spin_unlock(&si->lock); - if (n_ret || size > 1) - goto check_out; - cond_resched(); + if (get_swap_device_info(si)) { + spin_lock(&si->lock); + n_ret =3D scan_swap_map_slots(si, SWAP_HAS_CACHE, + n_goal, swp_entries, order); + spin_unlock(&si->lock); + put_swap_device(si); + if (n_ret || size > 1) + goto check_out; + cond_resched(); + } =20 spin_lock(&swap_avail_lock); /* @@ -1272,16 +1290,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t= entry) si =3D swp_swap_info(entry); if (!si) goto bad_nofile; - if (!percpu_ref_tryget_live(&si->users)) + if (!get_swap_device_info(si)) goto out; - /* - * Guarantee the si->users are checked before accessing other - * fields of swap_info_struct. - * - * Paired with the spin_unlock() after setup_swap_info() in - * enable_swap_info().
- */ - smp_rmb(); offset =3D swp_offset(entry); if (offset >=3D si->max) goto put_out; @@ -1761,10 +1771,13 @@ swp_entry_t get_swap_page_of_type(int type) goto fail; =20 /* This is called for allocating swap entry, not cache */ - spin_lock(&si->lock); - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0)) - atomic_long_dec(&nr_swap_pages); - spin_unlock(&si->lock); + if (get_swap_device_info(si)) { + spin_lock(&si->lock); + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0= )) + atomic_long_dec(&nr_swap_pages); + spin_unlock(&si->lock); + put_swap_device(si); + } fail: return entry; } @@ -2650,15 +2663,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specia= lfile) spin_lock(&p->lock); drain_mmlist(); =20 - /* wait for anyone still in scan_swap_map_slots */ - while (p->flags >=3D SWP_SCANNING) { - spin_unlock(&p->lock); - spin_unlock(&swap_lock); - schedule_timeout_uninterruptible(1); - spin_lock(&swap_lock); - spin_lock(&p->lock); - } - swap_file =3D p->swap_file; p->swap_file =3D NULL; p->max =3D 0; --=20 2.47.0
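The lifetime rule patch 07 relies on, holding a reference across the whole scan instead of a flag, can be approximated with a plain atomic refcount. This is only a model of the tryget-live idea, assuming a simple `users` counter where zero means "dying"; the kernel's percpu_ref also has a per-CPU fast path and a release callback:

#include <stdatomic.h>
#include <stdbool.h>

struct dev {
	atomic_int users;	/* 0 means dying: no new scans may start */
};

/* Mirrors the role of percpu_ref_tryget_live(): succeed only while
 * the device is alive, so teardown can wait for scans to drain. */
static bool dev_tryget(struct dev *d)
{
	int v = atomic_load(&d->users);

	while (v > 0) {
		if (atomic_compare_exchange_weak(&d->users, &v, v + 1))
			return true;	/* safe to use until dev_put() */
	}
	return false;		/* device is being torn down */
}

static void dev_put(struct dev *d)
{
	/* In the kernel, the final put completes swapoff's wait. */
	atomic_fetch_sub(&d->users, 1);
}

With this in place there is nothing left for a SWP_SCANNING-style flag to do: any path that might touch the device after dropping si->lock first takes a reference, and swapoff simply waits for the count to drain.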
From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Tim Chen , Nhat Pham , linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes
Date: Wed, 23 Oct 2024 03:24:46 +0800
Message-ID: <20241022192451.38138-9-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Kairui Song Currently we only use flags to indicate which list a cluster is on. Using one bit for each list type is wasteful: as the number of list types grows, we will consume too many bits. And the current mixed usage of "&" and "=3D=3D" is confusing. Make it clean by using an enum to define all possible cluster statuses; only an off-list cluster will have the NONE (0) flag. And use a wrapper to annotate and sanitize all flag setting and list movement. Suggested-by: Chris Li Signed-off-by: Kairui Song --- include/linux/swap.h | 17 +++++++--- mm/swapfile.c | 76 +++++++++++++++++++++++--------------------- 2 files changed, 53 insertions(+), 40 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 1651174959c8..75fc2da1767d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -256,10 +256,19 @@ struct swap_cluster_info { u8 order; struct list_head list; }; -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */ -#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ -#define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */ -#define CLUSTER_FLAG_FULL 8 /* This cluster is on full list */ + +/* + * All on-list cluster must have a non-zero flag.
+ */ +enum swap_cluster_flags { + CLUSTER_FLAG_NONE =3D 0, /* For temporary off-list cluster */ + CLUSTER_FLAG_FREE, + CLUSTER_FLAG_NONFULL, + CLUSTER_FLAG_FRAG, + CLUSTER_FLAG_FULL, + CLUSTER_FLAG_DISCARD, + CLUSTER_FLAG_MAX, +}; =20 /* * The first page in the swap file is the swap header, which is always mar= ked diff --git a/mm/swapfile.c b/mm/swapfile.c index d6b6e71ccc19..96d8012b003c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -402,7 +402,7 @@ static void discard_swap_cluster(struct swap_info_struc= t *si, =20 static inline bool cluster_is_free(struct swap_cluster_info *info) { - return info->flags & CLUSTER_FLAG_FREE; + return info->flags =3D=3D CLUSTER_FLAG_FREE; } =20 static inline unsigned int cluster_index(struct swap_info_struct *si, @@ -433,6 +433,27 @@ static inline void unlock_cluster(struct swap_cluster_= info *ci) spin_unlock(&ci->lock); } =20 +static void cluster_move(struct swap_info_struct *si, + struct swap_cluster_info *ci, struct list_head *list, + enum swap_cluster_flags new_flags) +{ + VM_WARN_ON(ci->flags =3D=3D new_flags); + BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX); + + if (ci->flags =3D=3D CLUSTER_FLAG_NONE) { + list_add_tail(&ci->list, list); + } else { + if (ci->flags =3D=3D CLUSTER_FLAG_FRAG) { + VM_WARN_ON(!si->frag_cluster_nr[ci->order]); + si->frag_cluster_nr[ci->order]--; + } + list_move_tail(&ci->list, list); + } + ci->flags =3D new_flags; + if (new_flags =3D=3D CLUSTER_FLAG_FRAG) + si->frag_cluster_nr[ci->order]++; +} + /* Add a cluster to discard list and schedule it to do discard */ static void swap_cluster_schedule_discard(struct swap_info_struct *si, struct swap_cluster_info *ci) @@ -446,10 +467,8 @@ static void swap_cluster_schedule_discard(struct swap_= info_struct *si, */ memset(si->swap_map + idx * SWAPFILE_CLUSTER, SWAP_MAP_BAD, SWAPFILE_CLUSTER); - - VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); - list_move_tail(&ci->list, &si->discard_clusters); - ci->flags =3D 0; + VM_BUG_ON(ci->flags =3D=3D CLUSTER_FLAG_FREE); + cluster_move(si, ci, &si->discard_clusters, CLUSTER_FLAG_DISCARD); schedule_work(&si->discard_work); } =20 @@ -457,12 +476,7 @@ static void __free_cluster(struct swap_info_struct *si= , struct swap_cluster_info { lockdep_assert_held(&si->lock); lockdep_assert_held(&ci->lock); - - if (ci->flags) - list_move_tail(&ci->list, &si->free_clusters); - else - list_add_tail(&ci->list, &si->free_clusters); - ci->flags =3D CLUSTER_FLAG_FREE; + cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE); ci->order =3D 0; } =20 @@ -478,6 +492,8 @@ static void swap_do_scheduled_discard(struct swap_info_= struct *si) while (!list_empty(&si->discard_clusters)) { ci =3D list_first_entry(&si->discard_clusters, struct swap_cluster_info,= list); list_del(&ci->list); + /* Must clear flag when taking a cluster off-list */ + ci->flags =3D CLUSTER_FLAG_NONE; idx =3D cluster_index(si, ci); spin_unlock(&si->lock); =20 @@ -518,9 +534,6 @@ static void free_cluster(struct swap_info_struct *si, s= truct swap_cluster_info * lockdep_assert_held(&si->lock); lockdep_assert_held(&ci->lock); =20 - if (ci->flags & CLUSTER_FLAG_FRAG) - si->frag_cluster_nr[ci->order]--; - /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately. 
The cluster will be freed @@ -572,13 +585,9 @@ static void dec_cluster_info_page(struct swap_info_str= uct *si, return; } =20 - if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { - VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); - if (ci->flags & CLUSTER_FLAG_FRAG) - si->frag_cluster_nr[ci->order]--; - list_move_tail(&ci->list, &si->nonfull_clusters[ci->order]); - ci->flags =3D CLUSTER_FLAG_NONFULL; - } + if (ci->flags !=3D CLUSTER_FLAG_NONFULL) + cluster_move(si, ci, &si->nonfull_clusters[ci->order], + CLUSTER_FLAG_NONFULL); } =20 static bool cluster_reclaim_range(struct swap_info_struct *si, @@ -657,11 +666,14 @@ static void cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster { unsigned int nr_pages =3D 1 << order; =20 + VM_BUG_ON(ci->flags !=3D CLUSTER_FLAG_FREE && + ci->flags !=3D CLUSTER_FLAG_NONFULL && + ci->flags !=3D CLUSTER_FLAG_FRAG); + if (cluster_is_free(ci)) { - if (nr_pages < SWAPFILE_CLUSTER) { - list_move_tail(&ci->list, &si->nonfull_clusters[order]); - ci->flags =3D CLUSTER_FLAG_NONFULL; - } + if (nr_pages < SWAPFILE_CLUSTER) + cluster_move(si, ci, &si->nonfull_clusters[order], + CLUSTER_FLAG_NONFULL); ci->order =3D order; } =20 @@ -669,14 +681,8 @@ static void cluster_alloc_range(struct swap_info_struc= t *si, struct swap_cluster swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; =20 - if (ci->count =3D=3D SWAPFILE_CLUSTER) { - VM_BUG_ON(!(ci->flags & - (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG))); - if (ci->flags & CLUSTER_FLAG_FRAG) - si->frag_cluster_nr[ci->order]--; - list_move_tail(&ci->list, &si->full_clusters); - ci->flags =3D CLUSTER_FLAG_FULL; - } + if (ci->count =3D=3D SWAPFILE_CLUSTER) + cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL); } =20 static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, u= nsigned long offset, @@ -806,9 +812,7 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o while (!list_empty(&si->nonfull_clusters[order])) { ci =3D list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list); - list_move_tail(&ci->list, &si->frag_clusters[order]); - ci->flags =3D CLUSTER_FLAG_FRAG; - si->frag_cluster_nr[order]++; + cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG); offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); frags++; --=20 2.47.0
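The core invariant patch 08 establishes, that flag and list membership always change together through one helper, reduces to a few lines. A standalone sketch with a hand-rolled circular list (the enum values and list fields are illustrative stand-ins for the kernel's list_head machinery):

#include <assert.h>

enum cl_flags { CL_NONE = 0, CL_FREE, CL_NONFULL, CL_FRAG, CL_FULL };

struct cluster {
	enum cl_flags flags;		/* CL_NONE only while off-list */
	struct cluster *prev, *next;	/* circular list, sentinel head */
};

/* Every list movement funnels through here, so the flag can never go
 * out of sync with the list the cluster actually sits on. */
static void cluster_move(struct cluster *ci, struct cluster *head,
			 enum cl_flags new_flags)
{
	assert(ci->flags != new_flags);

	if (ci->flags != CL_NONE) {	/* unlink from the current list */
		ci->prev->next = ci->next;
		ci->next->prev = ci->prev;
	}
	ci->prev = head->prev;		/* link at the tail of the target */
	ci->next = head;
	head->prev->next = ci;
	head->prev = ci;
	ci->flags = new_flags;
}

Because CL_NONE is reserved for off-list clusters, a reader holding only the cluster lock can tell from the flag alone whether the cluster is on a list, which is exactly what the following patch exploits.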
From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Tim Chen , Nhat Pham , linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 09/13] mm, swap: reduce contention on device lock
Date: Wed, 23 Oct 2024 03:24:47 +0800
Message-ID: <20241022192451.38138-10-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From:
Kairui Song Currently, swap locking is mainly composed of two locks: the cluster lock (ci->lock) and the device lock (si->lock). The cluster lock is much more fine-grained, so it is best to use ci->lock instead of si->lock wherever possible. Following the new cluster allocator design, many operations don't need to touch si->lock at all. In practice, we only need to take si->lock when moving clusters between lists. To achieve this, this commit reworks the locking pattern of all si->lock and ci->lock users, eliminates all usage of ci->lock inside si->lock, and introduces a new design that avoids touching si->lock as much as possible. For minimal allocation contention and easier understanding, two ideas are introduced with corresponding helpers: `isolation` and `relocation`: - Clusters are `isolated` from their list upon being scanned for allocation, so scanning an on-list cluster no longer needs to hold si->lock except for that brief moment, which removes the ci->lock usage inside si->lock. In the new allocator design, a cluster is always moved after scanning (free -> nonfull, nonfull -> frag, frag -> frag tail), so this introduces no extra overhead. This also greatly reduces the contention on both si->lock and ci->lock, as other CPUs won't walk onto the same cluster by iterating the list. The off-list time window of a cluster is also minimal: one CPU can hold at most one cluster while scanning the 512 entries on it, where we previously busy-waited with a spin lock. This is done with `cluster_isolate_lock` when scanning a new cluster. Note: scanning the per-CPU cluster is a special case; it doesn't isolate the cluster, because it doesn't need to hold si->lock at all. It simply acquires the ci->lock of the previously used cluster and uses it. - A cluster is `relocated` after allocation or freeing, according to its count and status. Allocation no longer holds si->lock, and may drop ci->lock for reclaim, so the cluster could end up anywhere. Besides, `isolation` clears all flags when it takes a cluster off-list (the flags must be in sync with the list status, so cluster users don't need to touch si->lock to check the list status; this is important for reducing contention on si->lock). So the cluster has to be `relocated` to the right list according to its usage after allocation. This is done with `relocate_cluster` after allocation, or `[partial_]free_cluster` after freeing. Now, except for swapon / swapoff and discard, `isolation` and `relocation` are the only two places that need to take si->lock. And since each CPU keeps using its per-CPU cluster as much as possible, and a cluster has 512 entries to be consumed, si->lock is rarely touched. The lock contention on si->lock is now barely observable.
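As a rough standalone model of the `isolation` step (illustrative only; pthread mutexes stand in for the kernel spinlocks and a hand-rolled circular list for list_head):

#include <pthread.h>
#include <stddef.h>

struct cluster {
	pthread_mutex_t lock;
	int flags;			/* 0 while off-list */
	struct cluster *prev, *next;
};

/* Walk the list under the list lock, but only trylock each cluster:
 * contended clusters are skipped rather than waited on. The winner
 * unlinks the cluster, clears its flags, and returns it locked. */
static struct cluster *cluster_isolate_lock(pthread_mutex_t *list_lock,
					    struct cluster *head)
{
	struct cluster *ci, *ret = NULL;

	pthread_mutex_lock(list_lock);
	for (ci = head->next; ci != head; ci = ci->next) {
		if (pthread_mutex_trylock(&ci->lock))
			continue;		/* contended, try the next one */
		ci->prev->next = ci->next;	/* list_del() */
		ci->next->prev = ci->prev;
		ci->flags = 0;			/* off-list: flags cleared */
		ret = ci;
		break;
	}
	pthread_mutex_unlock(list_lock);
	return ret;
}

The list lock is held only for the walk itself; scanning the isolated cluster's 512 entries then proceeds under the cluster lock alone.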
Building the Linux kernel with defconfig showed a huge performance improvement: time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C: Before: Sys time: 73578.30, Real time: 864.05 After: (-50.7% sys time, -44.8% real time) Sys time: 36227.49, Real time: 476.66 time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C: (avg of 4 test runs) Before: Sys time: 74044.85, Real time: 846.51 hugepages-64kB/stats/swpout: 1735216 hugepages-64kB/stats/swpout_fallback: 430333 After: (-40.4% sys time, -37.1% real time) Sys time: 44160.56, Real time: 532.07 hugepages-64kB/stats/swpout: 1786288 hugepages-64kB/stats/swpout_fallback: 243384 time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62: Before: Sys time: 8098.21, Real time: 401.3 After: (-22.6% sys time, -12.8% real time) Sys time: 6265.02, Real time: 349.83 The allocation success rate also slightly improved, as we sanitized cluster usage with the newly defined helpers and locks, so a temporarily dropped si->lock or ci->lock won't cause a cluster order shuffle. Suggested-by: Chris Li Signed-off-by: Kairui Song --- include/linux/swap.h | 5 +- mm/swapfile.c | 418 ++++++++++++++++++++++++------------------- 2 files changed, 239 insertions(+), 184 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 75fc2da1767d..a3b5d74b095a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -265,6 +265,8 @@ enum swap_cluster_flags { CLUSTER_FLAG_FREE, CLUSTER_FLAG_NONFULL, CLUSTER_FLAG_FRAG, + /* Clusters with flags above are allocatable */ + CLUSTER_FLAG_USABLE =3D CLUSTER_FLAG_FRAG, CLUSTER_FLAG_FULL, CLUSTER_FLAG_DISCARD, CLUSTER_FLAG_MAX, @@ -290,6 +292,7 @@ enum swap_cluster_flags { * throughput. */ struct percpu_cluster { + local_lock_t lock; /* Protect the percpu_cluster above */ unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ }; =20 @@ -312,7 +315,7 @@ struct swap_info_struct { /* list of cluster that contains at least one free slot */ struct list_head frag_clusters[SWAP_NR_ORDERS]; /* list of cluster that are fragmented or contented */ - unsigned int frag_cluster_nr[SWAP_NR_ORDERS]; + atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS]; unsigned int pages; /* total of usable pages of swap */ atomic_long_t inuse_pages; /* number of those currently in use */ struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap locatio= n */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 96d8012b003c..a19ee8d5ffd0 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -260,12 +260,10 @@ static int __try_to_reclaim_swap(struct swap_info_str= uct *si, folio_ref_sub(folio, nr_pages); folio_set_dirty(folio); =20 - spin_lock(&si->lock); /* Only sinple page folio can be backed by zswap */ if (nr_pages =3D=3D 1) zswap_invalidate(entry); swap_entry_range_free(si, entry, nr_pages); - spin_unlock(&si->lock); ret =3D nr_pages; out_unlock: folio_unlock(folio); @@ -402,7 +400,21 @@ static void discard_swap_cluster(struct swap_info_stru= ct *si, =20 static inline bool cluster_is_free(struct swap_cluster_info *info) { - return info->flags =3D=3D CLUSTER_FLAG_FREE; + return info->count =3D=3D 0; +} + +static inline bool cluster_is_discard(struct swap_cluster_info *info) +{ + return info->flags =3D=3D CLUSTER_FLAG_DISCARD; +} + +static inline bool cluster_is_usable(struct swap_cluster_info *ci, int ord= er) +{ + if (unlikely(ci->flags > CLUSTER_FLAG_USABLE)) + return false; + if (!order) + return true; + return cluster_is_free(ci) || order =3D=3D ci->order; } =20 static inline unsigned int cluster_index(struct
swap_info_struct *si, @@ -439,19 +451,20 @@ static void cluster_move(struct swap_info_struct *si, { VM_WARN_ON(ci->flags =3D=3D new_flags); BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX); + lockdep_assert_held(&ci->lock); =20 - if (ci->flags =3D=3D CLUSTER_FLAG_NONE) { + spin_lock(&si->lock); + if (ci->flags =3D=3D CLUSTER_FLAG_NONE) list_add_tail(&ci->list, list); - } else { - if (ci->flags =3D=3D CLUSTER_FLAG_FRAG) { - VM_WARN_ON(!si->frag_cluster_nr[ci->order]); - si->frag_cluster_nr[ci->order]--; - } + else list_move_tail(&ci->list, list); - } + spin_unlock(&si->lock); + + if (ci->flags =3D=3D CLUSTER_FLAG_FRAG) + atomic_long_dec(&si->frag_cluster_nr[ci->order]); + else if (new_flags =3D=3D CLUSTER_FLAG_FRAG) + atomic_long_inc(&si->frag_cluster_nr[ci->order]); ci->flags =3D new_flags; - if (new_flags =3D=3D CLUSTER_FLAG_FRAG) - si->frag_cluster_nr[ci->order]++; } =20 /* Add a cluster to discard list and schedule it to do discard */ @@ -474,39 +487,82 @@ static void swap_cluster_schedule_discard(struct swap= _info_struct *si, =20 static void __free_cluster(struct swap_info_struct *si, struct swap_cluste= r_info *ci) { - lockdep_assert_held(&si->lock); lockdep_assert_held(&ci->lock); cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE); ci->order =3D 0; } =20 +/* + * Isolate and lock the first cluster that is not contented on a list, + * clean its flag before taken off-list. Cluster flag must be in sync + * with list status, so cluster updaters can always know the cluster + * list status without touching si lock. + * + * Note it's possible that all clusters on a list are contented so + * this returns NULL for an non-empty list. + */ +static struct swap_cluster_info *cluster_isolate_lock( + struct swap_info_struct *si, struct list_head *list) +{ + struct swap_cluster_info *ci, *ret =3D NULL; + + spin_lock(&si->lock); + list_for_each_entry(ci, list, list) { + if (!spin_trylock(&ci->lock)) + continue; + + /* We may only isolate and clear flags of following lists */ + VM_BUG_ON(!ci->flags); + VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE && + ci->flags !=3D CLUSTER_FLAG_FULL); + + list_del(&ci->list); + ci->flags =3D CLUSTER_FLAG_NONE; + ret =3D ci; + break; + } + spin_unlock(&si->lock); + + return ret; +} + /* * Doing discard actually. After a cluster discard is finished, the cluster - * will be added to free cluster list. caller should hold si->lock. -*/ -static void swap_do_scheduled_discard(struct swap_info_struct *si) + * will be added to free cluster list. Discard cluster is a bit special as + * they don't participate in allocation or reclaim, so clusters marked as + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list. + */ +static bool swap_do_scheduled_discard(struct swap_info_struct *si) { struct swap_cluster_info *ci; + bool ret =3D false; unsigned int idx; =20 + spin_lock(&si->lock); while (!list_empty(&si->discard_clusters)) { ci =3D list_first_entry(&si->discard_clusters, struct swap_cluster_info,= list); + /* + * Delete the cluster from list but don't clear the flag until + * discard is done, so isolation and relocation will skip it. 
+ */ list_del(&ci->list); - /* Must clear flag when taking a cluster off-list */ - ci->flags =3D CLUSTER_FLAG_NONE; idx =3D cluster_index(si, ci); spin_unlock(&si->lock); - discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER); =20 - spin_lock(&si->lock); spin_lock(&ci->lock); - __free_cluster(si, ci); + /* Discard is done, return to list and clear the flag */ + ci->flags =3D CLUSTER_FLAG_NONE; memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER); + __free_cluster(si, ci); spin_unlock(&ci->lock); + ret =3D true; + spin_lock(&si->lock); } + spin_unlock(&si->lock); + return ret; } =20 static void swap_discard_work(struct work_struct *work) @@ -515,9 +571,7 @@ static void swap_discard_work(struct work_struct *work) =20 si =3D container_of(work, struct swap_info_struct, discard_work); =20 - spin_lock(&si->lock); swap_do_scheduled_discard(si); - spin_unlock(&si->lock); } =20 static void swap_users_ref_free(struct percpu_ref *ref) @@ -528,10 +582,14 @@ static void swap_users_ref_free(struct percpu_ref *re= f) complete(&si->comp); } =20 +/* + * Must be called after freeing if ci->count =3D=3D 0, puts the cluster to= free + * or discard list. + */ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_= info *ci) { VM_BUG_ON(ci->count !=3D 0); - lockdep_assert_held(&si->lock); + VM_BUG_ON(ci->flags =3D=3D CLUSTER_FLAG_FREE); lockdep_assert_held(&ci->lock); =20 /* @@ -548,6 +606,48 @@ static void free_cluster(struct swap_info_struct *si, = struct swap_cluster_info * __free_cluster(si, ci); } =20 +/* + * Must be called after freeing if ci->count !=3D 0, puts the cluster to f= ree + * or nonfull list. + */ +static void partial_free_cluster(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + VM_BUG_ON(!ci->count || ci->count =3D=3D SWAPFILE_CLUSTER); + lockdep_assert_held(&ci->lock); + + if (ci->flags !=3D CLUSTER_FLAG_NONFULL) + cluster_move(si, ci, &si->nonfull_clusters[ci->order], + CLUSTER_FLAG_NONFULL); +} + +/* + * Must be called after allocation, put the cluster to full or frag list. + * Note: allocation don't need si lock, and may drop the ci lock for recla= im, + * so the cluster could end up any where before re-acquiring ci lock. + */ +static void relocate_cluster(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + lockdep_assert_held(&ci->lock); + + /* Discard cluster must remain off-list or on discard list */ + if (cluster_is_discard(ci)) + return; + + if (!ci->count) { + free_cluster(si, ci); + } else if (ci->count !=3D SWAPFILE_CLUSTER) { + if (ci->flags !=3D CLUSTER_FLAG_FRAG) + cluster_move(si, ci, &si->frag_clusters[ci->order], + CLUSTER_FLAG_FRAG); + } else { + if (ci->flags !=3D CLUSTER_FLAG_FULL) + cluster_move(si, ci, &si->full_clusters, + CLUSTER_FLAG_FULL); + } +} + /* * The cluster corresponding to page_nr will be used. The cluster will not= be * added to free cluster list and its usage counter will be increased by 1. @@ -566,30 +666,6 @@ static void inc_cluster_info_page(struct swap_info_str= uct *si, VM_BUG_ON(ci->flags); } =20 -/* - * The cluster ci decreases @nr_pages usage. If the usage counter becomes = 0, - * which means no page in the cluster is in use, we can optionally discard - * the cluster and add it to free cluster list. 
- */ -static void dec_cluster_info_page(struct swap_info_struct *si, - struct swap_cluster_info *ci, int nr_pages) -{ - VM_BUG_ON(ci->count < nr_pages); - VM_BUG_ON(cluster_is_free(ci)); - lockdep_assert_held(&si->lock); - lockdep_assert_held(&ci->lock); - ci->count -=3D nr_pages; - - if (!ci->count) { - free_cluster(si, ci); - return; - } - - if (ci->flags !=3D CLUSTER_FLAG_NONFULL) - cluster_move(si, ci, &si->nonfull_clusters[ci->order], - CLUSTER_FLAG_NONFULL); -} - static bool cluster_reclaim_range(struct swap_info_struct *si, struct swap_cluster_info *ci, unsigned long start, unsigned long end) @@ -599,8 +675,6 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, int nr_reclaim; =20 spin_unlock(&ci->lock); - spin_unlock(&si->lock); - do { switch (READ_ONCE(map[offset])) { case 0: @@ -618,9 +692,7 @@ static bool cluster_reclaim_range(struct swap_info_stru= ct *si, } } while (offset < end); out: - spin_lock(&si->lock); spin_lock(&ci->lock); - /* * Recheck the range no matter reclaim succeeded or not, the slot * could have been be freed while we are not holding the lock. @@ -634,11 +706,11 @@ static bool cluster_reclaim_range(struct swap_info_st= ruct *si, =20 static bool cluster_scan_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned long start, unsigned int nr_pages) + unsigned long start, unsigned int nr_pages, + bool *need_reclaim) { unsigned long offset, end =3D start + nr_pages; unsigned char *map =3D si->swap_map; - bool need_reclaim =3D false; =20 for (offset =3D start; offset < end; offset++) { switch (READ_ONCE(map[offset])) { @@ -647,16 +719,13 @@ static bool cluster_scan_range(struct swap_info_struc= t *si, case SWAP_HAS_CACHE: if (!vm_swap_full()) return false; - need_reclaim =3D true; + *need_reclaim =3D true; continue; default: return false; } } =20 - if (need_reclaim) - return cluster_reclaim_range(si, ci, start, end); - return true; } =20 @@ -666,23 +735,12 @@ static void cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster { unsigned int nr_pages =3D 1 << order; =20 - VM_BUG_ON(ci->flags !=3D CLUSTER_FLAG_FREE && - ci->flags !=3D CLUSTER_FLAG_NONFULL && - ci->flags !=3D CLUSTER_FLAG_FRAG); - - if (cluster_is_free(ci)) { - if (nr_pages < SWAPFILE_CLUSTER) - cluster_move(si, ci, &si->nonfull_clusters[order], - CLUSTER_FLAG_NONFULL); + if (cluster_is_free(ci)) ci->order =3D order; - } =20 memset(si->swap_map + start, usage, nr_pages); swap_range_alloc(si, nr_pages); ci->count +=3D nr_pages; - - if (ci->count =3D=3D SWAPFILE_CLUSTER) - cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL); } =20 static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, u= nsigned long offset, @@ -692,34 +750,52 @@ static unsigned int alloc_swap_scan_cluster(struct sw= ap_info_struct *si, unsigne unsigned long start =3D offset & ~(SWAPFILE_CLUSTER - 1); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages =3D 1 << order; + bool need_reclaim, ret; struct swap_cluster_info *ci; =20 - if (end < nr_pages) - return SWAP_NEXT_INVALID; - end -=3D nr_pages; + ci =3D &si->cluster_info[offset / SWAPFILE_CLUSTER]; + lockdep_assert_held(&ci->lock); =20 - ci =3D lock_cluster(si, offset); - if (ci->count + nr_pages > SWAPFILE_CLUSTER) { + if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) { offset =3D SWAP_NEXT_INVALID; - goto done; + goto out; } =20 - while (offset <=3D end) { - if (cluster_scan_range(si, ci, offset, nr_pages)) { - cluster_alloc_range(si, ci, offset, usage, order); - 
*foundp =3D offset; - if (ci->count =3D=3D SWAPFILE_CLUSTER) { + for (end -=3D nr_pages; offset <=3D end; offset +=3D nr_pages) { + need_reclaim =3D false; + if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim)) + continue; + if (need_reclaim) { + ret =3D cluster_reclaim_range(si, ci, start, end); + /* + * Reclaim drops ci->lock and cluster could be used + * by another order. Not checking flag as off-list + * cluster has no flag set, and change of list + * won't cause fragmentation. + */ + if (!cluster_is_usable(ci, order)) { offset =3D SWAP_NEXT_INVALID; - goto done; + goto out; } - offset +=3D nr_pages; - break; + if (cluster_is_free(ci)) + offset =3D start; + /* Reclaim failed but cluster is usable, try next */ + if (!ret) + continue; + } + cluster_alloc_range(si, ci, offset, usage, order); + *foundp =3D offset; + if (ci->count =3D=3D SWAPFILE_CLUSTER) { + offset =3D SWAP_NEXT_INVALID; + goto out; } offset +=3D nr_pages; + break; } if (offset > end) offset =3D SWAP_NEXT_INVALID; -done: +out: + relocate_cluster(si, ci); unlock_cluster(ci); return offset; } @@ -736,18 +812,17 @@ static void swap_reclaim_full_clusters(struct swap_in= fo_struct *si, bool force) if (force) to_scan =3D swap_usage_in_pages(si) / SWAPFILE_CLUSTER; =20 - while (!list_empty(&si->full_clusters)) { - ci =3D list_first_entry(&si->full_clusters, struct swap_cluster_info, li= st); - list_move_tail(&ci->list, &si->full_clusters); + while ((ci =3D cluster_isolate_lock(si, &si->full_clusters))) { offset =3D cluster_offset(si, ci); end =3D min(si->max, offset + SWAPFILE_CLUSTER); to_scan--; =20 - spin_unlock(&si->lock); while (offset < end) { if (READ_ONCE(map[offset]) =3D=3D SWAP_HAS_CACHE) { + spin_unlock(&ci->lock); nr_reclaim =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); + spin_lock(&ci->lock); if (nr_reclaim) { offset +=3D abs(nr_reclaim); continue; @@ -755,8 +830,8 @@ static void swap_reclaim_full_clusters(struct swap_info= _struct *si, bool force) } offset++; } - spin_lock(&si->lock); =20 + unlock_cluster(ci); if (to_scan <=3D 0) break; } @@ -768,9 +843,7 @@ static void swap_reclaim_work(struct work_struct *work) =20 si =3D container_of(work, struct swap_info_struct, reclaim_work); =20 - spin_lock(&si->lock); swap_reclaim_full_clusters(si, true); - spin_unlock(&si->lock); } =20 /* @@ -781,23 +854,36 @@ static void swap_reclaim_work(struct work_struct *wor= k) static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,= int order, unsigned char usage) { - struct percpu_cluster *cluster; struct swap_cluster_info *ci; unsigned int offset, found =3D 0; =20 -new_cluster: - lockdep_assert_held(&si->lock); - cluster =3D this_cpu_ptr(si->percpu_cluster); - offset =3D cluster->next[order]; + /* Fast path using per CPU cluster */ + local_lock(&si->percpu_cluster->lock); + offset =3D __this_cpu_read(si->percpu_cluster->next[order]); if (offset) { - offset =3D alloc_swap_scan_cluster(si, offset, &found, order, usage); + ci =3D lock_cluster(si, offset); + /* Cluster could have been used by another order */ + if (cluster_is_usable(ci, order)) { + if (cluster_is_free(ci)) + offset =3D cluster_offset(si, ci); + offset =3D alloc_swap_scan_cluster(si, offset, &found, + order, usage); + } else { + unlock_cluster(ci); + } if (found) goto done; } =20 - if (!list_empty(&si->free_clusters)) { - ci =3D list_first_entry(&si->free_clusters, struct swap_cluster_info, li= st); - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, o= rder, usage); +new_cluster: + ci =3D 
cluster_isolate_lock(si, &si->free_clusters); + if (ci) { + offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, order, usage); + /* + * Allocation from free cluster must never fail and + * cluster lock must remain untouched. + */ VM_BUG_ON(!found); goto done; } @@ -807,49 +893,45 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o swap_reclaim_full_clusters(si, false); =20 if (order < PMD_ORDER) { - unsigned int frags =3D 0; + unsigned int frags =3D 0, frags_existing; =20 - while (!list_empty(&si->nonfull_clusters[order])) { - ci =3D list_first_entry(&si->nonfull_clusters[order], - struct swap_cluster_info, list); - cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG); + while ((ci =3D cluster_isolate_lock(si, &si->nonfull_clusters[order]))) { offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); - frags++; + /* + * With `fragmenting` set to true, it will surely take + * the cluster off nonfull list + */ if (found) goto done; + frags++; } =20 - /* - * Nonfull clusters are moved to frag tail if we reached - * here, count them too, don't over scan the frag list. - */ - while (frags < si->frag_cluster_nr[order]) { - ci =3D list_first_entry(&si->frag_clusters[order], - struct swap_cluster_info, list); + frags_existing =3D atomic_long_read(&si->frag_cluster_nr[order]); + while (frags < frags_existing && + (ci =3D cluster_isolate_lock(si, &si->frag_clusters[order]))) { + atomic_long_dec(&si->frag_cluster_nr[order]); /* - * Rotate the frag list to iterate, they were all failing - * high order allocation or moved here due to per-CPU usage, - * this help keeping usable cluster ahead. + * Rotate the frag list to iterate, they were all + * failing high order allocation or moved here due to + * per-CPU usage, but either way they could contain + * usable (eg. lazy-freed swap cache) slots. */ - list_move_tail(&ci->list, &si->frag_clusters[order]); offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); - frags++; if (found) goto done; + frags++; } } =20 - if (!list_empty(&si->discard_clusters)) { - /* - * we don't have free cluster but have some clusters in - * discarding, do discard now and reclaim them, then - * reread cluster_next_cpu since we dropped si->lock - */ - swap_do_scheduled_discard(si); + /* + * We don't have free cluster but have some clusters in + * discarding, do discard now and reclaim them, then + * reread cluster_next_cpu since we dropped si->lock + */ + if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si)) goto new_cluster; - } =20 if (order) goto done; @@ -860,26 +942,25 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o * Clusters here have at least one usable slots and can't fail order 0 * allocation, but reclaim may drop si->lock and race with another user. 
*/ - while (!list_empty(&si->frag_clusters[o])) { - ci =3D list_first_entry(&si->frag_clusters[o], - struct swap_cluster_info, list); + while ((ci =3D cluster_isolate_lock(si, &si->frag_clusters[o]))) { + atomic_long_dec(&si->frag_cluster_nr[o]); offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, 0, usage); + &found, order, usage); if (found) goto done; } =20 - while (!list_empty(&si->nonfull_clusters[o])) { - ci =3D list_first_entry(&si->nonfull_clusters[o], - struct swap_cluster_info, list); + while ((ci =3D cluster_isolate_lock(si, &si->nonfull_clusters[o]))) { offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, 0, usage); + &found, order, usage); if (found) goto done; } } done: - cluster->next[order] =3D offset; + __this_cpu_write(si->percpu_cluster->next[order], offset); + local_unlock(&si->percpu_cluster->lock); + return found; } =20 @@ -1135,14 +1216,11 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entr= ies[], int entry_order) plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - spin_lock(&si->lock); n_ret =3D scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal, swp_entries, order); - spin_unlock(&si->lock); put_swap_device(si); if (n_ret || size > 1) goto check_out; - cond_resched(); } =20 spin_lock(&swap_avail_lock); @@ -1355,9 +1433,7 @@ static bool __swap_entries_free(struct swap_info_stru= ct *si, if (!has_cache) { for (i =3D 0; i < nr; i++) zswap_invalidate(swp_entry(si->type, offset + i)); - spin_lock(&si->lock); swap_entry_range_free(si, entry, nr); - spin_unlock(&si->lock); } return has_cache; =20 @@ -1386,16 +1462,27 @@ static void swap_entry_range_free(struct swap_info_= struct *si, swp_entry_t entry unsigned char *map_end =3D map + nr_pages; struct swap_cluster_info *ci; =20 + /* It should never free entries across different clusters */ + VM_BUG_ON((offset / SWAPFILE_CLUSTER) !=3D ((offset + nr_pages - 1) / SWA= PFILE_CLUSTER)); + ci =3D lock_cluster(si, offset); + VM_BUG_ON(cluster_is_free(ci)); + VM_BUG_ON(ci->count < nr_pages); + + ci->count -=3D nr_pages; do { VM_BUG_ON(*map !=3D SWAP_HAS_CACHE); *map =3D 0; } while (++map < map_end); - dec_cluster_info_page(si, ci, nr_pages); - unlock_cluster(ci); =20 mem_cgroup_uncharge_swap(entry, nr_pages); swap_range_free(si, offset, nr_pages); + + if (!ci->count) + free_cluster(si, ci); + else + partial_free_cluster(si, ci); + unlock_cluster(ci); } =20 static void cluster_swap_free_nr(struct swap_info_struct *si, @@ -1467,9 +1554,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t = entry) ci =3D lock_cluster(si, offset); if (size > 1 && swap_is_has_cache(si, offset, size)) { unlock_cluster(ci); - spin_lock(&si->lock); swap_entry_range_free(si, entry, size); - spin_unlock(&si->lock); return; } for (int i =3D 0; i < size; i++, entry.val++) { @@ -1484,46 +1569,19 @@ void put_swap_folio(struct folio *folio, swp_entry_= t entry) unlock_cluster(ci); } =20 -static int swp_entry_cmp(const void *ent1, const void *ent2) -{ - const swp_entry_t *e1 =3D ent1, *e2 =3D ent2; - - return (int)swp_type(*e1) - (int)swp_type(*e2); -} - void swapcache_free_entries(swp_entry_t *entries, int n) { - struct swap_info_struct *si, *prev; int i; + struct swap_info_struct *si =3D NULL; =20 if (n <=3D 0) return; =20 - prev =3D NULL; - si =3D NULL; - - /* - * Sort swap entries by swap device, so each lock is only taken once. 
- * nr_swapfiles isn't absolutely correct, but the overhead of sort() is - * so low that it isn't necessary to optimize further. - */ - if (nr_swapfiles > 1) - sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL); for (i =3D 0; i < n; ++i) { si =3D _swap_info_get(entries[i]); - - if (si !=3D prev) { - if (prev !=3D NULL) - spin_unlock(&prev->lock); - if (si !=3D NULL) - spin_lock(&si->lock); - } if (si) swap_entry_range_free(si, entries[i], 1); - prev =3D si; } - if (si) - spin_unlock(&si->lock); } =3D20 int __swap_count(swp_entry_t entry)
@@ -1775,13 +1833,8 @@ swp_entry_t get_swap_page_of_type(int type) goto fail; =3D20 /* This is called for allocating swap entry, not cache */ - if (get_swap_device_info(si)) { - spin_lock(&si->lock); - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0= )) - atomic_long_dec(&nr_swap_pages); - spin_unlock(&si->lock); - put_swap_device(si); - } + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0)) + atomic_long_dec(&nr_swap_pages); fail: return entry; }
@@ -3098,6 +3151,7 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, cluster =3D per_cpu_ptr(si->percpu_cluster, cpu); for (i =3D 0; i < SWAP_NR_ORDERS; i++) cluster->next[i] =3D SWAP_NEXT_INVALID; + local_lock_init(&cluster->lock); } =3D20 /*
@@ -3121,7 +3175,7 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, for (i =3D 0; i < SWAP_NR_ORDERS; i++) { INIT_LIST_HEAD(&si->nonfull_clusters[i]); INIT_LIST_HEAD(&si->frag_clusters[i]); - si->frag_cluster_nr[i] =3D 0; + atomic_long_set(&si->frag_cluster_nr[i], 0); } =3D20 /*
@@ -3603,7 +3657,6 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) */ goto outer; } - spin_lock(&si->lock); =3D20 offset =3D swp_offset(entry); =3D20
@@ -3668,7 +3721,6 @@ int add_swap_count_continuation(swp_entry_t entry, gf= p_t gfp_mask) spin_unlock(&si->cont_lock); out: unlock_cluster(ci); - spin_unlock(&si->lock); put_swap_device(si); outer: if (page)
--=3D20 2.47.0

From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Tim Chen , Nhat Pham , linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 10/13] mm, swap: simplify percpu cluster updating
Date: Wed, 23 Oct 2024 03:24:48 +0800
Message-ID: <20241022192451.38138-11-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>

From: Kairui Song

Instead of using a return argument, we can simply store the next cluster offset in its fixed percpu location, which reduces the stack usage and simplifies the function:

Object size: ./scripts/bloat-o-meter mm/swapfile.o mm/swapfile.o.new add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-271 (-271) Function old new delta get_swap_pages 2847 2733 -114
alloc_swap_scan_cluster 894 737 -157 Total: Before=3D30833, After=3D30562, chg -0.88% Stack usage: Before: swapfile.c:1190:5:get_swap_pages 240 static After: swapfile.c:1185:5:get_swap_pages 216 static Signed-off-by: Kairui Song Suggested-by: Chris Li --- include/linux/swap.h | 4 ++-- mm/swapfile.c | 57 ++++++++++++++++++++------------------------ 2 files changed, 28 insertions(+), 33 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index a3b5d74b095a..0e6c6bb385f0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -276,9 +276,9 @@ enum swap_cluster_flags { * The first page in the swap file is the swap header, which is always mar= ked * bad to prevent it from being allocated as an entry. This also prevents = the * cluster to which it belongs being marked free. Therefore 0 is safe to u= se as - * a sentinel to indicate next is not valid in percpu_cluster. + * a sentinel to indicate an entry is not valid. */ -#define SWAP_NEXT_INVALID 0 +#define SWAP_ENTRY_INVALID 0 =20 #ifdef CONFIG_THP_SWAP #define SWAP_NR_ORDERS (PMD_ORDER + 1) diff --git a/mm/swapfile.c b/mm/swapfile.c index a19ee8d5ffd0..f529e2ce2019 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -743,11 +743,14 @@ static void cluster_alloc_range(struct swap_info_stru= ct *si, struct swap_cluster ci->count +=3D nr_pages; } =20 -static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, u= nsigned long offset, - unsigned int *foundp, unsigned int order, +/* Try use a new cluster for current CPU and allocate from it. */ +static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, + unsigned long offset, + unsigned int order, unsigned char usage) { - unsigned long start =3D offset & ~(SWAPFILE_CLUSTER - 1); + unsigned int next =3D SWAP_ENTRY_INVALID, found =3D SWAP_ENTRY_INVALID; + unsigned long start =3D ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages =3D 1 << order; bool need_reclaim, ret; @@ -756,10 +759,8 @@ static unsigned int alloc_swap_scan_cluster(struct swa= p_info_struct *si, unsigne ci =3D &si->cluster_info[offset / SWAPFILE_CLUSTER]; lockdep_assert_held(&ci->lock); =20 - if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) { - offset =3D SWAP_NEXT_INVALID; + if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) goto out; - } =20 for (end -=3D nr_pages; offset <=3D end; offset +=3D nr_pages) { need_reclaim =3D false; @@ -773,10 +774,8 @@ static unsigned int alloc_swap_scan_cluster(struct swa= p_info_struct *si, unsigne * cluster has no flag set, and change of list * won't cause fragmentation. 
*/ - if (!cluster_is_usable(ci, order)) { - offset =3D SWAP_NEXT_INVALID; + if (!cluster_is_usable(ci, order)) goto out; - } if (cluster_is_free(ci)) offset =3D start; /* Reclaim failed but cluster is usable, try next */ @@ -784,20 +783,17 @@ static unsigned int alloc_swap_scan_cluster(struct sw= ap_info_struct *si, unsigne continue; } cluster_alloc_range(si, ci, offset, usage, order); - *foundp =3D offset; - if (ci->count =3D=3D SWAPFILE_CLUSTER) { - offset =3D SWAP_NEXT_INVALID; - goto out; - } + found =3D offset; offset +=3D nr_pages; + if (ci->count < SWAPFILE_CLUSTER && offset <=3D end) + next =3D offset; break; } - if (offset > end) - offset =3D SWAP_NEXT_INVALID; out: relocate_cluster(si, ci); unlock_cluster(ci); - return offset; + __this_cpu_write(si->percpu_cluster->next[order], next); + return found; } =20 /* Return true if reclaimed a whole cluster */ @@ -866,8 +862,8 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o if (cluster_is_usable(ci, order)) { if (cluster_is_free(ci)) offset =3D cluster_offset(si, ci); - offset =3D alloc_swap_scan_cluster(si, offset, &found, - order, usage); + found =3D alloc_swap_scan_cluster(si, offset, + order, usage); } else { unlock_cluster(ci); } @@ -878,8 +874,8 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o new_cluster: ci =3D cluster_isolate_lock(si, &si->free_clusters); if (ci) { - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), + order, usage); /* * Allocation from free cluster must never fail and * cluster lock must remain untouched. @@ -896,8 +892,8 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o unsigned int frags =3D 0, frags_existing; =20 while ((ci =3D cluster_isolate_lock(si, &si->nonfull_clusters[order]))) { - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), + order, usage); /* * With `fragmenting` set to true, it will surely take * the cluster off nonfull list @@ -917,8 +913,8 @@ static unsigned long cluster_alloc_swap_entry(struct sw= ap_info_struct *si, int o * per-CPU usage, but either way they could contain * usable (eg. lazy-freed swap cache) slots. 
 */ - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), + order, usage); if (found) goto done; frags++;
@@ -944,21 +940,20 @@ static unsigned long cluster_alloc_swap_entry(struct = swap_info_struct *si, int o */ while ((ci =3D cluster_isolate_lock(si, &si->frag_clusters[o]))) { atomic_long_dec(&si->frag_cluster_nr[o]); - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), + 0, usage); if (found) goto done; } =3D20 while ((ci =3D cluster_isolate_lock(si, &si->nonfull_clusters[o]))) { - offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), - &found, order, usage); + found =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), + 0, usage); if (found) goto done; } } done: - __this_cpu_write(si->percpu_cluster->next[order], offset); local_unlock(&si->percpu_cluster->lock); =3D20 return found;
@@ -3150,7 +3145,7 @@ static struct swap_cluster_info *setup_clusters(struc= t swap_info_struct *si, =3D20 cluster =3D per_cpu_ptr(si->percpu_cluster, cpu); for (i =3D 0; i < SWAP_NR_ORDERS; i++) - cluster->next[i] =3D SWAP_NEXT_INVALID; + cluster->next[i] =3D SWAP_ENTRY_INVALID; local_lock_init(&cluster->lock); } =3D20
--=3D20 2.47.0

From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Tim Chen , Nhat Pham , linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 11/13] mm, swap: introduce a helper for retrieving cluster from offset
Date: Wed, 23 Oct 2024 03:24:49 +0800
Message-ID: <20241022192451.38138-12-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>

From: Kairui Song

Retrieving the cluster info from an offset is a common operation, so introduce a helper for it.
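For readers following along outside the tree, what the helper wraps is plain integer division from a swap offset to its cluster slot. Below is a minimal standalone sketch, not the kernel code: the 512-entry cluster size is an assumption matching SWAPFILE_CLUSTER on a typical x86_64 config with 4K pages, and the first_offset member exists only for the demo.

#include <assert.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER 512UL	/* assumed cluster size */

struct swap_cluster_info { unsigned long first_offset; };

/* Model of offset_to_cluster(): index the cluster table by offset / cluster size. */
static struct swap_cluster_info *offset_to_cluster(struct swap_cluster_info *table,
						   unsigned long offset)
{
	return &table[offset / SWAPFILE_CLUSTER];
}

int main(void)
{
	struct swap_cluster_info table[4] = { {0}, {512}, {1024}, {1536} };

	/* Offsets 0..511 map to cluster 0, 512..1023 to cluster 1, and so on. */
	assert(offset_to_cluster(table, 511) == &table[0]);
	assert(offset_to_cluster(table, 512) == &table[1]);
	printf("offset 1300 -> cluster base %lu\n",
	       offset_to_cluster(table, 1300)->first_offset);
	return 0;
}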
Suggested-by: Chris Li Signed-off-by: Kairui Song --- mm/swapfile.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c index f529e2ce2019..f25d697f6736 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c
@@ -423,6 +423,12 @@ static inline unsigned int cluster_index(struct swap_i= nfo_struct *si, return ci - si->cluster_info; } =3D20 +static inline struct swap_cluster_info *offset_to_cluster(struct swap_info= _struct *si, + unsigned long offset) +{ + return &si->cluster_info[offset / SWAPFILE_CLUSTER]; +} + static inline unsigned int cluster_offset(struct swap_info_struct *si, struct swap_cluster_info *ci) {
@@ -434,7 +440,7 @@ static inline struct swap_cluster_info *lock_cluster(st= ruct swap_info_struct *si { struct swap_cluster_info *ci; =3D20 - ci =3D &si->cluster_info[offset / SWAPFILE_CLUSTER]; + ci =3D offset_to_cluster(si, offset); spin_lock(&ci->lock); =3D20 return ci;
@@ -756,7 +762,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap= _info_struct *si, bool need_reclaim, ret; struct swap_cluster_info *ci; =3D20 - ci =3D &si->cluster_info[offset / SWAPFILE_CLUSTER]; + ci =3D offset_to_cluster(si, offset); lockdep_assert_held(&ci->lock); =3D20 if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
@@ -1457,10 +1463,10 @@ static void swap_entry_range_free(struct swap_info_= struct *si, swp_entry_t entry unsigned char *map_end =3D map + nr_pages; struct swap_cluster_info *ci; =3D20 - /* It should never free entries across different clusters */ - VM_BUG_ON((offset / SWAPFILE_CLUSTER) !=3D ((offset + nr_pages - 1) / SWA= PFILE_CLUSTER)); - ci =3D lock_cluster(si, offset); + + /* It should never free entries across different clusters */ + VM_BUG_ON(ci !=3D offset_to_cluster(si, offset + nr_pages - 1)); VM_BUG_ON(cluster_is_free(ci)); VM_BUG_ON(ci->count < nr_pages); =3D20
--=3D20 2.47.0

From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Tim Chen , Nhat Pham , linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 12/13] mm, swap: use a global swap cluster for non-rotation device
Date: Wed, 23 Oct 2024 03:24:50 +0800
Message-ID: <20241022192451.38138-13-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>

From: Kairui Song

Non-rotational devices (SSD / ZRAM) can tolerate fragmentation, so the goal of the SWAP allocator there is to avoid contention on clusters. It therefore uses a per-CPU cluster design, and each CPU uses a different cluster as much as possible. But HDD is very sensitive to fragmentation, and contention is trivial by comparison, so just use one global cluster instead.
This ensures each order is written to the same cluster as much as possible, which helps make the IO more continuous and keeps the performance of the cluster allocator as good as the old allocator's.

Test after this commit compared to before this series: make -j32 with tinyconfig, using 1G memcg limit and HDD swap:

Before this series: 114.44user 29.11system 39:42.90elapsed 6%CPU (0avgtext+0avgdata 157284maxre= sident)k 2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this commit: 113.90user 23.81system 38:11.77elapsed 6%CPU (0avgtext+0avgdata 157260maxre= sident)k 2548728inputs+0outputs (235471major+4238110minor)pagefaults

Suggested-by: Chris Li Signed-off-by: Kairui Song --- include/linux/swap.h | 2 ++ mm/swapfile.c | 48 ++++++++++++++++++++++++++++++++------------ 2 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h index 0e6c6bb385f0..9898b1881d4d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h
@@ -319,6 +319,8 @@ struct swap_info_struct { unsigned int pages; /* total of usable pages of swap */ atomic_long_t inuse_pages; /* number of those currently in use */ struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap locatio= n */ + struct percpu_cluster *global_cluster; /* Use one global cluster for rota= ting device */ + spinlock_t global_cluster_lock; /* Serialize usage of global cluster */ struct rb_root swap_extent_root;/* root of the swap extent rbtree */ struct block_device *bdev; /* swap device or bdev of swap file */ struct file *swap_file; /* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c index f25d697f6736..6eb298a222c0 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c
@@ -798,7 +798,10 @@ static unsigned int alloc_swap_scan_cluster(struct swa= p_info_struct *si, out: relocate_cluster(si, ci); unlock_cluster(ci); - __this_cpu_write(si->percpu_cluster->next[order], next); + if (si->flags & SWP_SOLIDSTATE) + __this_cpu_write(si->percpu_cluster->next[order], next); + else + si->global_cluster->next[order] =3D next; return found; } =3D20
@@ -860,8 +863,14 @@ static unsigned long cluster_alloc_swap_entry(struct s= wap_info_struct *si, int o unsigned int offset, found =3D 0; =3D20 /* Fast path using per CPU cluster */ - local_lock(&si->percpu_cluster->lock); - offset =3D __this_cpu_read(si->percpu_cluster->next[order]); + if (si->flags & SWP_SOLIDSTATE) { + local_lock(&si->percpu_cluster->lock); + offset =3D __this_cpu_read(si->percpu_cluster->next[order]); + } else { + spin_lock(&si->global_cluster_lock); + offset =3D si->global_cluster->next[order]; + } + if (offset) { ci =3D lock_cluster(si, offset); /* Cluster could have been used by another order */
@@ -960,8 +969,10 @@ static unsigned long cluster_alloc_swap_entry(struct s= wap_info_struct *si, int o } } done: - local_unlock(&si->percpu_cluster->lock); - + if (si->flags & SWP_SOLIDSTATE) + local_unlock(&si->percpu_cluster->lock); + else + spin_unlock(&si->global_cluster_lock); return found; } =3D20
@@ -2737,6 +2748,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, special= file) mutex_unlock(&swapon_mutex); free_percpu(p->percpu_cluster); p->percpu_cluster =3D NULL; + kfree(p->global_cluster); + p->global_cluster =3D NULL; vfree(swap_map); kvfree(zeromap); kvfree(cluster_info);
@@ -3142,17 +3155,24 @@ static struct swap_cluster_info *setup_clusters(str= uct swap_info_struct *si, for (i =3D 0; i < nr_clusters; i++) spin_lock_init(&cluster_info[i].lock); =3D20 - si->percpu_cluster =3D alloc_percpu(struct percpu_cluster); - if
(!si->percpu_cluster) - goto err_free; + if (si->flags & SWP_SOLIDSTATE) { + si->percpu_cluster =3D alloc_percpu(struct percpu_cluster); + if (!si->percpu_cluster) + goto err_free; =3D20 - for_each_possible_cpu(cpu) { - struct percpu_cluster *cluster; + for_each_possible_cpu(cpu) { + struct percpu_cluster *cluster; =3D20 - cluster =3D per_cpu_ptr(si->percpu_cluster, cpu); + cluster =3D per_cpu_ptr(si->percpu_cluster, cpu); + for (i =3D 0; i < SWAP_NR_ORDERS; i++) + cluster->next[i] =3D SWAP_ENTRY_INVALID; + local_lock_init(&cluster->lock); + } + } else { + si->global_cluster =3D kmalloc(sizeof(*si->global_cluster), GFP_KERNEL); for (i =3D 0; i < SWAP_NR_ORDERS; i++) - cluster->next[i] =3D SWAP_ENTRY_INVALID; - local_lock_init(&cluster->lock); + si->global_cluster->next[i] =3D SWAP_ENTRY_INVALID; + spin_lock_init(&si->global_cluster_lock); } =3D20 /*
@@ -3426,6 +3446,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialf= ile, int, swap_flags) bad_swap: free_percpu(si->percpu_cluster); si->percpu_cluster =3D NULL; + kfree(si->global_cluster); + si->global_cluster =3D NULL; inode =3D NULL; destroy_swap_extents(si); swap_cgroup_swapoff(si->type);
--=3D20 2.47.0

From nobody Tue Nov 26 01:55:10 2024
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Yosry Ahmed , "Huang, Ying" , Tim Chen , Nhat Pham , linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 13/13] mm, swap_slots: remove slot cache for freeing path
Date: Wed, 23 Oct 2024 03:37:42 +0800
Message-ID: <20241022193742.43903-1-ryncsn@gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>

From: Kairui Song

The slot cache on the freeing path is mostly there to reduce the overhead of si->lock. As we have basically eliminated si->lock usage on the freeing path, the cache can simply be removed. This helps simplify the code and avoids swap entries being held in the cache after they are freed. The delayed freeing of entries has been causing trouble for further zswap optimizations [1], and in theory it also causes more fragmentation and extra overhead.
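Why the cache is a problem is easiest to see in a toy model. This is a sketch, not the kernel code, and CACHE_SIZE, swap_map and the two free functions are all invented for the demo: an entry parked in the batched return cache still looks allocated to everyone else until the batch drains, while direct freeing releases it immediately.

#include <stdio.h>

#define CACHE_SIZE 64			/* assumed, mirrors the idea of SWAP_SLOTS_CACHE_SIZE */

static int swap_map[1024];		/* 0 = free, nonzero = in use */
static int cache[CACHE_SIZE], n_ret;	/* batched "return" cache (old scheme) */

/* Old scheme, modeled: park freed slots until the batch fills. */
static void free_batched(int slot)
{
	cache[n_ret++] = slot;
	if (n_ret == CACHE_SIZE) {
		for (int i = 0; i < n_ret; i++)
			swap_map[cache[i]] = 0;	/* only now truly freed */
		n_ret = 0;
	}
}

/* New scheme, modeled: free immediately under the cluster lock. */
static void free_direct(int slot)
{
	swap_map[slot] = 0;
}

int main(void)
{
	swap_map[42] = 1;
	free_batched(42);
	printf("batched: slot 42 still looks used? %s\n", swap_map[42] ? "yes" : "no");

	swap_map[43] = 1;
	free_direct(43);
	printf("direct:  slot 43 free? %s\n", swap_map[43] ? "yes" : "no");
	return 0;
}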
Tests with kernel builds showed both performance and fragmentation are better without the cache:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, avg of 4 test runs:

Before: Sys time: 36047.78, Real time: 472.43

After: (-7.6% sys time, -7.3% real time) Sys time: 33314.76, Real time: 437.67

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, avg of 4 test runs:

Before: Sys time: 46859.04, Real time: 562.63 hugepages-64kB/stats/swpout: 1783392 hugepages-64kB/stats/swpout_fallback: 240875

After: (-23.3% sys time, -21.3% real time) Sys time: 35958.87, Real time: 442.69 hugepages-64kB/stats/swpout: 1866267 hugepages-64kB/stats/swpout_fallback: 158330

Sequential SWAP should also be slightly faster; tests didn't show a measurable difference, but at least there is no regression:

Swapin 4G zero page on ZRAM (time in us): Before (avg. 1923756): 1912391 1927023 1927957 1916527 1918263 1914284 1934753 1940813 1921791 After (avg. 1922290): 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

Link: https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdme= sW_59W1BWw@mail.gmail.com/ [1]

Suggested-by: Chris Li Signed-off-by: Kairui Song --- include/linux/swap_slots.h | 3 -- mm/swap_slots.c | 78 +++++---------------------------- mm/swapfile.c | 89 +++++++++++++++----------------------- 3 files changed, 44 insertions(+), 126 deletions(-)

diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h index 15adfb8c813a..840aec3523b2 100644 --- a/include/linux/swap_slots.h +++ b/include/linux/swap_slots.h
@@ -16,15 +16,12 @@ struct swap_slots_cache { swp_entry_t *slots; int nr; int cur; - spinlock_t free_lock; /* protects slots_ret, n_ret */ - swp_entry_t *slots_ret; int n_ret; }; =3D20 void disable_swap_slots_cache_lock(void); void reenable_swap_slots_cache_unlock(void); void enable_swap_slots_cache(void); -void free_swap_slot(swp_entry_t entry); =3D20 extern bool swap_slot_cache_enabled; =3D20
diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 13ab3b771409..9c7c171df7ba 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c
@@ -43,17 +43,15 @@ static DEFINE_MUTEX(swap_slots_cache_mutex); /* Serialize swap slots cache enable/disable operations */ static DEFINE_MUTEX(swap_slots_cache_enable_mutex); =3D20 -static void __drain_swap_slots_cache(unsigned int type); +static void __drain_swap_slots_cache(void); =3D20 #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_ena= bled) -#define SLOTS_CACHE 0x1 -#define SLOTS_CACHE_RET 0x2 =3D20 static void deactivate_swap_slots_cache(void) { mutex_lock(&swap_slots_cache_mutex); swap_slot_cache_active =3D false; - __drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET); + __drain_swap_slots_cache(); mutex_unlock(&swap_slots_cache_mutex); } =3D20
@@ -72,7 +70,7 @@ void disable_swap_slots_cache_lock(void) if (swap_slot_cache_initialized) { /* serialize with cpu hotplug operations */ cpus_read_lock(); - __drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET); + __drain_swap_slots_cache(); cpus_read_unlock(); } }
@@ -113,7 +111,7 @@ static bool check_cache_active(void) static int alloc_swap_slot_cache(unsigned int cpu) { struct swap_slots_cache *cache; - swp_entry_t *slots, *slots_ret; + swp_entry_t *slots; =3D20 /* * Do allocation outside swap_slots_cache_mutex
@@ -125,28 +123,19 @@ static int alloc_swap_slot_cache(unsigned int cpu) if (!slots) return -ENOMEM; =3D20 - slots_ret =3D kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_entry_t), - GFP_KERNEL); - if (!slots_ret) { - kvfree(slots); - return -ENOMEM; - } - mutex_lock(&swap_slots_cache_mutex); cache
=3D &per_cpu(swp_slots, cpu); - if (cache->slots || cache->slots_ret) { + if (cache->slots) { /* cache already allocated */ mutex_unlock(&swap_slots_cache_mutex); =20 kvfree(slots); - kvfree(slots_ret); =20 return 0; } =20 if (!cache->lock_initialized) { mutex_init(&cache->alloc_lock); - spin_lock_init(&cache->free_lock); cache->lock_initialized =3D true; } cache->nr =3D 0; @@ -160,19 +149,16 @@ static int alloc_swap_slot_cache(unsigned int cpu) */ mb(); cache->slots =3D slots; - cache->slots_ret =3D slots_ret; mutex_unlock(&swap_slots_cache_mutex); return 0; } =20 -static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type, - bool free_slots) +static void drain_slots_cache_cpu(unsigned int cpu, bool free_slots) { struct swap_slots_cache *cache; - swp_entry_t *slots =3D NULL; =20 cache =3D &per_cpu(swp_slots, cpu); - if ((type & SLOTS_CACHE) && cache->slots) { + if (cache->slots) { mutex_lock(&cache->alloc_lock); swapcache_free_entries(cache->slots + cache->cur, cache->nr); cache->cur =3D 0; @@ -183,20 +169,9 @@ static void drain_slots_cache_cpu(unsigned int cpu, un= signed int type, } mutex_unlock(&cache->alloc_lock); } - if ((type & SLOTS_CACHE_RET) && cache->slots_ret) { - spin_lock_irq(&cache->free_lock); - swapcache_free_entries(cache->slots_ret, cache->n_ret); - cache->n_ret =3D 0; - if (free_slots && cache->slots_ret) { - slots =3D cache->slots_ret; - cache->slots_ret =3D NULL; - } - spin_unlock_irq(&cache->free_lock); - kvfree(slots); - } } =20 -static void __drain_swap_slots_cache(unsigned int type) +static void __drain_swap_slots_cache(void) { unsigned int cpu; =20 @@ -224,13 +199,13 @@ static void __drain_swap_slots_cache(unsigned int typ= e) * There are no slots on such cpu that need to be drained. */ for_each_online_cpu(cpu) - drain_slots_cache_cpu(cpu, type, false); + drain_slots_cache_cpu(cpu, false); } =20 static int free_slot_cache(unsigned int cpu) { mutex_lock(&swap_slots_cache_mutex); - drain_slots_cache_cpu(cpu, SLOTS_CACHE | SLOTS_CACHE_RET, true); + drain_slots_cache_cpu(cpu, true); mutex_unlock(&swap_slots_cache_mutex); return 0; } @@ -269,39 +244,6 @@ static int refill_swap_slots_cache(struct swap_slots_c= ache *cache) return cache->nr; } =20 -void free_swap_slot(swp_entry_t entry) -{ - struct swap_slots_cache *cache; - - /* Large folio swap slot is not covered. */ - zswap_invalidate(entry); - - cache =3D raw_cpu_ptr(&swp_slots); - if (likely(use_swap_slot_cache && cache->slots_ret)) { - spin_lock_irq(&cache->free_lock); - /* Swap slots cache may be deactivated before acquiring lock */ - if (!use_swap_slot_cache || !cache->slots_ret) { - spin_unlock_irq(&cache->free_lock); - goto direct_free; - } - if (cache->n_ret >=3D SWAP_SLOTS_CACHE_SIZE) { - /* - * Return slots to global pool. - * The current swap_map value is SWAP_HAS_CACHE. 
- * Set it to 0 to indicate it is available for - * allocation in global pool - */ - swapcache_free_entries(cache->slots_ret, cache->n_ret); - cache->n_ret =3D 0; - } - cache->slots_ret[cache->n_ret++] =3D entry; - spin_unlock_irq(&cache->free_lock); - } else { -direct_free: - swapcache_free_entries(&entry, 1); - } -} - swp_entry_t folio_alloc_swap(struct folio *folio) { swp_entry_t entry; diff --git a/mm/swapfile.c b/mm/swapfile.c index 6eb298a222c0..c77b6ec3c83b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -53,14 +53,15 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); -static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t= entry, - unsigned int nr_pages); +static void swap_entry_range_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static bool folio_swapcache_freeable(struct folio *folio); static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, unsigned long offset); -static void unlock_cluster(struct swap_cluster_info *ci); +static inline void unlock_cluster(struct swap_cluster_info *ci); =20 static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; @@ -260,10 +261,9 @@ static int __try_to_reclaim_swap(struct swap_info_stru= ct *si, folio_ref_sub(folio, nr_pages); folio_set_dirty(folio); =20 - /* Only sinple page folio can be backed by zswap */ - if (nr_pages =3D=3D 1) - zswap_invalidate(entry); - swap_entry_range_free(si, entry, nr_pages); + ci =3D lock_cluster(si, offset); + swap_entry_range_free(si, ci, entry, nr_pages); + unlock_cluster(ci); ret =3D nr_pages; out_unlock: folio_unlock(folio); @@ -1105,8 +1105,10 @@ static void swap_range_free(struct swap_info_struct = *si, unsigned long offset, * Use atomic clear_bit operations only on zeromap instead of non-atomic * bitmap_clear to prevent adjacent bits corruption due to simultaneous w= rites. */ - for (i =3D 0; i < nr_entries; i++) + for (i =3D 0; i < nr_entries; i++) { clear_bit(offset + i, si->zeromap); + zswap_invalidate(swp_entry(si->type, offset + i)); + } =20 if (si->flags & SWP_BLKDEV) swap_slot_free_notify =3D @@ -1410,9 +1412,9 @@ static unsigned char __swap_entry_free(struct swap_in= fo_struct *si, =20 ci =3D lock_cluster(si, offset); usage =3D __swap_entry_free_locked(si, offset, 1); - unlock_cluster(ci); if (!usage) - free_swap_slot(entry); + swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + unlock_cluster(ci); =20 return usage; } @@ -1440,13 +1442,10 @@ static bool __swap_entries_free(struct swap_info_st= ruct *si, } for (i =3D 0; i < nr; i++) WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); + if (!has_cache) + swap_entry_range_free(si, ci, entry, nr); unlock_cluster(ci); =20 - if (!has_cache) { - for (i =3D 0; i < nr; i++) - zswap_invalidate(swp_entry(si->type, offset + i)); - swap_entry_range_free(si, entry, nr); - } return has_cache; =20 fallback: @@ -1466,15 +1465,13 @@ static bool __swap_entries_free(struct swap_info_st= ruct *si, * Drop the last HAS_CACHE flag of swap entries, caller have to * ensure all entries belong to the same cgroup. 
*/ -static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t= entry, - unsigned int nr_pages) +static void swap_entry_range_free(struct swap_info_struct *si, + struct swap_cluster_info *ci, + swp_entry_t entry, unsigned int nr_pages) { unsigned long offset =3D swp_offset(entry); unsigned char *map =3D si->swap_map + offset; unsigned char *map_end =3D map + nr_pages; - struct swap_cluster_info *ci; - - ci =3D lock_cluster(si, offset); =20 /* It should never free entries across different clusters */ VM_BUG_ON(ci !=3D offset_to_cluster(si, offset + nr_pages - 1)); @@ -1494,7 +1491,6 @@ static void swap_entry_range_free(struct swap_info_st= ruct *si, swp_entry_t entry free_cluster(si, ci); else partial_free_cluster(si, ci); - unlock_cluster(ci); } =20 static void cluster_swap_free_nr(struct swap_info_struct *si, @@ -1502,28 +1498,13 @@ static void cluster_swap_free_nr(struct swap_info_s= truct *si, unsigned char usage) { struct swap_cluster_info *ci; - DECLARE_BITMAP(to_free, BITS_PER_LONG) =3D { 0 }; - int i, nr; + unsigned long end =3D offset + nr_pages; =20 ci =3D lock_cluster(si, offset); - while (nr_pages) { - nr =3D min(BITS_PER_LONG, nr_pages); - for (i =3D 0; i < nr; i++) { - if (!__swap_entry_free_locked(si, offset + i, usage)) - bitmap_set(to_free, i, 1); - } - if (!bitmap_empty(to_free, BITS_PER_LONG)) { - unlock_cluster(ci); - for_each_set_bit(i, to_free, BITS_PER_LONG) - free_swap_slot(swp_entry(si->type, offset + i)); - if (nr =3D=3D nr_pages) - return; - bitmap_clear(to_free, 0, BITS_PER_LONG); - ci =3D lock_cluster(si, offset); - } - offset +=3D nr; - nr_pages -=3D nr; - } + do { + if (!__swap_entry_free_locked(si, offset, usage)) + swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + } while (++offset < end); unlock_cluster(ci); } =20 @@ -1564,18 +1545,12 @@ void put_swap_folio(struct folio *folio, swp_entry_= t entry) return; =20 ci =3D lock_cluster(si, offset); - if (size > 1 && swap_is_has_cache(si, offset, size)) { - unlock_cluster(ci); - swap_entry_range_free(si, entry, size); - return; - } - for (int i =3D 0; i < size; i++, entry.val++) { - if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) { - unlock_cluster(ci); - free_swap_slot(entry); - if (i =3D=3D size - 1) - return; - lock_cluster(si, offset); + if (swap_is_has_cache(si, offset, size)) + swap_entry_range_free(si, ci, entry, size); + else { + for (int i =3D 0; i < size; i++, entry.val++) { + if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) + swap_entry_range_free(si, ci, entry, 1); } } unlock_cluster(ci); @@ -1584,6 +1559,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t = entry) void swapcache_free_entries(swp_entry_t *entries, int n) { int i; + struct swap_cluster_info *ci; struct swap_info_struct *si =3D NULL; =20 if (n <=3D 0) @@ -1591,8 +1567,11 @@ void swapcache_free_entries(swp_entry_t *entries, in= t n) =20 for (i =3D 0; i < n; ++i) { si =3D _swap_info_get(entries[i]); - if (si) - swap_entry_range_free(si, entries[i], 1); + if (si) { + ci =3D lock_cluster(si, swp_offset(entries[i])); + swap_entry_range_free(si, ci, entries[i], 1); + unlock_cluster(ci); + } } } =20 --=20 2.47.0
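One property worth calling out at the end of the series: swap_entry_range_free() now runs with a single cluster lock already held, so a batched free must never span two clusters, which is exactly what the VM_BUG_ON introduced in patch 11 rechecks. Below is a minimal standalone model of that invariant; it is a sketch, not kernel code, with an assumed 512-entry cluster size (SWAP_HAS_CACHE's 0x40 does match the kernel's flag value).

#include <assert.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER 512UL	/* assumed cluster size */
#define SWAP_HAS_CACHE   0x40	/* same value as the kernel flag */

static unsigned char swap_map[2048];

/* Modeled swap_entry_range_free(): all nr entries must sit in one cluster. */
static void entry_range_free(unsigned long offset, unsigned int nr)
{
	assert(offset / SWAPFILE_CLUSTER ==
	       (offset + nr - 1) / SWAPFILE_CLUSTER);

	for (unsigned int i = 0; i < nr; i++) {
		assert(swap_map[offset + i] == SWAP_HAS_CACHE);
		swap_map[offset + i] = 0;
	}
}

int main(void)
{
	for (int i = 0; i < 4; i++)
		swap_map[512 + i] = SWAP_HAS_CACHE;

	entry_range_free(512, 4);	/* fine: stays inside cluster 1 */
	printf("freed 4 entries within one cluster\n");
	/* entry_range_free(510, 4) would trip the first assert: it crosses clusters. */
	return 0;
}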