From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Tim Chen, Nhat Pham, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 09/13] mm, swap: reduce contention on device lock
Date: Wed, 23 Oct 2024 03:24:47 +0800
Message-ID: <20241022192451.38138-10-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kairui Song

Currently, swap locking is mainly composed of two locks: the cluster lock
(ci->lock) and the device lock (si->lock). The cluster lock is much more
fine-grained, so it is best to use ci->lock instead of si->lock wherever
possible.

Following the new cluster allocator design, many operations don't need to
touch si->lock at all. In practice, we only need to take si->lock when
moving clusters between lists.

To achieve this, this commit reworks the locking pattern of all si->lock
and ci->lock users, eliminates all usage of ci->lock inside si->lock, and
introduces a new design to avoid touching si->lock as much as possible.

For minimal allocation contention and easier understanding, two ideas are
introduced along with the corresponding helpers: `isolation` and
`relocation`:

- Clusters are `isolated` from their list when scanned for allocation,
  so scanning an on-list cluster no longer needs to hold si->lock except
  for that brief moment, and this removes the ci->lock usage inside
  si->lock. In the new allocator design, a cluster always gets moved
  after scanning (free -> nonfull, nonfull -> frag, frag -> frag tail),
  so this introduces no extra overhead.
  This also greatly reduces the contention on both si->lock and ci->lock,
  as other CPUs won't walk onto the same cluster by iterating the list.
  The off-list time window of a cluster is also minimal: one CPU can hold
  at most one cluster while scanning the 512 entries on it, where we
  previously had to busy-wait with a spin lock. This is done with
  `cluster_isolate_lock` when scanning a new cluster.

  Note: scanning the per-CPU cluster is a special case that doesn't
  isolate the cluster. That's because it doesn't need to hold si->lock at
  all; it simply acquires the ci->lock of the previously used cluster and
  uses it.

- Clusters are `relocated` after allocation or freeing, according to
  their count and status. Allocations no longer hold si->lock, and may
  drop ci->lock for reclaim, so a cluster could be moved anywhere in the
  meantime. Besides, `isolation` clears all flags when it takes a cluster
  off-list (the flags must be in sync with the list status, so cluster
  users don't need to touch si->lock to check a cluster's list status;
  this is important for reducing contention on si->lock). So after
  allocation, the cluster has to be `relocated` to the right list
  according to its usage. This is done with `relocate_cluster` after
  allocation, or `[partial_]free_cluster` after freeing.

Now, except for swapon / swapoff and discard, `isolation` and
`relocation` are the only two places that need to take si->lock. And as
each CPU keeps using its per-CPU cluster as much as possible, and a
cluster has 512 entries to be consumed, si->lock is rarely touched. The
lock contention on si->lock is now barely observable.
Testing a Linux kernel build with defconfig showed a huge performance
improvement:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C:
Before: Sys time: 73578.30, Real time: 864.05
After: (-50.7% sys time, -44.8% real time)
Sys time: 36227.49, Real time: 476.66

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C:
(avg of 4 test runs)
Before: Sys time: 74044.85, Real time: 846.51
hugepages-64kB/stats/swpout: 1735216
hugepages-64kB/stats/swpout_fallback: 430333
After: (-40.4% sys time, -37.1% real time)
Sys time: 44160.56, Real time: 532.07
hugepages-64kB/stats/swpout: 1786288
hugepages-64kB/stats/swpout_fallback: 243384

time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62:
Before: Sys time: 8098.21, Real time: 401.30
After: (-22.6% sys time, -12.8% real time)
Sys time: 6265.02, Real time: 349.83

The allocation success rate also slightly improved, as we sanitized the
usage of clusters with newly defined helpers and locks, so a temporarily
dropped si->lock or ci->lock won't cause a cluster order shuffle.

Suggested-by: Chris Li
Signed-off-by: Kairui Song
---
 include/linux/swap.h |   5 +-
 mm/swapfile.c        | 418 ++++++++++++++++++++++++-------------------
 2 files changed, 239 insertions(+), 184 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 75fc2da1767d..a3b5d74b095a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -265,6 +265,8 @@ enum swap_cluster_flags {
 	CLUSTER_FLAG_FREE,
 	CLUSTER_FLAG_NONFULL,
 	CLUSTER_FLAG_FRAG,
+	/* Clusters with flags above are allocatable */
+	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
 	CLUSTER_FLAG_FULL,
 	CLUSTER_FLAG_DISCARD,
 	CLUSTER_FLAG_MAX,
@@ -290,6 +292,7 @@ enum swap_cluster_flags {
 	 * throughput.
 	 */
 struct percpu_cluster {
+	local_lock_t lock; /* Protect the percpu_cluster above */
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
@@ -312,7 +315,7 @@ struct swap_info_struct {
 	/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 	/* list of cluster that are fragmented or contended */
-	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
+	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 96d8012b003c..a19ee8d5ffd0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -260,12 +260,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	folio_ref_sub(folio, nr_pages);
 	folio_set_dirty(folio);
 
-	spin_lock(&si->lock);
 	/* Only simple page folio can be backed by zswap */
 	if (nr_pages == 1)
 		zswap_invalidate(entry);
 	swap_entry_range_free(si, entry, nr_pages);
-	spin_unlock(&si->lock);
 	ret = nr_pages;
 out_unlock:
 	folio_unlock(folio);
@@ -402,7 +400,21 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags == CLUSTER_FLAG_FREE;
+	return info->count == 0;
+}
+
+static inline bool cluster_is_discard(struct swap_cluster_info *info)
+{
+	return info->flags == CLUSTER_FLAG_DISCARD;
+}
+
+static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
+{
+	if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
+		return false;
+	if (!order)
+		return true;
+	return cluster_is_free(ci) || order == ci->order;
 }
 
 static inline unsigned int cluster_index(struct swap_info_struct *si,
@@ -439,19 +451,20 @@ static void cluster_move(struct swap_info_struct *si,
 {
 	VM_WARN_ON(ci->flags == new_flags);
 	BUILD_BUG_ON(1
 		     << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
+	lockdep_assert_held(&ci->lock);
 
-	if (ci->flags == CLUSTER_FLAG_NONE) {
+	spin_lock(&si->lock);
+	if (ci->flags == CLUSTER_FLAG_NONE)
 		list_add_tail(&ci->list, list);
-	} else {
-		if (ci->flags == CLUSTER_FLAG_FRAG) {
-			VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
-			si->frag_cluster_nr[ci->order]--;
-		}
+	else
 		list_move_tail(&ci->list, list);
-	}
+	spin_unlock(&si->lock);
+
+	if (ci->flags == CLUSTER_FLAG_FRAG)
+		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
+	else if (new_flags == CLUSTER_FLAG_FRAG)
+		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
 	ci->flags = new_flags;
-	if (new_flags == CLUSTER_FLAG_FRAG)
-		si->frag_cluster_nr[ci->order]++;
 }
 
 /* Add a cluster to discard list and schedule it to do discard */
@@ -474,39 +487,82 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
 	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
 }
 
+/*
+ * Isolate and lock the first cluster that is not contended on a list,
+ * clean its flag before taking it off-list. Cluster flags must be in sync
+ * with list status, so cluster updaters can always know the cluster
+ * list status without touching the si lock.
+ *
+ * Note it's possible that all clusters on a list are contended so
+ * this returns NULL for a non-empty list.
+ */
+static struct swap_cluster_info *cluster_isolate_lock(
+		struct swap_info_struct *si, struct list_head *list)
+{
+	struct swap_cluster_info *ci, *ret = NULL;
+
+	spin_lock(&si->lock);
+	list_for_each_entry(ci, list, list) {
+		if (!spin_trylock(&ci->lock))
+			continue;
+
+		/* We may only isolate and clear flags of following lists */
+		VM_BUG_ON(!ci->flags);
+		VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
+			  ci->flags != CLUSTER_FLAG_FULL);
+
+		list_del(&ci->list);
+		ci->flags = CLUSTER_FLAG_NONE;
+		ret = ci;
+		break;
+	}
+	spin_unlock(&si->lock);
+
+	return ret;
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
- * will be added to free cluster list. caller should hold si->lock.
-*/
-static void swap_do_scheduled_discard(struct swap_info_struct *si)
+ * will be added to free cluster list. Discard clusters are a bit special as
+ * they don't participate in allocation or reclaim, so clusters marked as
+ * CLUSTER_FLAG_DISCARD must remain off-list or on the discard list.
+ */
+static bool swap_do_scheduled_discard(struct swap_info_struct *si)
 {
 	struct swap_cluster_info *ci;
+	bool ret = false;
 	unsigned int idx;
 
+	spin_lock(&si->lock);
 	while (!list_empty(&si->discard_clusters)) {
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
+		/*
+		 * Delete the cluster from list but don't clear the flag until
+		 * discard is done, so isolation and relocation will skip it.
+		 */
 		list_del(&ci->list);
-		/* Must clear flag when taking a cluster off-list */
-		ci->flags = CLUSTER_FLAG_NONE;
 		idx = cluster_index(si, ci);
 		spin_unlock(&si->lock);
-
 		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER);
 
-		spin_lock(&si->lock);
 		spin_lock(&ci->lock);
-		__free_cluster(si, ci);
+		/* Discard is done, return to list and clear the flag */
+		ci->flags = CLUSTER_FLAG_NONE;
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER);
+		__free_cluster(si, ci);
 		spin_unlock(&ci->lock);
+		ret = true;
+		spin_lock(&si->lock);
 	}
+	spin_unlock(&si->lock);
+	return ret;
 }
 
 static void swap_discard_work(struct work_struct *work)
@@ -515,9 +571,7 @@ static void swap_discard_work(struct work_struct *work)
 
 	si = container_of(work, struct swap_info_struct, discard_work);
 
-	spin_lock(&si->lock);
 	swap_do_scheduled_discard(si);
-	spin_unlock(&si->lock);
 }
 
 static void swap_users_ref_free(struct percpu_ref *ref)
@@ -528,10 +582,14 @@ static void swap_users_ref_free(struct percpu_ref *ref)
 	complete(&si->comp);
 }
 
+/*
+ * Must be called after freeing if ci->count == 0, puts the cluster to free
+ * or discard list.
+ */
 static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
 	VM_BUG_ON(ci->count != 0);
-	lockdep_assert_held(&si->lock);
+	VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE);
 	lockdep_assert_held(&ci->lock);
 
 	/*
@@ -548,6 +606,48 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
 	 */
 	__free_cluster(si, ci);
 }
 
+/*
+ * Must be called after freeing if ci->count != 0, puts the cluster to free
+ * or nonfull list.
+ */
+static void partial_free_cluster(struct swap_info_struct *si,
+				 struct swap_cluster_info *ci)
+{
+	VM_BUG_ON(!ci->count || ci->count == SWAPFILE_CLUSTER);
+	lockdep_assert_held(&ci->lock);
+
+	if (ci->flags != CLUSTER_FLAG_NONFULL)
+		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
+			     CLUSTER_FLAG_NONFULL);
+}
+
+/*
+ * Must be called after allocation, puts the cluster on the full or frag list.
+ * Note: allocations don't need the si lock, and may drop the ci lock for
+ * reclaim, so the cluster could end up anywhere before re-acquiring ci lock.
+ */
+static void relocate_cluster(struct swap_info_struct *si,
+			     struct swap_cluster_info *ci)
+{
+	lockdep_assert_held(&ci->lock);
+
+	/* Discard cluster must remain off-list or on discard list */
+	if (cluster_is_discard(ci))
+		return;
+
+	if (!ci->count) {
+		free_cluster(si, ci);
+	} else if (ci->count != SWAPFILE_CLUSTER) {
+		if (ci->flags != CLUSTER_FLAG_FRAG)
+			cluster_move(si, ci, &si->frag_clusters[ci->order],
+				     CLUSTER_FLAG_FRAG);
+	} else {
+		if (ci->flags != CLUSTER_FLAG_FULL)
+			cluster_move(si, ci, &si->full_clusters,
+				     CLUSTER_FLAG_FULL);
+	}
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will not be
  * added to free cluster list and its usage counter will be increased by 1.
@@ -566,30 +666,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 	VM_BUG_ON(ci->flags);
 }
 
-/*
- * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
- * which means no page in the cluster is in use, we can optionally discard
- * the cluster and add it to free cluster list.
- */
-static void dec_cluster_info_page(struct swap_info_struct *si,
-				  struct swap_cluster_info *ci, int nr_pages)
-{
-	VM_BUG_ON(ci->count < nr_pages);
-	VM_BUG_ON(cluster_is_free(ci));
-	lockdep_assert_held(&si->lock);
-	lockdep_assert_held(&ci->lock);
-	ci->count -= nr_pages;
-
-	if (!ci->count) {
-		free_cluster(si, ci);
-		return;
-	}
-
-	if (ci->flags != CLUSTER_FLAG_NONFULL)
-		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
-			     CLUSTER_FLAG_NONFULL);
-}
-
 static bool cluster_reclaim_range(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci,
 				  unsigned long start, unsigned long end)
@@ -599,8 +675,6 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
-	spin_unlock(&si->lock);
-
 	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
@@ -618,9 +692,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 		}
 	} while (offset < end);
 out:
-	spin_lock(&si->lock);
 	spin_lock(&ci->lock);
-
 	/*
 	 * Recheck the range no matter reclaim succeeded or not, the slot
 	 * could have been freed while we are not holding the lock.
@@ -634,11 +706,11 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 
 static bool cluster_scan_range(struct swap_info_struct *si,
 			       struct swap_cluster_info *ci,
-			       unsigned long start, unsigned int nr_pages)
+			       unsigned long start, unsigned int nr_pages,
+			       bool *need_reclaim)
 {
 	unsigned long offset, end = start + nr_pages;
 	unsigned char *map = si->swap_map;
-	bool need_reclaim = false;
 
 	for (offset = start; offset < end; offset++) {
 		switch (READ_ONCE(map[offset])) {
@@ -647,16 +719,13 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 		case SWAP_HAS_CACHE:
 			if (!vm_swap_full())
 				return false;
-			need_reclaim = true;
+			*need_reclaim = true;
 			continue;
 		default:
 			return false;
 		}
 	}
 
-	if (need_reclaim)
-		return cluster_reclaim_range(si, ci, start, end);
-
 	return true;
 }
 
@@ -666,23 +735,12 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 {
 	unsigned int nr_pages = 1 << order;
 
-	VM_BUG_ON(ci->flags != CLUSTER_FLAG_FREE &&
-		  ci->flags != CLUSTER_FLAG_NONFULL &&
-		  ci->flags != CLUSTER_FLAG_FRAG);
-
-	if (cluster_is_free(ci)) {
-		if (nr_pages < SWAPFILE_CLUSTER)
-			cluster_move(si, ci, &si->nonfull_clusters[order],
-				     CLUSTER_FLAG_NONFULL);
+	if (cluster_is_free(ci))
 		ci->order = order;
-	}
 
 	memset(si->swap_map + start, usage, nr_pages);
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
-
-	if (ci->count == SWAPFILE_CLUSTER)
-		cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL);
 }
 
 static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
@@ -692,34 +750,52 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 	unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1);
 	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
 	unsigned int nr_pages = 1 << order;
+	bool need_reclaim, ret;
 	struct swap_cluster_info *ci;
 
-	if (end < nr_pages)
-		return SWAP_NEXT_INVALID;
-	end -= nr_pages;
+	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	lockdep_assert_held(&ci->lock);
 
-	ci = lock_cluster(si, offset);
-	if (ci->count + nr_pages > SWAPFILE_CLUSTER) {
+	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) {
 		offset = SWAP_NEXT_INVALID;
-		goto done;
+		goto out;
 	}
 
-	while (offset <= end) {
-		if (cluster_scan_range(si, ci, offset, nr_pages)) {
-			cluster_alloc_range(si, ci, offset, usage, order);
-			*foundp = offset;
-			if (ci->count == SWAPFILE_CLUSTER) {
+	for (end -= nr_pages; offset <= end; offset += nr_pages) {
+		need_reclaim = false;
+		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
+			continue;
+		if (need_reclaim) {
+			ret = cluster_reclaim_range(si, ci, start, end);
+			/*
+			 * Reclaim drops ci->lock and cluster could be used
+			 * by another order. Not checking flag as off-list
+			 * cluster has no flag set, and change of list
+			 * won't cause fragmentation.
+			 */
+			if (!cluster_is_usable(ci, order)) {
 				offset = SWAP_NEXT_INVALID;
-				goto done;
+				goto out;
 			}
-			offset += nr_pages;
-			break;
+			if (cluster_is_free(ci))
+				offset = start;
+			/* Reclaim failed but cluster is usable, try next */
+			if (!ret)
+				continue;
+		}
+		cluster_alloc_range(si, ci, offset, usage, order);
+		*foundp = offset;
+		if (ci->count == SWAPFILE_CLUSTER) {
+			offset = SWAP_NEXT_INVALID;
+			goto out;
 		}
 		offset += nr_pages;
+		break;
 	}
 	if (offset > end)
 		offset = SWAP_NEXT_INVALID;
-done:
+out:
+	relocate_cluster(si, ci);
 	unlock_cluster(ci);
 	return offset;
 }
@@ -736,18 +812,17 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 	if (force)
 		to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
 
-	while (!list_empty(&si->full_clusters)) {
-		ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
-		list_move_tail(&ci->list, &si->full_clusters);
+	while ((ci = cluster_isolate_lock(si, &si->full_clusters))) {
 		offset = cluster_offset(si, ci);
 		end = min(si->max, offset + SWAPFILE_CLUSTER);
 		to_scan--;
 
-		spin_unlock(&si->lock);
 		while (offset < end) {
 			if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+				spin_unlock(&ci->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
 								   TTRS_ANYWAY | TTRS_DIRECT);
+				spin_lock(&ci->lock);
 				if (nr_reclaim) {
 					offset += abs(nr_reclaim);
 					continue;
@@ -755,8 +830,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 			}
 			offset++;
 		}
-		spin_lock(&si->lock);
 
+		unlock_cluster(ci);
 		if (to_scan <= 0)
 			break;
 	}
@@ -768,9 +843,7 @@ static void swap_reclaim_work(struct work_struct *work)
 
 	si = container_of(work, struct swap_info_struct, reclaim_work);
 
-	spin_lock(&si->lock);
 	swap_reclaim_full_clusters(si, true);
-	spin_unlock(&si->lock);
 }
 
 /*
@@ -781,23 +854,36 @@ static void swap_reclaim_work(struct work_struct *work)
 static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 					      unsigned char usage)
 {
-	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
-new_cluster:
-	lockdep_assert_held(&si->lock);
-	cluster = this_cpu_ptr(si->percpu_cluster);
-	offset = cluster->next[order];
+	/* Fast path using per CPU cluster */
+	local_lock(&si->percpu_cluster->lock);
+	offset = __this_cpu_read(si->percpu_cluster->next[order]);
 	if (offset) {
-		offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
+		ci = lock_cluster(si, offset);
+		/* Cluster could have been used by another order */
+		if (cluster_is_usable(ci, order)) {
+			if (cluster_is_free(ci))
+				offset = cluster_offset(si, ci);
+			offset = alloc_swap_scan_cluster(si, offset, &found,
+							 order, usage);
+		} else {
+			unlock_cluster(ci);
+		}
 		if (found)
 			goto done;
 	}
 
-	if (!list_empty(&si->free_clusters)) {
-		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
-		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
+new_cluster:
+	ci = cluster_isolate_lock(si, &si->free_clusters);
+	if (ci) {
+		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+						 &found, order, usage);
+		/*
+		 * Allocation from free cluster must never fail and
+		 * cluster lock must remain untouched.
+		 */
 		VM_BUG_ON(!found);
 		goto done;
 	}
@@ -807,49 +893,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		swap_reclaim_full_clusters(si, false);
 
 	if (order < PMD_ORDER) {
-		unsigned int frags = 0;
+		unsigned int frags = 0, frags_existing;
 
-		while (!list_empty(&si->nonfull_clusters[order])) {
-			ci = list_first_entry(&si->nonfull_clusters[order],
-					      struct swap_cluster_info, list);
-			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
+		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
 							 &found, order, usage);
-			frags++;
+			/*
+			 * With `fragmenting` set to true, it will surely take
+			 * the cluster off nonfull list
+			 */
 			if (found)
 				goto done;
+			frags++;
 		}
 
-		/*
-		 * Nonfull clusters are moved to frag tail if we reached
-		 * here, count them too, don't over scan the frag list.
-		 */
-		while (frags < si->frag_cluster_nr[order]) {
-			ci = list_first_entry(&si->frag_clusters[order],
-					      struct swap_cluster_info, list);
+		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
+		while (frags < frags_existing &&
+		       (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
+			atomic_long_dec(&si->frag_cluster_nr[order]);
 			/*
-			 * Rotate the frag list to iterate, they were all failing
-			 * high order allocation or moved here due to per-CPU usage,
-			 * this help keeping usable cluster ahead.
+			 * Rotate the frag list to iterate, they were all
+			 * failing high order allocation or moved here due to
+			 * per-CPU usage, but either way they could contain
+			 * usable (e.g. lazy-freed swap cache) slots.
+			 */
-			list_move_tail(&ci->list, &si->frag_clusters[order]);
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
 							 &found, order, usage);
-			frags++;
 			if (found)
 				goto done;
+			frags++;
 		}
 	}
 
-	if (!list_empty(&si->discard_clusters)) {
-		/*
-		 * we don't have free cluster but have some clusters in
-		 * discarding, do discard now and reclaim them, then
-		 * reread cluster_next_cpu since we dropped si->lock
-		 */
-		swap_do_scheduled_discard(si);
+	/*
+	 * We don't have free cluster but have some clusters in
+	 * discarding, do discard now and reclaim them, then
+	 * reread cluster_next_cpu since we dropped si->lock
+	 */
+	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
 		goto new_cluster;
-	}
 
 	if (order)
 		goto done;
@@ -860,26 +942,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	 * Clusters here have at least one usable slot and can't fail order 0
 	 * allocation, but reclaim may drop si->lock and race with another user.
 	 */
-	while (!list_empty(&si->frag_clusters[o])) {
-		ci = list_first_entry(&si->frag_clusters[o],
-				      struct swap_cluster_info, list);
+	while ((ci = cluster_isolate_lock(si, &si->frag_clusters[o]))) {
+		atomic_long_dec(&si->frag_cluster_nr[o]);
 		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-						 &found, 0, usage);
+						 &found, order, usage);
 		if (found)
 			goto done;
 	}
 
-	while (!list_empty(&si->nonfull_clusters[o])) {
-		ci = list_first_entry(&si->nonfull_clusters[o],
-				      struct swap_cluster_info, list);
+	while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[o]))) {
 		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-						 &found, 0, usage);
+						 &found, order, usage);
 		if (found)
 			goto done;
 	}
 	}
done:
-	cluster->next[order] = offset;
+	__this_cpu_write(si->percpu_cluster->next[order], offset);
+	local_unlock(&si->percpu_cluster->lock);
+
 	return found;
 }
 
@@ -1135,14 +1216,11 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 			plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 			spin_unlock(&swap_avail_lock);
 			if (get_swap_device_info(si)) {
-				spin_lock(&si->lock);
 				n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 							    n_goal, swp_entries, order);
-				spin_unlock(&si->lock);
 				put_swap_device(si);
 				if (n_ret || size > 1)
 					goto check_out;
-				cond_resched();
 			}
 
 			spin_lock(&swap_avail_lock);
@@ -1355,9 +1433,7 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	if (!has_cache) {
 		for (i = 0; i < nr; i++)
 			zswap_invalidate(swp_entry(si->type, offset + i));
-		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, nr);
-		spin_unlock(&si->lock);
 	}
 	return has_cache;
 
@@ -1386,16 +1462,27 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 	unsigned char *map_end = map + nr_pages;
 	struct swap_cluster_info *ci;
 
+	/* It should never free entries across different clusters */
+	VM_BUG_ON((offset / SWAPFILE_CLUSTER) != ((offset + nr_pages - 1) / SWAPFILE_CLUSTER));
+
 	ci = lock_cluster(si, offset);
+	VM_BUG_ON(cluster_is_free(ci));
+	VM_BUG_ON(ci->count < nr_pages);
+
+	ci->count -= nr_pages;
 	do {
 		VM_BUG_ON(*map != SWAP_HAS_CACHE);
 		*map = 0;
 	} while (++map < map_end);
-	dec_cluster_info_page(si, ci, nr_pages);
-	unlock_cluster(ci);
 
 	mem_cgroup_uncharge_swap(entry, nr_pages);
 	swap_range_free(si, offset, nr_pages);
+
+	if (!ci->count)
+		free_cluster(si, ci);
+	else
+		partial_free_cluster(si, ci);
+	unlock_cluster(ci);
 }
 
 static void cluster_swap_free_nr(struct swap_info_struct *si,
@@ -1467,9 +1554,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	ci = lock_cluster(si, offset);
 	if (size > 1 && swap_is_has_cache(si, offset, size)) {
 		unlock_cluster(ci);
-		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, size);
-		spin_unlock(&si->lock);
 		return;
 	}
 	for (int i = 0; i < size; i++, entry.val++) {
@@ -1484,46 +1569,19 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster(ci);
 }
 
-static int swp_entry_cmp(const void *ent1, const void *ent2)
-{
-	const swp_entry_t *e1 = ent1, *e2 = ent2;
-
-	return (int)swp_type(*e1) - (int)swp_type(*e2);
-}
-
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
-	struct swap_info_struct *si, *prev;
 	int i;
+	struct swap_info_struct *si = NULL;
 
 	if (n <= 0)
 		return;
 
-	prev = NULL;
-	si = NULL;
-
-	/*
-	 * Sort swap entries by swap device, so each lock is only taken once.
-	 * nr_swapfiles isn't absolutely correct, but the overhead of sort() is
-	 * so low that it isn't necessary to optimize further.
-	 */
-	if (nr_swapfiles > 1)
-		sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
 	for (i = 0; i < n; ++i) {
 		si = _swap_info_get(entries[i]);
-
-		if (si != prev) {
-			if (prev != NULL)
-				spin_unlock(&prev->lock);
-			if (si != NULL)
-				spin_lock(&si->lock);
-		}
 		if (si)
 			swap_entry_range_free(si, entries[i], 1);
-		prev = si;
 	}
-	if (si)
-		spin_unlock(&si->lock);
 }
 
 int __swap_count(swp_entry_t entry)
@@ -1775,13 +1833,8 @@ swp_entry_t get_swap_page_of_type(int type)
 		goto fail;
 
 	/* This is called for allocating swap entry, not cache */
-	if (get_swap_device_info(si)) {
-		spin_lock(&si->lock);
-		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-			atomic_long_dec(&nr_swap_pages);
-		spin_unlock(&si->lock);
-		put_swap_device(si);
-	}
+	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
+		atomic_long_dec(&nr_swap_pages);
 fail:
 	return entry;
 }
@@ -3098,6 +3151,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
 			cluster->next[i] = SWAP_NEXT_INVALID;
+		local_lock_init(&cluster->lock);
 	}
 
 	/*
@@ -3121,7 +3175,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
 		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		si->frag_cluster_nr[i] = 0;
+		atomic_long_set(&si->frag_cluster_nr[i], 0);
 	}
 
 	/*
@@ -3603,7 +3657,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 		 */
 		goto outer;
 	}
-	spin_lock(&si->lock);
 
 	offset = swp_offset(entry);
 
@@ -3668,7 +3721,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	spin_unlock(&si->cont_lock);
 out:
 	unlock_cluster(ci);
-	spin_unlock(&si->lock);
 	put_swap_device(si);
 outer:
 	if (page)
-- 
2.47.0