From nobody Sun Dec 14 06:15:45 2025
From: Chris Li
Date: Thu, 11 Jul 2024 00:29:05 -0700
Subject: [PATCH v4 1/3] mm: swap: swap cluster switch to double link list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20240711-swap-allocator-v4-1-0295a4d4c7aa@kernel.org>
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
In-Reply-To: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, "Huang, Ying", Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song

Previously, the swap cluster used a cluster index as a pointer to
construct a custom singly linked list type "swap_cluster_list". The
next-cluster pointer shares storage with cluster->count, which prevents
putting a non-free cluster on any list.

Change the cluster to use a standard doubly linked list instead. This
allows tracking the nonfull clusters in the follow-up patch, so it is
faster to get to a nonfull cluster of a given order. Remove the cluster
getter/setter helpers for accessing the cluster struct members; the
list operations are protected by swap_info_struct->lock.

Change the cluster code to reference a cluster with a
"struct swap_cluster_info *" rather than an index.
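As an aside, here is a minimal stand-alone sketch of the layout change
(these are not the kernel definitions themselves; spinlock_t and the
kernel's struct list_head are replaced with stand-ins so the sketch
compiles on its own, and the field sizes are only indicative):

#include <stdio.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/*
 * Before: one 24-bit field held EITHER the next free cluster's index
 * (while the cluster was free) OR the usage count (while in use), so a
 * cluster with a non-zero count could not be linked into any list.
 */
struct swap_cluster_info_old {
	int lock;			/* stand-in for spinlock_t */
	unsigned int data:24;		/* next index or usage count */
	unsigned int flags:8;
};

/*
 * After: count and list linkage coexist, so a partially used cluster
 * can stay on a list (free, discard, and later nonfull) at any time.
 */
struct swap_cluster_info_new {
	int lock;			/* stand-in for spinlock_t */
	uint16_t count;			/* "u16 count" in the patch */
	uint8_t flags;			/* "u8 flags" in the patch */
	struct list_head list;
};

int main(void)
{
	/* The per-cluster growth is roughly two pointers; the exact
	 * value is arch- and config-dependent in the real kernel. */
	printf("old: %zu bytes, new: %zu bytes\n",
	       sizeof(struct swap_cluster_info_old),
	       sizeof(struct swap_cluster_info_new));
	return 0;
}

The real definitions are in the include/linux/swap.h hunk below.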
Referencing clusters by pointer is more consistent with the list
manipulation and avoids repeatedly converting an index into a
cluster_info entry; the code is easier to understand. Remove the
"next pointer is NULL" flag (CLUSTER_FLAG_NEXT_NULL); the doubly linked
list handles the empty-list case by itself.

The "swap_cluster_info" struct is two pointers bigger, but because 512
swap entries share one cluster struct, it has very little impact on the
average memory usage per swap entry. For a 1TB swapfile, the swap
cluster data structure grows from 8MB to 24MB.

Other than the list conversion, there is no real functional change in
this patch.

Signed-off-by: Chris Li
Reported-by: Barry Song <21cnbao@gmail.com>
---
 include/linux/swap.h |  26 +++---
 mm/swapfile.c        | 225 ++++++++++++++----------------------------------
 2 files changed, 70 insertions(+), 181 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h index e473fe6cfb7a..e9be95468fc7 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -243,22 +243,21 @@ enum { * free clusters are organized into a list. We fetch an entry from the lis= t to * get a free cluster. * - * The data field stores next cluster if the cluster is free or cluster us= age - * counter otherwise. The flags field determines if a cluster is free. Thi= s is - * protected by swap_info_struct.lock. + * The flags field determines if a cluster is free. This is + * protected by cluster lock. */ struct swap_cluster_info { spinlock_t lock; /* * Protect swap_cluster_info fields - * and swap_info_struct->swap_map - * elements correspond to the swap - * cluster + * other than list, and swap_info_struct->swap_map + * elements correspond to the swap cluster. */ - unsigned int data:24; - unsigned int flags:8; + u16 count; + u8 flags; + struct list_head list; }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ + =20 /* * The first page in the swap file is the swap header, which is always mar= ked @@ -283,11 +282,6 @@ struct percpu_cluster { unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ }; =20 -struct swap_cluster_list { - struct swap_cluster_info head; - struct swap_cluster_info tail; -}; - /* * The in-memory structure used to track swap areas. */ @@ -301,7 +295,7 @@ struct swap_info_struct { unsigned char *swap_map; /* vmalloc'ed array of usage counts */ unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ - struct swap_cluster_list free_clusters; /* free clusters list */ + struct list_head free_clusters; /* free clusters list */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ @@ -332,7 +326,7 @@ struct swap_info_struct { * list. */ struct work_struct discard_work; /* discard worker */ - struct swap_cluster_list discard_clusters; /* discard clusters list */ + struct list_head discard_clusters; /* discard clusters list */ struct plist_node avail_lists[]; /* * entries in swap_avail_heads, one * entry per node.
diff --git a/mm/swapfile.c b/mm/swapfile.c index f7224bc1320c..f70d25005d2c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_str= uct *si, #endif #define LATENCY_LIMIT 256 =20 -static inline void cluster_set_flag(struct swap_cluster_info *info, - unsigned int flag) -{ - info->flags =3D flag; -} - -static inline unsigned int cluster_count(struct swap_cluster_info *info) -{ - return info->data; -} - -static inline void cluster_set_count(struct swap_cluster_info *info, - unsigned int c) -{ - info->data =3D c; -} - -static inline void cluster_set_count_flag(struct swap_cluster_info *info, - unsigned int c, unsigned int f) -{ - info->flags =3D f; - info->data =3D c; -} - -static inline unsigned int cluster_next(struct swap_cluster_info *info) -{ - return info->data; -} - -static inline void cluster_set_next(struct swap_cluster_info *info, - unsigned int n) -{ - info->data =3D n; -} - -static inline void cluster_set_next_flag(struct swap_cluster_info *info, - unsigned int n, unsigned int f) -{ - info->flags =3D f; - info->data =3D n; -} - static inline bool cluster_is_free(struct swap_cluster_info *info) { return info->flags & CLUSTER_FLAG_FREE; } =20 -static inline bool cluster_is_null(struct swap_cluster_info *info) -{ - return info->flags & CLUSTER_FLAG_NEXT_NULL; -} - -static inline void cluster_set_null(struct swap_cluster_info *info) +static inline unsigned int cluster_index(struct swap_info_struct *si, + struct swap_cluster_info *ci) { - info->flags =3D CLUSTER_FLAG_NEXT_NULL; - info->data =3D 0; + return ci - si->cluster_info; } =20 static inline struct swap_cluster_info *lock_cluster(struct swap_info_stru= ct *si, @@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct= swap_info_struct *si, spin_unlock(&si->lock); } =20 -static inline bool cluster_list_empty(struct swap_cluster_list *list) -{ - return cluster_is_null(&list->head); -} - -static inline unsigned int cluster_list_first(struct swap_cluster_list *li= st) -{ - return cluster_next(&list->head); -} - -static void cluster_list_init(struct swap_cluster_list *list) -{ - cluster_set_null(&list->head); - cluster_set_null(&list->tail); -} - -static void cluster_list_add_tail(struct swap_cluster_list *list, - struct swap_cluster_info *ci, - unsigned int idx) -{ - if (cluster_list_empty(list)) { - cluster_set_next_flag(&list->head, idx, 0); - cluster_set_next_flag(&list->tail, idx, 0); - } else { - struct swap_cluster_info *ci_tail; - unsigned int tail =3D cluster_next(&list->tail); - - /* - * Nested cluster lock, but both cluster locks are - * only acquired when we held swap_info_struct->lock - */ - ci_tail =3D ci + tail; - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING); - cluster_set_next(ci_tail, idx); - spin_unlock(&ci_tail->lock); - cluster_set_next_flag(&list->tail, idx, 0); - } -} - -static unsigned int cluster_list_del_first(struct swap_cluster_list *list, - struct swap_cluster_info *ci) -{ - unsigned int idx; - - idx =3D cluster_next(&list->head); - if (cluster_next(&list->tail) =3D=3D idx) { - cluster_set_null(&list->head); - cluster_set_null(&list->tail); - } else - cluster_set_next_flag(&list->head, - cluster_next(&ci[idx]), 0); - - return idx; -} - /* Add a cluster to discard list and schedule it to do discard */ static void swap_cluster_schedule_discard(struct swap_info_struct *si, - unsigned int idx) + struct swap_cluster_info *ci) { + unsigned int idx =3D cluster_index(si, ci); /* * If scan_swap_map_slots() can't find a 
free cluster, it will check * si->swap_map directly. To make sure the discarding cluster isn't @@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap= _info_struct *si, memset(si->swap_map + idx * SWAPFILE_CLUSTER, SWAP_MAP_BAD, SWAPFILE_CLUSTER); =20 - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); - + list_add_tail(&ci->list, &si->discard_clusters); schedule_work(&si->discard_work); } =20 -static void __free_cluster(struct swap_info_struct *si, unsigned long idx) +static void __free_cluster(struct swap_info_struct *si, struct swap_cluste= r_info *ci) { - struct swap_cluster_info *ci =3D si->cluster_info; - - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); - cluster_list_add_tail(&si->free_clusters, ci, idx); + ci->flags =3D CLUSTER_FLAG_FREE; + list_add_tail(&ci->list, &si->free_clusters); } =20 /* @@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *s= i, unsigned long idx) */ static void swap_do_scheduled_discard(struct swap_info_struct *si) { - struct swap_cluster_info *info, *ci; + struct swap_cluster_info *ci; unsigned int idx; =20 - info =3D si->cluster_info; - - while (!cluster_list_empty(&si->discard_clusters)) { - idx =3D cluster_list_del_first(&si->discard_clusters, info); + while (!list_empty(&si->discard_clusters)) { + ci =3D list_first_entry(&si->discard_clusters, struct swap_cluster_info,= list); + list_del(&ci->list); + idx =3D cluster_index(si, ci); spin_unlock(&si->lock); =20 discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER); =20 spin_lock(&si->lock); - ci =3D lock_cluster(si, idx * SWAPFILE_CLUSTER); - __free_cluster(si, idx); + + spin_lock(&ci->lock); + __free_cluster(si, ci); memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER); - unlock_cluster(ci); + spin_unlock(&ci->lock); } } =20 @@ -521,20 +418,20 @@ static void swap_users_ref_free(struct percpu_ref *re= f) complete(&si->comp); } =20 -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si= , unsigned long idx) { - struct swap_cluster_info *ci =3D si->cluster_info; + struct swap_cluster_info *ci =3D list_first_entry(&si->free_clusters, str= uct swap_cluster_info, list); =20 - VM_BUG_ON(cluster_list_first(&si->free_clusters) !=3D idx); - cluster_list_del_first(&si->free_clusters, ci); - cluster_set_count_flag(ci + idx, 0, 0); + VM_BUG_ON(cluster_index(si, ci) !=3D idx); + list_del(&ci->list); + ci->count =3D 0; + ci->flags =3D 0; + return ci; } =20 -static void free_cluster(struct swap_info_struct *si, unsigned long idx) +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_= info *ci) { - struct swap_cluster_info *ci =3D si->cluster_info + idx; - - VM_BUG_ON(cluster_count(ci) !=3D 0); + VM_BUG_ON(ci->count !=3D 0); /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately. 
The cluster will be freed @@ -542,11 +439,11 @@ static void free_cluster(struct swap_info_struct *si,= unsigned long idx) */ if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) =3D=3D (SWP_WRITEOK | SWP_PAGE_DISCARD)) { - swap_cluster_schedule_discard(si, idx); + swap_cluster_schedule_discard(si, ci); return; } =20 - __free_cluster(si, idx); + __free_cluster(si, ci); } =20 /* @@ -559,15 +456,15 @@ static void add_cluster_info_page(struct swap_info_st= ruct *p, unsigned long count) { unsigned long idx =3D page_nr / SWAPFILE_CLUSTER; + struct swap_cluster_info *ci =3D cluster_info + idx; =20 if (!cluster_info) return; - if (cluster_is_free(&cluster_info[idx])) + if (cluster_is_free(ci)) alloc_cluster(p, idx); =20 - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); - cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) + count); + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); + ci->count +=3D count; } =20 /* @@ -581,24 +478,20 @@ static void inc_cluster_info_page(struct swap_info_st= ruct *p, } =20 /* - * The cluster corresponding to page_nr decreases one usage. If the usage - * counter becomes 0, which means no page in the cluster is in using, we c= an - * optionally discard the cluster and add it to free cluster list. + * The cluster ci decreases one usage. If the usage counter becomes 0, + * which means no page in the cluster is in using, we can optionally disca= rd + * the cluster and add it to free cluster list. */ -static void dec_cluster_info_page(struct swap_info_struct *p, - struct swap_cluster_info *cluster_info, unsigned long page_nr) +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_= cluster_info *ci) { - unsigned long idx =3D page_nr / SWAPFILE_CLUSTER; - - if (!cluster_info) + if (!p->cluster_info) return; =20 - VM_BUG_ON(cluster_count(&cluster_info[idx]) =3D=3D 0); - cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) - 1); + VM_BUG_ON(ci->count =3D=3D 0); + ci->count--; =20 - if (cluster_count(&cluster_info[idx]) =3D=3D 0) - free_cluster(p, idx); + if (!ci->count) + free_cluster(p, ci); } =20 /* @@ -611,10 +504,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_s= truct *si, { struct percpu_cluster *percpu_cluster; bool conflict; - + struct swap_cluster_info *first =3D list_first_entry(&si->free_clusters, = struct swap_cluster_info, list); offset /=3D SWAPFILE_CLUSTER; - conflict =3D !cluster_list_empty(&si->free_clusters) && - offset !=3D cluster_list_first(&si->free_clusters) && + conflict =3D !list_empty(&si->free_clusters) && + offset !=3D first - si->cluster_info && cluster_is_free(&si->cluster_info[offset]); =20 if (!conflict) @@ -655,10 +548,10 @@ static bool scan_swap_map_try_ssd_cluster(struct swap= _info_struct *si, cluster =3D this_cpu_ptr(si->percpu_cluster); tmp =3D cluster->next[order]; if (tmp =3D=3D SWAP_NEXT_INVALID) { - if (!cluster_list_empty(&si->free_clusters)) { - tmp =3D cluster_next(&si->free_clusters.head) * - SWAPFILE_CLUSTER; - } else if (!cluster_list_empty(&si->discard_clusters)) { + if (!list_empty(&si->free_clusters)) { + ci =3D list_first_entry(&si->free_clusters, struct swap_cluster_info, l= ist); + tmp =3D cluster_index(si, ci) * SWAPFILE_CLUSTER; + } else if (!list_empty(&si->discard_clusters)) { /* * we don't have free cluster but have some clusters in * discarding, do discard now and reclaim them, then @@ -1070,8 +963,9 @@ static void swap_free_cluster(struct swap_info_struct = *si, unsigned long idx) =20 ci =3D lock_cluster(si, offset); 
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); - cluster_set_count_flag(ci, 0, 0); - free_cluster(si, idx); + ci->count =3D 0; + ci->flags =3D 0; + free_cluster(si, ci); unlock_cluster(ci); swap_range_free(si, offset, SWAPFILE_CLUSTER); } @@ -1344,7 +1238,7 @@ static void swap_entry_free(struct swap_info_struct *= p, swp_entry_t entry) count =3D p->swap_map[offset]; VM_BUG_ON(count !=3D SWAP_HAS_CACHE); p->swap_map[offset] =3D 0; - dec_cluster_info_page(p, p->cluster_info, offset); + dec_cluster_info_page(p, ci); unlock_cluster(ci); =20 mem_cgroup_uncharge_swap(entry, 1); @@ -3022,8 +2916,8 @@ static int setup_swap_map_and_extents(struct swap_inf= o_struct *p, =20 nr_good_pages =3D maxpages - 1; /* omit header page */ =20 - cluster_list_init(&p->free_clusters); - cluster_list_init(&p->discard_clusters); + INIT_LIST_HEAD(&p->free_clusters); + INIT_LIST_HEAD(&p->discard_clusters); =20 for (i =3D 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr =3D swap_header->info.badpages[i]; @@ -3074,14 +2968,15 @@ static int setup_swap_map_and_extents(struct swap_i= nfo_struct *p, for (k =3D 0; k < SWAP_CLUSTER_COLS; k++) { j =3D (k + col) % SWAP_CLUSTER_COLS; for (i =3D 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) { + struct swap_cluster_info *ci; idx =3D i * SWAP_CLUSTER_COLS + j; + ci =3D cluster_info + idx; if (idx >=3D nr_clusters) continue; - if (cluster_count(&cluster_info[idx])) + if (ci->count) continue; - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); - cluster_list_add_tail(&p->free_clusters, cluster_info, - idx); + ci->flags =3D CLUSTER_FLAG_FREE; + list_add_tail(&ci->list, &p->free_clusters); } } return nr_extents; --=20 2.45.2.803.g4e1b14247a-goog From nobody Sun Dec 14 06:15:45 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7A1E814A4ED for ; Thu, 11 Jul 2024 07:29:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720682961; cv=none; b=YCO63RaEoqJdSpvj9ronUXI9wAkAqYUbQ6sapa2uBLWPAFIeaoOe0OyXY8zAdbPM2pD0SLhQATbM8up7woEt7w6y2L8/fsD9Z3spgOAHQd6BYrJ2xDI1JLyR0I2fPFb5wizMW1Lmq8hMfT5wCr5FrZft64R4wwqTkvBJ6VvZHBc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720682961; c=relaxed/simple; bh=GQQhDBkbVm07F633nVzv9d4hoKaeH6pAoRTRgLjaQmQ=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=atpt6dXSVQ5WCSk1xqTayyJqXkP5i0e900N7jwPKJOSbCbrCJkox8cJjjOgFGYELArtn6hfunJq3Xk9cK6hIKWnZajQY3Pgh2OKv/PxxLdcWEhmykHp/1gSpg56PDFdm94l7gxC5qlrR+ipZQSsvIYdYJS4wzozEhQx7UExIxlA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=aeMOCOvu; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="aeMOCOvu" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 03A02C4AF0C; Thu, 11 Jul 2024 07:29:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1720682961; bh=GQQhDBkbVm07F633nVzv9d4hoKaeH6pAoRTRgLjaQmQ=; h=From:Date:Subject:References:In-Reply-To:To:Cc:From; b=aeMOCOvupqqC2Ej3StRoP1k7OPZxF8jo4beiiML+npBz7HZKH4qN/FxHXemgCWKl4 
From: Chris Li
Date: Thu, 11 Jul 2024 00:29:06 -0700
Subject: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
In-Reply-To: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, "Huang, Ying", Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song

Track the nonfull clusters as well as the empty clusters on lists. Each
order has its own nonfull cluster list, and a cluster remembers which
order it was allocated for when it is taken off the free list. When a
cluster still has free entries, it is added to the nonfull_clusters[order]
list. When the free cluster list is empty, allocation also falls back to
the nonfull list of that order.

This improves the mTHP swap allocation success rate. There are
limitations when the distribution of mTHP orders changes a lot over
time: for example, if many nonfull clusters were assigned to order A and
later the workload issues many order B allocations with very few order A
allocations, the clusters used by order A will not be reused by order B
unless they become 100% empty.

Signed-off-by: Chris Li
Reported-by: Barry Song <21cnbao@gmail.com>
---
 include/linux/swap.h |  4 ++++
 mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h index e9be95468fc7..db8d6000c116 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -254,9 +254,11 @@ struct swap_cluster_info { */ u16 count; u8 flags; + u8 order; struct list_head list; }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ =20 =20 /* @@ -296,6 +298,8 @@ struct swap_info_struct { unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ struct swap_cluster_info *cluster_info; /* cluster info.
Only for SSD */ struct list_head free_clusters; /* free clusters list */ + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; + /* list of cluster that contains at least one free slot */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ diff --git a/mm/swapfile.c b/mm/swapfile.c index f70d25005d2c..e13a33664cfa 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap= _info_struct *si, memset(si->swap_map + idx * SWAPFILE_CLUSTER, SWAP_MAP_BAD, SWAPFILE_CLUSTER); =20 - list_add_tail(&ci->list, &si->discard_clusters); + if (ci->flags) + list_move_tail(&ci->list, &si->discard_clusters); + else + list_add_tail(&ci->list, &si->discard_clusters); + ci->flags =3D 0; schedule_work(&si->discard_work); } =20 static void __free_cluster(struct swap_info_struct *si, struct swap_cluste= r_info *ci) { + if (ci->flags & CLUSTER_FLAG_NONFULL) + list_move_tail(&ci->list, &si->free_clusters); + else + list_add_tail(&ci->list, &si->free_clusters); ci->flags =3D CLUSTER_FLAG_FREE; - list_add_tail(&ci->list, &si->free_clusters); } =20 /* @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_str= uct *p, struct swap_cluste ci->count--; =20 if (!ci->count) - free_cluster(p, ci); + return free_cluster(p, ci); + + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); + ci->flags |=3D CLUSTER_FLAG_NONFULL; + } } =20 /* @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_= info_struct *si, if (tmp =3D=3D SWAP_NEXT_INVALID) { if (!list_empty(&si->free_clusters)) { ci =3D list_first_entry(&si->free_clusters, struct swap_cluster_info, l= ist); + list_del(&ci->list); + spin_lock(&ci->lock); + ci->order =3D order; + ci->flags =3D 0; + spin_unlock(&ci->lock); + tmp =3D cluster_index(si, ci) * SWAPFILE_CLUSTER; + } else if (!list_empty(&si->nonfull_clusters[order])) { + ci =3D list_first_entry(&si->nonfull_clusters[order], struct swap_clust= er_info, list); + list_del(&ci->list); + spin_lock(&ci->lock); + ci->flags =3D 0; + spin_unlock(&ci->lock); tmp =3D cluster_index(si, ci) * SWAPFILE_CLUSTER; } else if (!list_empty(&si->discard_clusters)) { /* @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *= si, unsigned long idx) ci =3D lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); ci->count =3D 0; + ci->order =3D 0; ci->flags =3D 0; free_cluster(si, ci); unlock_cluster(ci); @@ -2919,6 +2944,9 @@ static int setup_swap_map_and_extents(struct swap_inf= o_struct *p, INIT_LIST_HEAD(&p->free_clusters); INIT_LIST_HEAD(&p->discard_clusters); =20 + for (i =3D 0; i < SWAP_NR_ORDERS; i++) + INIT_LIST_HEAD(&p->nonfull_clusters[i]); + for (i =3D 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr =3D swap_header->info.badpages[i]; if (page_nr =3D=3D 0 || page_nr > swap_header->info.last_page) --=20 2.45.2.803.g4e1b14247a-goog From nobody Sun Dec 14 06:15:45 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5015614B978 for ; Thu, 11 Jul 2024 07:29:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; 
From: Chris Li
Date: Thu, 11 Jul 2024 00:29:07 -0700
Subject: [PATCH v4 3/3] RFC: mm: swap: separate SSD allocation from scan_swap_map_slots()
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20240711-swap-allocator-v4-3-0295a4d4c7aa@kernel.org>
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
In-Reply-To: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, "Huang, Ying", Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song

Previously the SSD and HDD cases shared the same swap_map scan loop in
scan_swap_map_slots(). That function is complex and its execution flow
is hard to follow. scan_swap_map_try_ssd_cluster() already does most of
the heavy lifting of locating a candidate swap range in a cluster, yet
it still has to return to scan_swap_map_slots() to check for conflicts
and then perform the allocation. When scan_swap_map_try_ssd_cluster()
fails, the code still falls back to scan_swap_map_slots() for a
brute-force scan of swap_map, which takes noticeable CPU time when the
swapfile is large and almost full.

Get rid of the cluster allocation dependency on the swap_map scan loop
in scan_swap_map_slots() and streamline the cluster allocation code
path; there are no more conflict checks. For an order-0 swap entry,
when both the free list and the order-0 nonfull list run out, the
allocator falls back to the higher-order nonfull cluster lists.

Users should see less CPU time spent searching for a free swap slot
when the swapfile is almost full.
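For illustration only, here is a condensed stand-alone model of the
fallback order described above and implemented by
cluster_alloc_swap_entry() in the diff below (this is not the kernel
code: the struct, the alloc_source() helper, and the counters are
hypothetical, and locking, the per-cpu next-offset hint, and the
PMD_ORDER guard are omitted):

#include <stdio.h>

#define NR_ORDERS 4

struct lists {
	int free_clusters;		/* clusters on the free list */
	int nonfull[NR_ORDERS];		/* clusters on each per-order nonfull list */
	int discarded;			/* clusters waiting for discard */
};

/* Report where an allocation of the given order would be served from. */
static const char *alloc_source(struct lists *l, int order)
{
	if (l->free_clusters) {
		l->free_clusters--;
		return "a free cluster";
	}
	if (l->nonfull[order]) {
		l->nonfull[order]--;
		return "a nonfull cluster of the same order";
	}
	if (l->discarded) {
		/* the kernel runs swap_do_scheduled_discard() and retries */
		l->discarded--;
		l->free_clusters++;
		return alloc_source(l, order);
	}
	if (order == 0) {
		/* order 0 may steal space from higher-order nonfull clusters */
		for (int o = 1; o < NR_ORDERS; o++) {
			if (l->nonfull[o]) {
				l->nonfull[o]--;
				return "a higher-order nonfull cluster";
			}
		}
	}
	return "nothing; the allocation fails";
}

int main(void)
{
	struct lists l = { .free_clusters = 0, .nonfull = { 0, 2, 0, 0 }, .discarded = 1 };

	printf("order 2: %s\n", alloc_source(&l, 2));	/* discard reclaim, then free */
	printf("order 0: %s\n", alloc_source(&l, 0));	/* higher-order nonfull */
	printf("order 3: %s\n", alloc_source(&l, 3));	/* fails */
	return 0;
}

The real code additionally keeps a per-cpu next-offset hint per order
and holds si->lock plus the cluster lock around each step.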
Signed-off-by: Chris Li Reported-by: Barry Song <21cnbao@gmail.com> --- mm/swapfile.c | 297 ++++++++++++++++++++++++++++++++----------------------= ---- 1 file changed, 166 insertions(+), 131 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index e13a33664cfa..b967e628ae65 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -53,6 +53,8 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); +static void swap_range_alloc(struct swap_info_struct *si, unsigned long of= fset, + unsigned int nr_entries); =20 static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; @@ -301,6 +303,12 @@ static inline unsigned int cluster_index(struct swap_i= nfo_struct *si, return ci - si->cluster_info; } =20 +static inline unsigned int cluster_offset(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + return cluster_index(si, ci) * SWAPFILE_CLUSTER; +} + static inline struct swap_cluster_info *lock_cluster(struct swap_info_stru= ct *si, unsigned long offset) { @@ -371,11 +379,15 @@ static void swap_cluster_schedule_discard(struct swap= _info_struct *si, =20 static void __free_cluster(struct swap_info_struct *si, struct swap_cluste= r_info *ci) { + VM_BUG_ON(!spin_is_locked(&si->lock)); + VM_BUG_ON(!spin_is_locked(&ci->lock)); + if (ci->flags & CLUSTER_FLAG_NONFULL) list_move_tail(&ci->list, &si->free_clusters); else list_add_tail(&ci->list, &si->free_clusters); ci->flags =3D CLUSTER_FLAG_FREE; + ci->order =3D 0; } =20 /* @@ -430,8 +442,10 @@ static struct swap_cluster_info *alloc_cluster(struct = swap_info_struct *si, unsi struct swap_cluster_info *ci =3D list_first_entry(&si->free_clusters, str= uct swap_cluster_info, list); =20 VM_BUG_ON(cluster_index(si, ci) !=3D idx); + VM_BUG_ON(!spin_is_locked(&si->lock)); + VM_BUG_ON(!spin_is_locked(&ci->lock)); + VM_BUG_ON(ci->count); list_del(&ci->list); - ci->count =3D 0; ci->flags =3D 0; return ci; } @@ -439,6 +453,8 @@ static struct swap_cluster_info *alloc_cluster(struct s= wap_info_struct *si, unsi static void free_cluster(struct swap_info_struct *si, struct swap_cluster_= info *ci) { VM_BUG_ON(ci->count !=3D 0); + VM_BUG_ON(!spin_is_locked(&si->lock)); + VM_BUG_ON(!spin_is_locked(&ci->lock)); /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately. The cluster will be freed @@ -495,52 +511,96 @@ static void dec_cluster_info_page(struct swap_info_st= ruct *p, struct swap_cluste return; =20 VM_BUG_ON(ci->count =3D=3D 0); + VM_BUG_ON(cluster_is_free(ci)); + VM_BUG_ON(!spin_is_locked(&p->lock)); + VM_BUG_ON(!spin_is_locked(&ci->lock)); ci->count--; =20 if (!ci->count) return free_cluster(p, ci); =20 if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { + VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); - ci->flags |=3D CLUSTER_FLAG_NONFULL; + ci->flags =3D CLUSTER_FLAG_NONFULL; } } =20 -/* - * It's possible scan_swap_map_slots() uses a free cluster in the middle o= f free - * cluster list. Avoiding such abuse to avoid list corruption. 
- */ -static bool -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, - unsigned long offset, int order) -{ - struct percpu_cluster *percpu_cluster; - bool conflict; - struct swap_cluster_info *first =3D list_first_entry(&si->free_clusters, = struct swap_cluster_info, list); - offset /=3D SWAPFILE_CLUSTER; - conflict =3D !list_empty(&si->free_clusters) && - offset !=3D first - si->cluster_info && - cluster_is_free(&si->cluster_info[offset]); - - if (!conflict) - return false; +static inline bool cluster_scan_range(struct swap_info_struct *si, unsigne= d int start, + unsigned int nr_pages) +{ + unsigned char *p =3D si->swap_map + start; + unsigned char *end =3D p + nr_pages; + + while (p < end) + if (*p++) + return false; =20 - percpu_cluster =3D this_cpu_ptr(si->percpu_cluster); - percpu_cluster->next[order] =3D SWAP_NEXT_INVALID; return true; } =20 -static inline bool swap_range_empty(char *swap_map, unsigned int start, - unsigned int nr_pages) + +static inline void cluster_alloc_range(struct swap_info_struct *si, struct= swap_cluster_info *ci, + unsigned int start, unsigned char usage, + unsigned int order) { - unsigned int i; + unsigned int nr_pages =3D 1 << order; =20 - for (i =3D 0; i < nr_pages; i++) { - if (swap_map[start + i]) - return false; + if (cluster_is_free(ci)) { + if (nr_pages < SWAPFILE_CLUSTER) { + list_move_tail(&ci->list, &si->nonfull_clusters[order]); + ci->flags =3D CLUSTER_FLAG_NONFULL; + } + ci->order =3D order; } =20 - return true; + memset(si->swap_map + start, usage, nr_pages); + swap_range_alloc(si, start, nr_pages); + ci->count +=3D nr_pages; + + if (ci->count =3D=3D SWAPFILE_CLUSTER) { + VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL))); + list_del(&ci->list); + ci->flags =3D 0; + } +} + +static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, u= nsigned long offset, unsigned int *foundp, + unsigned int order, unsigned char usage) +{ + unsigned long start =3D offset & ~(SWAPFILE_CLUSTER - 1); + unsigned long end =3D min(start + SWAPFILE_CLUSTER, si->max); + unsigned int nr_pages =3D 1 << order; + struct swap_cluster_info *ci; + + if (end < nr_pages) + return SWAP_NEXT_INVALID; + end -=3D nr_pages; + + ci =3D lock_cluster(si, offset); + if (ci->count + nr_pages > SWAPFILE_CLUSTER) { + offset =3D SWAP_NEXT_INVALID; + goto done; + } + + while (offset <=3D end) { + if (cluster_scan_range(si, offset, nr_pages)) { + cluster_alloc_range(si, ci, offset, usage, order); + *foundp =3D offset; + if (ci->count =3D=3D SWAPFILE_CLUSTER) { + offset =3D SWAP_NEXT_INVALID; + goto done; + } + offset +=3D nr_pages; + break; + } + offset +=3D nr_pages; + } + if (offset > end) + offset =3D SWAP_NEXT_INVALID; +done: + unlock_cluster(ci); + return offset; } =20 /* @@ -548,71 +608,63 @@ static inline bool swap_range_empty(char *swap_map, u= nsigned int start, * pool (a cluster). This might involve allocating a new cluster for curre= nt CPU * too. 
*/ -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, - unsigned long *offset, unsigned long *scan_base, int order) +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,= int order, unsigned char usage) { - unsigned int nr_pages =3D 1 << order; struct percpu_cluster *cluster; - struct swap_cluster_info *ci; - unsigned int tmp, max; + struct swap_cluster_info *ci, *n; + unsigned int offset, found =3D 0; =20 new_cluster: + VM_BUG_ON(!spin_is_locked(&si->lock)); cluster =3D this_cpu_ptr(si->percpu_cluster); - tmp =3D cluster->next[order]; - if (tmp =3D=3D SWAP_NEXT_INVALID) { - if (!list_empty(&si->free_clusters)) { - ci =3D list_first_entry(&si->free_clusters, struct swap_cluster_info, l= ist); - list_del(&ci->list); - spin_lock(&ci->lock); - ci->order =3D order; - ci->flags =3D 0; - spin_unlock(&ci->lock); - tmp =3D cluster_index(si, ci) * SWAPFILE_CLUSTER; - } else if (!list_empty(&si->nonfull_clusters[order])) { - ci =3D list_first_entry(&si->nonfull_clusters[order], struct swap_clust= er_info, list); - list_del(&ci->list); - spin_lock(&ci->lock); - ci->flags =3D 0; - spin_unlock(&ci->lock); - tmp =3D cluster_index(si, ci) * SWAPFILE_CLUSTER; - } else if (!list_empty(&si->discard_clusters)) { - /* - * we don't have free cluster but have some clusters in - * discarding, do discard now and reclaim them, then - * reread cluster_next_cpu since we dropped si->lock - */ - swap_do_scheduled_discard(si); - *scan_base =3D this_cpu_read(*si->cluster_next_cpu); - *offset =3D *scan_base; - goto new_cluster; - } else - return false; + offset =3D cluster->next[order]; + if (offset) { + offset =3D alloc_swap_scan_cluster(si, offset, &found, order, usage); + if (found) + goto done; } =20 - /* - * Other CPUs can use our cluster if they can't find a free cluster, - * check if there is still free entry in the cluster, maintaining - * natural alignment. - */ - max =3D min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER)); - if (tmp < max) { - ci =3D lock_cluster(si, tmp); - while (tmp < max) { - if (swap_range_empty(si->swap_map, tmp, nr_pages)) - break; - tmp +=3D nr_pages; + list_for_each_entry_safe(ci, n, &si->free_clusters, list) { + offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, o= rder, usage); + if (found) + goto done; + VM_BUG_ON(1); + } + + if (order < PMD_ORDER) { + list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) { + offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, = order, usage); + if (found) + goto done; } - unlock_cluster(ci); } - if (tmp >=3D max) { - cluster->next[order] =3D SWAP_NEXT_INVALID; + + if (!list_empty(&si->discard_clusters)) { + /* + * we don't have free cluster but have some clusters in + * discarding, do discard now and reclaim them, then + * reread cluster_next_cpu since we dropped si->lock + */ + swap_do_scheduled_discard(si); goto new_cluster; } - *offset =3D tmp; - *scan_base =3D tmp; - tmp +=3D nr_pages; - cluster->next[order] =3D tmp < max ? 
tmp : SWAP_NEXT_INVALID; - return true; + + if (order) + goto done; + + for (int o =3D order + 1; o < SWAP_NR_ORDERS; o++) { + struct swap_cluster_info *ci, *n; + + list_for_each_entry_safe(ci, n, &si->nonfull_clusters[o], list) { + offset =3D alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, = order, usage); + if (found) + goto done; + } + } + +done: + cluster->next[order] =3D offset; + return found; } =20 static void __del_from_avail_list(struct swap_info_struct *p) @@ -747,11 +799,29 @@ static bool swap_offset_available_and_locked(struct s= wap_info_struct *si, return false; } =20 +static int cluster_alloc_swap(struct swap_info_struct *si, + unsigned char usage, int nr, + swp_entry_t slots[], int order) +{ + int n_ret =3D 0; + + VM_BUG_ON(!si->cluster_info); + + while (n_ret < nr) { + unsigned long offset =3D cluster_alloc_swap_entry(si, order, usage); + + if (!offset) + break; + slots[n_ret++] =3D swp_entry(si->type, offset); + } + + return n_ret; +} + static int scan_swap_map_slots(struct swap_info_struct *si, unsigned char usage, int nr, swp_entry_t slots[], int order) { - struct swap_cluster_info *ci; unsigned long offset; unsigned long scan_base; unsigned long last_in_cluster =3D 0; @@ -790,26 +860,16 @@ static int scan_swap_map_slots(struct swap_info_struc= t *si, return 0; } =20 + if (si->cluster_info) + return cluster_alloc_swap(si, usage, nr, slots, order); + si->flags +=3D SWP_SCANNING; - /* - * Use percpu scan base for SSD to reduce lock contention on - * cluster and swap cache. For HDD, sequential access is more - * important. - */ - if (si->flags & SWP_SOLIDSTATE) - scan_base =3D this_cpu_read(*si->cluster_next_cpu); - else - scan_base =3D si->cluster_next; + + /* For HDD, sequential access is more important. */ + scan_base =3D si->cluster_next; offset =3D scan_base; =20 - /* SSD algorithm */ - if (si->cluster_info) { - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) { - if (order > 0) - goto no_page; - goto scan; - } - } else if (unlikely(!si->cluster_nr--)) { + if (unlikely(!si->cluster_nr--)) { if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { si->cluster_nr =3D SWAPFILE_CLUSTER - 1; goto checks; @@ -820,8 +880,6 @@ static int scan_swap_map_slots(struct swap_info_struct = *si, /* * If seek is expensive, start searching for new cluster from * start of partition, to minimize the span of allocated swap. - * If seek is cheap, that is the SWP_SOLIDSTATE si->cluster_info - * case, just handled by scan_swap_map_try_ssd_cluster() above. */ scan_base =3D offset =3D si->lowest_bit; last_in_cluster =3D offset + SWAPFILE_CLUSTER - 1; @@ -849,19 +907,6 @@ static int scan_swap_map_slots(struct swap_info_struct= *si, } =20 checks: - if (si->cluster_info) { - while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) { - /* take a break if we already got some slots */ - if (n_ret) - goto done; - if (!scan_swap_map_try_ssd_cluster(si, &offset, - &scan_base, order)) { - if (order > 0) - goto no_page; - goto scan; - } - } - } if (!(si->flags & SWP_WRITEOK)) goto no_page; if (!si->highest_bit) @@ -869,11 +914,9 @@ static int scan_swap_map_slots(struct swap_info_struct= *si, if (offset > si->highest_bit) scan_base =3D offset =3D si->lowest_bit; =20 - ci =3D lock_cluster(si, offset); /* reuse swap entry of cache-only swap if not busy. 
*/ if (vm_swap_full() && si->swap_map[offset] =3D=3D SWAP_HAS_CACHE) { int swap_was_freed; - unlock_cluster(ci); spin_unlock(&si->lock); swap_was_freed =3D __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); spin_lock(&si->lock); @@ -884,15 +927,12 @@ static int scan_swap_map_slots(struct swap_info_struc= t *si, } =20 if (si->swap_map[offset]) { - unlock_cluster(ci); if (!n_ret) goto scan; else goto done; } memset(si->swap_map + offset, usage, nr_pages); - add_cluster_info_page(si, si->cluster_info, offset, nr_pages); - unlock_cluster(ci); =20 swap_range_alloc(si, offset, nr_pages); slots[n_ret++] =3D swp_entry(si->type, offset); @@ -913,13 +953,7 @@ static int scan_swap_map_slots(struct swap_info_struct= *si, latency_ration =3D LATENCY_LIMIT; } =20 - /* try to get more slots in cluster */ - if (si->cluster_info) { - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) - goto checks; - if (order > 0) - goto done; - } else if (si->cluster_nr && !si->swap_map[++offset]) { + if (si->cluster_nr && !si->swap_map[++offset]) { /* non-ssd case, still more slots in cluster? */ --si->cluster_nr; goto checks; @@ -988,8 +1022,6 @@ static void swap_free_cluster(struct swap_info_struct = *si, unsigned long idx) ci =3D lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); ci->count =3D 0; - ci->order =3D 0; - ci->flags =3D 0; free_cluster(si, ci); unlock_cluster(ci); swap_range_free(si, offset, SWAPFILE_CLUSTER); @@ -3001,8 +3033,11 @@ static int setup_swap_map_and_extents(struct swap_in= fo_struct *p, ci =3D cluster_info + idx; if (idx >=3D nr_clusters) continue; - if (ci->count) + if (ci->count) { + ci->flags =3D CLUSTER_FLAG_NONFULL; + list_add_tail(&ci->list, &p->nonfull_clusters[0]); continue; + } ci->flags =3D CLUSTER_FLAG_FREE; list_add_tail(&ci->list, &p->free_clusters); } --=20 2.45.2.803.g4e1b14247a-goog