From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 12/13] mm, swap: use a global swap cluster for non-rotation devices
Date: Tue, 14 Jan 2025 01:57:31 +0800
Message-ID: <20250113175732.48099-13-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250113175732.48099-1-ryncsn@gmail.com>
References: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

Non-rotational devices (SSD / ZRAM) can tolerate fragmentation, so for
them the goal of the swap allocator is to avoid contention for clusters.
It uses a per-CPU cluster design, and each CPU uses a different cluster
as much as possible.

HDDs, however, are very sensitive to fragmentation, and contention is
trivial in comparison. Therefore, use one global cluster for rotational
devices instead. This ensures that allocations of each order are written
to the same cluster as much as possible, which helps make the I/O more
sequential.

This keeps the performance of the cluster allocator on HDDs as good as
that of the old allocator.

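The choice between per-CPU and global allocation cursors described above
can be modelled in plain userspace C. This is only a sketch of the idea,
not the kernel code: struct swap_dev_model, struct cursor, next_slot()
and the *_SKETCH constants are hypothetical names, locking is left out,
and the invalid entry is assumed to be 0.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH   4   /* hypothetical CPU count for this model */
#define NR_ORDERS_SKETCH 3   /* hypothetical number of allocation orders */
#define ENTRY_INVALID    0u  /* assumption: 0 means "no cluster cached" */

/* One "next offset to try" per allocation order. */
struct cursor {
	unsigned int next[NR_ORDERS_SKETCH];
};

/* Hypothetical stand-in for the per-device swap state. */
struct swap_dev_model {
	int solid_state;                      /* nonzero for SSD / ZRAM */
	struct cursor percpu[NR_CPUS_SKETCH]; /* one cursor per CPU */
	struct cursor global;                 /* single shared cursor */
};

/*
 * Pick the cursor slot an allocation of the given order should use:
 * per-CPU for solid-state devices (less contention), global for
 * rotational devices (all CPUs keep writing to the same cluster).
 */
static unsigned int *next_slot(struct swap_dev_model *dev, int cpu, int order)
{
	if (dev->solid_state)
		return &dev->percpu[cpu].next[order];
	return &dev->global.next[order];
}
```

In this model, a value stored through next_slot() on one CPU is visible
to every other CPU when solid_state is 0, and stays private to that CPU
when solid_state is nonzero.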
Test results after this commit compared to before this series, using
'make -j32' with tinyconfig, a 1G memcg limit, and HDD swap:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU (0avgtext+0avgdata 157284maxresident)k
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU (0avgtext+0avgdata 157260maxresident)k
2548728inputs+0outputs (235471major+4238110minor)pagefaults

Suggested-by: Chris Li
Signed-off-by: Kairui Song
---
 include/linux/swap.h |  2 ++
 mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++------------
 2 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4c1d2e69689f..b13b72645db3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,6 +318,8 @@ struct swap_info_struct {
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+	struct percpu_cluster *global_cluster; /* Use one global cluster for rotating device */
+	spinlock_t global_cluster_lock; /* Serialize usage of global cluster */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 37d540fa0310..793b2fd1a2a8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -820,7 +820,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
-	__this_cpu_write(si->percpu_cluster->next[order], next);
+	if (si->flags & SWP_SOLIDSTATE)
+		__this_cpu_write(si->percpu_cluster->next[order], next);
+	else
+		si->global_cluster->next[order] = next;
 	return found;
 }
 
@@ -881,9 +884,16 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
-	/* Fast path using per CPU cluster */
-	local_lock(&si->percpu_cluster->lock);
-	offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	if (si->flags & SWP_SOLIDSTATE) {
+		/* Fast path using per CPU cluster */
+		local_lock(&si->percpu_cluster->lock);
+		offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	} else {
+		/* Serialize HDD SWAP allocation for each device. */
+		spin_lock(&si->global_cluster_lock);
+		offset = si->global_cluster->next[order];
+	}
+
 	if (offset) {
 		ci = lock_cluster(si, offset);
 		/* Cluster could have been used by another order */
@@ -975,8 +985,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		}
 	}
 done:
-	local_unlock(&si->percpu_cluster->lock);
-
+	if (si->flags & SWP_SOLIDSTATE)
+		local_unlock(&si->percpu_cluster->lock);
+	else
+		spin_unlock(&si->global_cluster_lock);
 	return found;
 }
 
@@ -2784,6 +2796,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
+	kfree(p->global_cluster);
+	p->global_cluster = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
 	kvfree(cluster_info);
@@ -3189,17 +3203,24 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
-	if (!si->percpu_cluster)
-		goto err_free;
+	if (si->flags & SWP_SOLIDSTATE) {
+		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		if (!si->percpu_cluster)
+			goto err_free;
 
-	for_each_possible_cpu(cpu) {
-		struct percpu_cluster *cluster;
+		for_each_possible_cpu(cpu) {
+			struct percpu_cluster *cluster;
 
-		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_ENTRY_INVALID;
+			local_lock_init(&cluster->lock);
+		}
+	} else {
+		si->global_cluster = kmalloc(sizeof(*si->global_cluster), GFP_KERNEL);
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
-			cluster->next[i] = SWAP_ENTRY_INVALID;
-		local_lock_init(&cluster->lock);
+			si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
+		spin_lock_init(&si->global_cluster_lock);
 	}
 
 	/*
@@ -3473,6 +3494,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap:
 	free_percpu(si->percpu_cluster);
 	si->percpu_cluster = NULL;
+	kfree(si->global_cluster);
+	si->global_cluster = NULL;
 	inode = NULL;
 	destroy_swap_extents(si);
 	swap_cgroup_swapoff(si->type);
-- 
2.47.1
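
One observation on the else branch of setup_clusters() as posted: the
result of kmalloc() is dereferenced without a NULL check, while the
per-CPU branch does check alloc_percpu(). A minimal userspace sketch of
the guarded pattern follows, assuming malloc() in place of kmalloc();
struct global_cursor, alloc_global_cursor() and the *_SKETCH constants
are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_ORDERS_SKETCH 3   /* hypothetical number of allocation orders */
#define ENTRY_INVALID    0u  /* assumption: 0 means "no cluster cached" */

/* Hypothetical stand-in for the single global cluster cursor. */
struct global_cursor {
	unsigned int next[NR_ORDERS_SKETCH];
};

/*
 * Allocate and initialize the global cursor. Returns NULL on
 * allocation failure so the caller can unwind (in the patch's
 * context, that would presumably be the err_free path).
 */
static struct global_cursor *alloc_global_cursor(void)
{
	struct global_cursor *c = malloc(sizeof(*c));

	if (!c)
		return NULL;

	/* Mark every order as having no cached cluster yet. */
	for (int i = 0; i < NR_ORDERS_SKETCH; i++)
		c->next[i] = ENTRY_INVALID;

	return c;
}
```

The caller owns the returned pointer and releases it with free(), which
mirrors the kfree(si->global_cluster) calls added to the swapoff and
bad_swap paths.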