From: Johannes Weiner
To: Andrew Morton
Cc: "Huang, Ying", Hugh Dickins, linux-btrfs@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH] mm: swapfile: fix SSD detection with swapfile on btrfs
Date: Thu, 22 Aug 2024 13:24:58 +0200
Message-ID: <20240822112707.351844-1-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.46.0

We've been noticing a trend of significant lock contention in the swap
subsystem as core counts have been increasing in our fleet. It turns
out that our swapfiles on btrfs on flash were in fact using the old
swap code for rotational storage.

This turns out to be a detection issue in the swapon sequence: btrfs
sets si->bdev during swap activation, which currently happens *after*
swapon's SSD detection and cluster setup. Thus, none of the SSD
optimizations and cluster lock splitting are enabled for btrfs swap.

Rearrange the swapon sequence so that filesystem activation happens
*before* determining swap behavior based on the backing device.

Afterwards, the nonrotational drive is detected correctly:

- Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k
+ Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k SS

Signed-off-by: Johannes Weiner
---
 mm/swapfile.c | 165 ++++++++++++++++++++++++++------------------------
 1 file changed, 86 insertions(+), 79 deletions(-)

Changes since RFC:
o walk badpages[] instead of [0, maxpages] for faster swapon (thanks Ying!)

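To illustrate the ordering problem described in the changelog, here is a
minimal user-space model. It is not kernel code: every model_* name and
both swapon_*_order() helpers are hypothetical stand-ins. model_activate()
plays the role of a filesystem that fills in si->bdev only during swap
activation (as the changelog says btrfs does), and model_detect() mirrors
the "si->bdev && bdev_nonrot(si->bdev)" check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct block_device and swap_info_struct. */
struct model_bdev {
	bool nonrot;			/* true for flash */
};

struct model_swap_info {
	struct model_bdev *bdev;	/* NULL until swap activation */
	bool solidstate;		/* models SWP_SOLIDSTATE */
};

static struct model_bdev model_flash = { .nonrot = true };

/* Models a filesystem that sets ->bdev only when swap is activated. */
static void model_activate(struct model_swap_info *si)
{
	si->bdev = &model_flash;
}

/* Models the "si->bdev && bdev_nonrot(si->bdev)" check in swapon. */
static void model_detect(struct model_swap_info *si)
{
	si->solidstate = si->bdev && si->bdev->nonrot;
}

/* Old order: detection runs while ->bdev is still NULL. */
bool swapon_old_order(void)
{
	struct model_swap_info si = { .bdev = NULL, .solidstate = false };

	model_detect(&si);
	model_activate(&si);
	return si.solidstate;
}

/* New order: activation first, so detection sees the real device. */
bool swapon_new_order(void)
{
	struct model_swap_info si = { .bdev = NULL, .solidstate = false };

	model_activate(&si);
	model_detect(&si);
	return si.solidstate;
}
```

In this model swapon_old_order() returns false and swapon_new_order()
returns true, which corresponds to the missing and the present "SS"
suffix in the two log lines of the changelog.
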
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c1638a009113..aff73a3d0ead 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3196,29 +3196,15 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 static int setup_swap_map_and_extents(struct swap_info_struct *si,
 					union swap_header *swap_header,
 					unsigned char *swap_map,
-					struct swap_cluster_info *cluster_info,
 					unsigned long maxpages,
 					sector_t *span)
 {
-	unsigned int j, k;
 	unsigned int nr_good_pages;
+	unsigned long i;
 	int nr_extents;
-	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
-	unsigned long i, idx;
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
-	INIT_LIST_HEAD(&si->free_clusters);
-	INIT_LIST_HEAD(&si->full_clusters);
-	INIT_LIST_HEAD(&si->discard_clusters);
-
-	for (i = 0; i < SWAP_NR_ORDERS; i++) {
-		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
-		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		si->frag_cluster_nr[i] = 0;
-	}
-
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 		if (page_nr == 0 || page_nr > swap_header->info.last_page)
@@ -3226,25 +3212,11 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
 		if (page_nr < maxpages) {
 			swap_map[page_nr] = SWAP_MAP_BAD;
 			nr_good_pages--;
-			/*
-			 * Haven't marked the cluster free yet, no list
-			 * operation involved
-			 */
-			inc_cluster_info_page(si, cluster_info, page_nr);
 		}
 	}
 
-	/* Haven't marked the cluster free yet, no list operation involved */
-	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
-		inc_cluster_info_page(si, cluster_info, i);
-
 	if (nr_good_pages) {
 		swap_map[0] = SWAP_MAP_BAD;
-		/*
-		 * Not mark the cluster free yet, no list
-		 * operation involved
-		 */
-		inc_cluster_info_page(si, cluster_info, 0);
 		si->max = maxpages;
 		si->pages = nr_good_pages;
 		nr_extents = setup_swap_extents(si, span);
@@ -3257,8 +3229,70 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
 		return -EINVAL;
 	}
 
+	return nr_extents;
+}
+
+static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
+						union swap_header *swap_header,
+						unsigned long maxpages)
+{
+	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
+	unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
+	struct swap_cluster_info *cluster_info;
+	unsigned long i, j, k, idx;
+	int cpu, err = -ENOMEM;
+
+	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
 	if (!cluster_info)
-		return nr_extents;
+		goto err;
+
+	for (i = 0; i < nr_clusters; i++)
+		spin_lock_init(&cluster_info[i].lock);
+
+	si->cluster_next_cpu = alloc_percpu(unsigned int);
+	if (!si->cluster_next_cpu)
+		goto err_free;
+
+	/* Random start position to help with wear leveling */
+	for_each_possible_cpu(cpu)
+		per_cpu(*si->cluster_next_cpu, cpu) =
+		get_random_u32_inclusive(1, si->highest_bit);
+
+	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+	if (!si->percpu_cluster)
+		goto err_free;
+
+	for_each_possible_cpu(cpu) {
+		struct percpu_cluster *cluster;
+
+		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+		for (i = 0; i < SWAP_NR_ORDERS; i++)
+			cluster->next[i] = SWAP_NEXT_INVALID;
+	}
+
+	/*
+	 * Mark unusable pages as unavailable. The clusters aren't
+	 * marked free yet, so no list operations are involved yet.
+	 *
+	 * See setup_swap_map_and_extents(): header page, bad pages,
+	 * and the EOF part of the last cluster.
+	 */
+	inc_cluster_info_page(si, cluster_info, 0);
+	for (i = 0; i < swap_header->info.nr_badpages; i++)
+		inc_cluster_info_page(si, cluster_info,
+				      swap_header->info.badpages[i]);
+	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
+		inc_cluster_info_page(si, cluster_info, i);
+
+	INIT_LIST_HEAD(&si->free_clusters);
+	INIT_LIST_HEAD(&si->full_clusters);
+	INIT_LIST_HEAD(&si->discard_clusters);
+
+	for (i = 0; i < SWAP_NR_ORDERS; i++) {
+		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
+		INIT_LIST_HEAD(&si->frag_clusters[i]);
+		si->frag_cluster_nr[i] = 0;
+	}
 
 	/*
 	 * Reduce false cache line sharing between cluster_info and
@@ -3281,7 +3315,13 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
 			list_add_tail(&ci->list, &si->free_clusters);
 		}
 	}
-	return nr_extents;
+
+	return cluster_info;
+
+err_free:
+	kvfree(cluster_info);
+err:
+	return ERR_PTR(err);
 }
 
 SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
@@ -3377,6 +3417,17 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap_unlock_inode;
 	}
 
+	error = swap_cgroup_swapon(si->type, maxpages);
+	if (error)
+		goto bad_swap_unlock_inode;
+
+	nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
+						maxpages, &span);
+	if (unlikely(nr_extents < 0)) {
+		error = nr_extents;
+		goto bad_swap_unlock_inode;
+	}
+
 	if (si->bdev && bdev_stable_writes(si->bdev))
 		si->flags |= SWP_STABLE_WRITES;
 
@@ -3384,63 +3435,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		si->flags |= SWP_SYNCHRONOUS_IO;
 
 	if (si->bdev && bdev_nonrot(si->bdev)) {
-		int cpu, i;
-		unsigned long ci, nr_cluster;
-
 		si->flags |= SWP_SOLIDSTATE;
-		si->cluster_next_cpu = alloc_percpu(unsigned int);
-		if (!si->cluster_next_cpu) {
-			error = -ENOMEM;
-			goto bad_swap_unlock_inode;
-		}
-		/*
-		 * select a random position to start with to help wear leveling
-		 * SSD
-		 */
-		for_each_possible_cpu(cpu) {
-			per_cpu(*si->cluster_next_cpu, cpu) =
-				get_random_u32_inclusive(1, si->highest_bit);
-		}
-		nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 
-		cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
-					GFP_KERNEL);
-		if (!cluster_info) {
-			error = -ENOMEM;
+		cluster_info = setup_clusters(si, swap_header, maxpages);
+		if (IS_ERR(cluster_info)) {
+			error = PTR_ERR(cluster_info);
+			cluster_info = NULL;
 			goto bad_swap_unlock_inode;
 		}
-
-		for (ci = 0; ci < nr_cluster; ci++)
-			spin_lock_init(&((cluster_info + ci)->lock));
-
-		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
-		if (!si->percpu_cluster) {
-			error = -ENOMEM;
-			goto bad_swap_unlock_inode;
-		}
-		for_each_possible_cpu(cpu) {
-			struct percpu_cluster *cluster;
-
-			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
-			for (i = 0; i < SWAP_NR_ORDERS; i++)
-				cluster->next[i] = SWAP_NEXT_INVALID;
-		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
 	}
 
-	error = swap_cgroup_swapon(si->type, maxpages);
-	if (error)
-		goto bad_swap_unlock_inode;
-
-	nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
-		cluster_info, maxpages, &span);
-	if (unlikely(nr_extents < 0)) {
-		error = nr_extents;
-		goto bad_swap_unlock_inode;
-	}
-
 	if ((swap_flags & SWAP_FLAG_DISCARD) &&
 	    si->bdev && bdev_max_discard_sectors(si->bdev)) {
 		/*
-- 
2.46.0