From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org, Kairui Song, stable@vger.kernel.org
Subject: [PATCH v2 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hang
Date: Fri, 20 Jun 2025 01:55:35 +0800
Message-ID: <20250619175538.15799-2-ryncsn@gmail.com>
In-Reply-To: <20250619175538.15799-1-ryncsn@gmail.com>
References: <20250619175538.15799-1-ryncsn@gmail.com>

From: Kairui Song

The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folios (if present) must be order 0
too, which turns out not to be always correct.

The problem is that shmem_split_large_entry is called before verifying
that the folio will eventually be swapped in; one possible race is:

CPU1                                   CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                       shmem_swapin_folio
                                       /* S1 is swapped in */
                                       shmem_writeout
                                       /* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */

Now any following swapin of S1 will hang: xa_get_order returns 0,
while the folio lookup returns a folio of order > 0. The check
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` then
fails on every retry, so swap-in keeps returning -EEXIST. The check
is fragile in any case.

So fix this up by allowing a larger folio to be seen in the swap
cache, and by checking that the whole shmem mapping range covered by
the swapin has the right swap value upon inserting the folio, dropping
the redundant tree walks before the insertion.

This actually improves performance, as it avoids two redundant XArray
tree walks in the hot path. The only side effect is that, in the
failure path, shmem may redundantly reallocate a few folios, causing
temporary, slight memory pressure.
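As an illustration of the insertion-time range check described above
(a minimal userspace sketch of the idea only, with a plain array
standing in for the XArray; the real code walks the tree with
xas_for_each_conflict(), as the diff below shows):

  #include <stdbool.h>
  #include <stddef.h>

  /*
   * Sketch: every mapping slot covered by the incoming folio must
   * still hold the expected swap value, increasing one step per slot.
   * Any hole or mismatch means a racer split or freed the entry, so
   * the insertion fails with -EEXIST and the swapin is retried.
   */
  static bool range_holds_expected_swap(const unsigned long *slots,
                                        size_t nr, unsigned long first)
  {
          for (size_t i = 0; i < nr; i++)
                  if (slots[i] != first + i)
                          return false;
          return true;
  }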
Worth noting: it may seem that checking the order and value before
insertion would help reduce the lock contention, but that is not true.
The swap cache layer ensures that a raced swapin either sees a swap
cache folio or fails to do the swapin (we have the SWAP_HAS_CACHE bit
even when the swap cache is bypassed), so holding the folio lock and
checking the folio flag is already good enough to avoid the lock
contention. The chance that a folio passes the swap entry value check
while the shmem mapping slot has changed should be very low.

Cc: stable@vger.kernel.org
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song
Reviewed-by: Kemeng Shi
---
 mm/shmem.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index eda35be2a8d9..4e7ef343a29b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
 				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-	long nr = folio_nr_pages(folio);
+	unsigned long nr = folio_nr_pages(folio);
+	swp_entry_t iter, swap;
+	void *entry;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
 
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
+	swap = iter = radix_to_swp_entry(expected);
 
 	do {
 		xas_lock_irq(&xas);
-		if (expected != xas_find_conflict(&xas)) {
-			xas_set_err(&xas, -EEXIST);
-			goto unlock;
+		xas_for_each_conflict(&xas, entry) {
+			/*
+			 * The range must either be empty, or filled with
+			 * expected swap entries. Shmem swap entries are never
+			 * partially freed without split of both entry and
+			 * folio, so there shouldn't be any holes.
+			 */
+			if (!expected || entry != swp_to_radix_entry(iter)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			iter.val += 1 << xas_get_order(&xas);
 		}
-		if (expected && xas_find_conflict(&xas)) {
+		if (expected && iter.val - nr != swap.val) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order != folio_order(folio)) {
+	} else if (order > folio_order(folio)) {
 		/*
 		 * Swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still store
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
 		}
+	} else if (order < folio_order(folio)) {
+		swap.val = round_down(swap.val, 1 << folio_order(folio));
 	}
 
 alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap) ||
-	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
 	}
-- 
2.50.0
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin
Date: Fri, 20 Jun 2025 01:55:36 +0800
Message-ID: <20250619175538.15799-3-ryncsn@gmail.com>
In-Reply-To: <20250619175538.15799-1-ryncsn@gmail.com>
References: <20250619175538.15799-1-ryncsn@gmail.com>

From: Kairui Song

Currently shmem calls xa_get_order to get the swap radix entry order,
which requires a full tree walk. This can easily be combined with the
swap entry value check (shmem_confirm_swap) to avoid a duplicated
lookup, which should improve performance.
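Sketch of the resulting calling convention (hypothetical caller, not a
verbatim excerpt of the diff below): shmem_confirm_swap() now returns
the entry's order on success and -1 when the slot no longer holds the
expected entry, so a single RCU walk replaces the previous xa_load()
plus xa_get_order() pair:

  order = shmem_confirm_swap(mapping, index, swap);
  if (order < 0)
          return -EEXIST;         /* entry was changed or split by a racer */
  nr_pages = 1 << order;          /* order is reused, no second tree walk */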
Signed-off-by: Kairui Song
Reviewed-by: Kemeng Shi
Reviewed-by: Baolin Wang
Reviewed-by: Dev Jain
---
 mm/shmem.c | 33 ++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 4e7ef343a29b..ce44d1da08cd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
 
 /*
  * Sometimes, before we decide whether to proceed or to fail, we must check
- * that an entry was not already brought back from swap by a racing thread.
+ * that an entry was not already brought back or split by a racing thread.
  *
  * Checking folio is not enough: by the time a swapcache folio is locked, it
  * might be reused, and again be swapcache, using the same swap as before.
+ * Returns the swap entry's order if it is still present, else returns -1.
  */
-static bool shmem_confirm_swap(struct address_space *mapping,
-			       pgoff_t index, swp_entry_t swap)
+static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index,
+			      swp_entry_t swap)
 {
-	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
+	XA_STATE(xas, &mapping->i_pages, index);
+	int ret = -1;
+	void *entry;
+
+	rcu_read_lock();
+	do {
+		entry = xas_load(&xas);
+		if (entry == swp_to_radix_entry(swap))
+			ret = xas_get_order(&xas);
+	} while (xas_retry(&xas, entry));
+	rcu_read_unlock();
+	return ret;
 }
 
 /*
@@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		return -EIO;
 
 	si = get_swap_device(swap);
-	if (!si) {
-		if (!shmem_confirm_swap(mapping, index, swap))
+	order = shmem_confirm_swap(mapping, index, swap);
+	if (unlikely(!si)) {
+		if (order < 0)
 			return -EEXIST;
 		else
 			return -EINVAL;
 	}
+	if (unlikely(order < 0)) {
+		put_swap_device(si);
+		return -EEXIST;
+	}
 
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
-	order = xa_get_order(&mapping->i_pages, index);
 	if (!folio) {
 		int nr_pages = 1 << order;
 		bool fallback_order0 = false;
@@ -2415,7 +2431,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	*foliop = folio;
 	return 0;
 failed:
-	if (!shmem_confirm_swap(mapping, index, swap))
+	if (shmem_confirm_swap(mapping, index, swap) < 0)
 		error = -EEXIST;
 	if (error == -EIO)
 		shmem_set_folio_swapin_error(inode, index, folio, swap,
@@ -2428,7 +2444,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		folio_put(folio);
 	}
 	put_swap_device(si);
-
 	return error;
 }
 
-- 
2.50.0
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 3/4] mm/shmem, swap: improve mTHP swapin process
Date: Fri, 20 Jun 2025 01:55:37 +0800
Message-ID: <20250619175538.15799-4-ryncsn@gmail.com>
In-Reply-To: <20250619175538.15799-1-ryncsn@gmail.com>
References: <20250619175538.15799-1-ryncsn@gmail.com>

From: Kairui Song

Tidy up the mTHP swapin code, reduce duplicated code, and slightly
tweak the workflow. For SWP_SYNCHRONOUS_IO devices, we should skip the
readahead and the swap cache even if the swapin falls back to order 0,
as readahead is not helpful for such devices.

Also consolidate the mTHP-related checks in one place, so they are all
wrapped by CONFIG_TRANSPARENT_HUGEPAGE and will be trimmed off by the
compiler if not needed.
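The consolidated helper takes the order as an in/out parameter; a
condensed sketch of the resulting caller logic (simplified from the
diff below, with error labels elided):

  /*
   * shmem_swapin_direct() may fall back to order 0 internally (uffd
   * armed, zswap enabled, or a conflicting cached sub folio) and
   * reports the order it actually tried back through *order, so the
   * caller can tell an mTHP -ENOMEM, which can retry order 0 through
   * the readahead path, from a genuine order-0 failure.
   */
  swap_order = order;
  folio = shmem_swapin_direct(inode, vma, index, swap, &swap_order, gfp);
  if (IS_ERR(folio) && (PTR_ERR(folio) != -ENOMEM || !swap_order))
          goto failed;    /* only an order > 0 -ENOMEM falls back */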
Signed-off-by: Kairui Song
Reviewed-by: Kemeng Shi
---
 mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
 1 file changed, 78 insertions(+), 97 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ce44d1da08cd..721f5aa68572 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1975,31 +1975,51 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
 	return ERR_PTR(error);
 }
 
-static struct folio *shmem_swap_alloc_folio(struct inode *inode,
+static struct folio *shmem_swapin_direct(struct inode *inode,
 		struct vm_area_struct *vma, pgoff_t index,
-		swp_entry_t entry, int order, gfp_t gfp)
+		swp_entry_t entry, int *order, gfp_t gfp)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	int nr_pages = 1 << *order;
 	struct folio *new;
+	pgoff_t offset;
 	void *shadow;
-	int nr_pages;
 
 	/*
 	 * We have arrived here because our zones are constrained, so don't
 	 * limit chance of success with further cpuset and node constraints.
 	 */
 	gfp &= ~GFP_CONSTRAINT_MASK;
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && order > 0) {
-		gfp_t huge_gfp = vma_thp_gfp_mask(vma);
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		if (WARN_ON_ONCE(*order))
+			return ERR_PTR(-EINVAL);
+	} else if (*order) {
+		/*
+		 * If uffd is active for the vma, we need per-page fault
+		 * fidelity to maintain the uffd semantics, then fallback
+		 * to swapin order-0 folio, as well as for zswap case.
+		 * Any existing sub folio in the swap cache also blocks
+		 * mTHP swapin.
+		 */
+		if ((vma && userfaultfd_armed(vma)) ||
+		    !zswap_never_enabled() ||
+		    non_swapcache_batch(entry, nr_pages) != nr_pages) {
+			offset = index - round_down(index, nr_pages);
+			entry = swp_entry(swp_type(entry),
+					  swp_offset(entry) + offset);
+			*order = 0;
+			nr_pages = 1;
+		} else {
+			gfp_t huge_gfp = vma_thp_gfp_mask(vma);
 
-		gfp = limit_gfp_mask(huge_gfp, gfp);
+			gfp = limit_gfp_mask(huge_gfp, gfp);
+		}
 	}
 
-	new = shmem_alloc_folio(gfp, order, info, index);
+	new = shmem_alloc_folio(gfp, *order, info, index);
 	if (!new)
 		return ERR_PTR(-ENOMEM);
 
-	nr_pages = folio_nr_pages(new);
 	if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
 					   gfp, entry)) {
 		folio_put(new);
@@ -2165,8 +2185,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	swap_free_nr(swap, nr_pages);
 }
 
-static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
-				   swp_entry_t swap, gfp_t gfp)
+/*
+ * Split an existing large swap entry. @index should point to one sub mapping
+ * slot within the entry @swap, and this sub slot will be split into order 0.
+ */
+static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
+				  swp_entry_t swap, gfp_t gfp)
 {
 	struct address_space *mapping = inode->i_mapping;
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
@@ -2226,7 +2250,6 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 		cur_order = split_order;
 		split_order = xas_try_split_min_order(split_order);
 	}
-
 unlock:
 	xas_unlock_irq(&xas);
 
@@ -2237,7 +2260,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 	if (xas_error(&xas))
 		return xas_error(&xas);
 
-	return entry_order;
+	return 0;
 }
 
 /*
@@ -2254,11 +2277,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct address_space *mapping = inode->i_mapping;
 	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	int error, nr_pages, order, swap_order;
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	bool skip_swapcache = false;
 	swp_entry_t swap;
-	int error, nr_pages, order, split_order;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
 	swap = radix_to_swp_entry(*foliop);
@@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
-		int nr_pages = 1 << order;
-		bool fallback_order0 = false;
-
 		/* Or update major stats only when swapin succeeds?? */
 		if (fault_type) {
 			*fault_type |= VM_FAULT_MAJOR;
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
-
-		/*
-		 * If uffd is active for the vma, we need per-page fault
-		 * fidelity to maintain the uffd semantics, then fallback
-		 * to swapin order-0 folio, as well as for zswap case.
-		 * Any existing sub folio in the swap cache also blocks
-		 * mTHP swapin.
-		 */
-		if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
-				  !zswap_never_enabled() ||
-				  non_swapcache_batch(swap, nr_pages) != nr_pages))
-			fallback_order0 = true;
-
-		/* Skip swapcache for synchronous device. */
-		if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
-			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+		/* Try direct mTHP swapin bypassing swap cache and readahead */
+		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+			swap_order = order;
+			folio = shmem_swapin_direct(inode, vma, index,
+						    swap, &swap_order, gfp);
 			if (!IS_ERR(folio)) {
 				skip_swapcache = true;
 				goto alloced;
 			}
-
-			/*
-			 * Fallback to swapin order-0 folio unless the swap entry
-			 * already exists.
-			 */
+			/* Fallback if order > 0 swapin failed with -ENOMEM */
 			error = PTR_ERR(folio);
 			folio = NULL;
-			if (error == -EEXIST)
+			if (error != -ENOMEM || !swap_order)
 				goto failed;
 		}
-
 		/*
-		 * Now swap device can only swap in order 0 folio, then we
-		 * should split the large swap entry stored in the pagecache
-		 * if necessary.
+		 * Try order 0 swapin using swap cache and readahead, it still
+		 * may return order > 0 folio due to raced swap cache.
 		 */
-		split_order = shmem_split_large_entry(inode, index, swap, gfp);
-		if (split_order < 0) {
-			error = split_order;
-			goto failed;
-		}
-
-		/*
-		 * If the large swap entry has already been split, it is
-		 * necessary to recalculate the new swap entry based on
-		 * the old order alignment.
-		 */
-		if (split_order > 0) {
-			pgoff_t offset = index - round_down(index, 1 << split_order);
-
-			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-		}
-
-		/* Here we actually start the io */
 		folio = shmem_swapin_cluster(swap, gfp, info, index);
 		if (!folio) {
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order > folio_order(folio)) {
-		/*
-		 * Swap readahead may swap in order 0 folios into swapcache
-		 * asynchronously, while the shmem mapping can still store
-		 * large swap entries. In such cases, we should split the
-		 * large swap entry to prevent possible data corruption.
-		 */
-		split_order = shmem_split_large_entry(inode, index, swap, gfp);
-		if (split_order < 0) {
-			folio_put(folio);
-			folio = NULL;
-			error = split_order;
-			goto failed;
-		}
-
-		/*
-		 * If the large swap entry has already been split, it is
-		 * necessary to recalculate the new swap entry based on
-		 * the old order alignment.
-		 */
-		if (split_order > 0) {
-			pgoff_t offset = index - round_down(index, 1 << split_order);
-
-			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-		}
-	} else if (order < folio_order(folio)) {
-		swap.val = round_down(swap.val, 1 << folio_order(folio));
 	}
-
 alloced:
+	/*
+	 * We need to split an existing large entry if swapin brought in a
+	 * smaller folio for various reasons.
+	 *
+	 * And worth noting there is a special case: if there is a smaller
+	 * cached folio that covers @swap, but not @index (it only covers
+	 * first few sub entries of the large entry, but @index points to
+	 * later parts), the swap cache lookup will still see this folio,
+	 * and we need to split the large entry here. Later checks will fail,
	 * as it can't satisfy the swap requirement, and we will retry
	 * the swapin from beginning.
+	 */
+	swap_order = folio_order(folio);
+	if (order > swap_order) {
+		error = shmem_split_swap_entry(inode, index, swap, gfp);
+		if (error)
+			goto failed_nolock;
+	}
+
+	index = round_down(index, 1 << swap_order);
+	swap.val = round_down(swap.val, 1 << swap_order);
+
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
 	    folio->swap.val != swap.val) {
 		error = -EEXIST;
-		goto unlock;
+		goto failed_unlock;
 	}
 	if (!folio_test_uptodate(folio)) {
 		error = -EIO;
@@ -2407,8 +2386,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		goto failed;
 	}
 
-	error = shmem_add_to_page_cache(folio, mapping,
-					round_down(index, nr_pages),
+	error = shmem_add_to_page_cache(folio, mapping, index,
 					swp_to_radix_entry(swap), gfp);
 	if (error)
 		goto failed;
@@ -2419,8 +2397,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	folio_mark_accessed(folio);
 
 	if (skip_swapcache) {
+		swapcache_clear(si, folio->swap, folio_nr_pages(folio));
 		folio->swap.val = 0;
-		swapcache_clear(si, swap, nr_pages);
 	} else {
 		delete_from_swap_cache(folio);
 	}
@@ -2436,13 +2414,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (error == -EIO)
 		shmem_set_folio_swapin_error(inode, index, folio, swap,
 					     skip_swapcache);
-unlock:
-	if (skip_swapcache)
-		swapcache_clear(si, swap, folio_nr_pages(folio));
-	if (folio) {
+failed_unlock:
+	if (folio)
 		folio_unlock(folio);
-		folio_put(folio);
+failed_nolock:
+	if (skip_swapcache) {
+		swapcache_clear(si, folio->swap, folio_nr_pages(folio));
+		folio->swap.val = 0;
 	}
+	if (folio)
+		folio_put(folio);
 	put_swap_device(si);
 	return error;
 }
-- 
2.50.0
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 4/4] mm/shmem, swap: avoid false positive swap cache lookup
Date: Fri, 20 Jun 2025 01:55:38 +0800
Message-ID: <20250619175538.15799-5-ryncsn@gmail.com>
In-Reply-To: <20250619175538.15799-1-ryncsn@gmail.com>
References: <20250619175538.15799-1-ryncsn@gmail.com>

From: Kairui Song

If a shmem read request's index points to the middle of a large swap
entry, the swap-in code currently does the swap cache lookup using the
large swap entry's starting value (the first sub swap entry of this
large entry). This leads to a false positive lookup result if only the
first few swap entries are cached but the swap entry pointed to by the
index is uncached.

Currently, shmem will split the large entry and then retry the swapin
from the beginning, which is a waste of CPU and fragile. Handle this
case correctly instead.

Also add some sanity checks to help understand the code and ensure
things won't go wrong.
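A worked example of the fixed lookup (hypothetical values; round_down
is the kernel macro): an order-2 swap entry with value 0x100 backs
indexes 64..67, and a fault at index 66 must probe the swap cache for
entry 0x102 rather than 0x100, which might hit a stale order-0 folio
covering only the first subpage:

  pgoff_t index = 66;                     /* faulting page index */
  int order = 2;                          /* large entry covers 4 slots */
  unsigned long entry_val = 0x100;        /* value stored for index 64 */
  pgoff_t offset = index - round_down(index, 1UL << order);  /* 66 - 64 = 2 */
  unsigned long swap_val = entry_val + offset;               /* 0x102 */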
Signed-off-by: Kairui Song
Reviewed-by: Kemeng Shi
---
 mm/shmem.c | 61 ++++++++++++++++++++++++----------------------------
 1 file changed, 29 insertions(+), 32 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 721f5aa68572..128b92486f2e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1977,12 +1977,12 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
 
 static struct folio *shmem_swapin_direct(struct inode *inode,
 		struct vm_area_struct *vma, pgoff_t index,
-		swp_entry_t entry, int *order, gfp_t gfp)
+		swp_entry_t index_entry, swp_entry_t swap,
+		int *order, gfp_t gfp)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	int nr_pages = 1 << *order;
 	struct folio *new;
-	pgoff_t offset;
 	void *shadow;
 
 	/*
@@ -2003,13 +2003,11 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
 		 */
 		if ((vma && userfaultfd_armed(vma)) ||
 		    !zswap_never_enabled() ||
-		    non_swapcache_batch(entry, nr_pages) != nr_pages) {
-			offset = index - round_down(index, nr_pages);
-			entry = swp_entry(swp_type(entry),
-					  swp_offset(entry) + offset);
+		    non_swapcache_batch(index_entry, nr_pages) != nr_pages) {
 			*order = 0;
 			nr_pages = 1;
 		} else {
+			swap.val = index_entry.val;
 			gfp_t huge_gfp = vma_thp_gfp_mask(vma);
 
 			gfp = limit_gfp_mask(huge_gfp, gfp);
@@ -2021,7 +2019,7 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
 		return ERR_PTR(-ENOMEM);
 
 	if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
-					   gfp, entry)) {
+					   gfp, swap)) {
 		folio_put(new);
 		return ERR_PTR(-ENOMEM);
 	}
@@ -2036,17 +2034,17 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
 	 * In this case, shmem_add_to_page_cache() will help identify the
 	 * concurrent swapin and return -EEXIST.
 	 */
-	if (swapcache_prepare(entry, nr_pages)) {
+	if (swapcache_prepare(swap, nr_pages)) {
 		folio_put(new);
 		return ERR_PTR(-EEXIST);
 	}
 
 	__folio_set_locked(new);
 	__folio_set_swapbacked(new);
-	new->swap = entry;
+	new->swap = swap;
 
-	memcg1_swapin(entry, nr_pages);
-	shadow = get_shadow_from_swap_cache(entry);
+	memcg1_swapin(swap, nr_pages);
+	shadow = get_shadow_from_swap_cache(swap);
 	if (shadow)
 		workingset_refault(new, shadow);
 	folio_add_lru(new);
@@ -2278,20 +2276,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	int error, nr_pages, order, swap_order;
+	swp_entry_t swap, index_entry;
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	bool skip_swapcache = false;
-	swp_entry_t swap;
+	pgoff_t offset;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
-	swap = radix_to_swp_entry(*foliop);
+	index_entry = radix_to_swp_entry(*foliop);
 	*foliop = NULL;
 
-	if (is_poisoned_swp_entry(swap))
+	if (is_poisoned_swp_entry(index_entry))
 		return -EIO;
 
-	si = get_swap_device(swap);
-	order = shmem_confirm_swap(mapping, index, swap);
+	si = get_swap_device(index_entry);
+	order = shmem_confirm_swap(mapping, index, index_entry);
 	if (unlikely(!si)) {
 		if (order < 0)
 			return -EEXIST;
@@ -2303,7 +2302,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		return -EEXIST;
 	}
 
-	/* Look it up and read it in.. */
+	/* @index may point to the middle of a large entry, get the real swap value first */
+	offset = index - round_down(index, 1 << order);
+	swap.val = index_entry.val + offset;
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
 		/* Or update major stats only when swapin succeeds?? */
@@ -2315,7 +2316,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		/* Try direct mTHP swapin bypassing swap cache and readahead */
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			swap_order = order;
-			folio = shmem_swapin_direct(inode, vma, index,
+			folio = shmem_swapin_direct(inode, vma, index, index_entry,
 						    swap, &swap_order, gfp);
 			if (!IS_ERR(folio)) {
 				skip_swapcache = true;
@@ -2338,28 +2339,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		}
 	}
 alloced:
+	swap_order = folio_order(folio);
+	nr_pages = folio_nr_pages(folio);
+
+	/* The swap-in should cover both @swap and @index */
+	swap.val = round_down(swap.val, nr_pages);
+	VM_WARN_ON_ONCE(swap.val > index_entry.val + offset);
+	VM_WARN_ON_ONCE(swap.val + nr_pages <= index_entry.val + offset);
+
 	/*
 	 * We need to split an existing large entry if swapin brought in a
 	 * smaller folio for various reasons.
-	 *
-	 * And worth noting there is a special case: if there is a smaller
-	 * cached folio that covers @swap, but not @index (it only covers
-	 * first few sub entries of the large entry, but @index points to
-	 * later parts), the swap cache lookup will still see this folio,
-	 * and we need to split the large entry here. Later checks will fail,
-	 * as it can't satisfy the swap requirement, and we will retry
-	 * the swapin from beginning.
 	 */
-	swap_order = folio_order(folio);
+	index = round_down(index, nr_pages);
 	if (order > swap_order) {
-		error = shmem_split_swap_entry(inode, index, swap, gfp);
+		error = shmem_split_swap_entry(inode, index, index_entry, gfp);
 		if (error)
 			goto failed_nolock;
 	}
 
-	index = round_down(index, 1 << swap_order);
-	swap.val = round_down(swap.val, 1 << swap_order);
-
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
@@ -2372,7 +2370,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		goto failed;
 	}
 	folio_wait_writeback(folio);
-	nr_pages = folio_nr_pages(folio);
 
 	/*
 	 * Some architectures may have to restore extra metadata to the
-- 
2.50.0