From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, hughd@google.com
Cc: willy@infradead.org, david@redhat.com, lorenzo.stoakes@oracle.com,
	ziy@nvidia.com, Liam.Howlett@oracle.com, npache@redhat.com,
	ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
	baolin.wang@linux.alibaba.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
Date: Wed, 30 Jul 2025 16:14:55 +0800
Message-ID: <701271092af74c2d969b195321c2c22e15e3c694.1753863013.git.baolin.wang@linux.alibaba.com>
X-Mailer: git-send-email 2.43.5

After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
tmpfs was extended to allow large folios of any size, rather than just
PMD-sized large folios. The strategy discussed previously was:

"
Considering that tmpfs already has the 'huge=' option to control
PMD-sized large folio allocation, we can extend the 'huge=' option to
allow large folios of any size. The semantics of the 'huge=' mount
option then become:

huge=never: no large folios of any size
huge=always: large folios of any size
huge=within_size: like 'always', but respect i_size
huge=advise: like 'always' if requested with madvise()

Note: for tmpfs mmap() faults, due to the lack of a write size hint, we
still allocate PMD-sized huge folios if huge=always/within_size/advise
is set.
Moreover, the 'deny' and 'force' testing options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
semantics: 'deny' disables large folios of any size for tmpfs, while
'force' enables PMD-sized large folios for tmpfs.
"

This means that when tmpfs is mounted with 'huge=always' or
'huge=within_size', tmpfs derives a highest-order hint from the size seen
on the write() and fallocate() paths, and then tries each allowable large
order, rather than always attempting to allocate PMD-sized large folios
as before.

However, this can break user scenarios that rely on PMD-sized large
folios, such as the i915 driver, which does not supply a write size hint
when allocating shmem [1]. Moreover, Hugh also complained that this causes
a userspace-visible regression with 'huge=always' or 'huge=within_size'.

So, let's revisit the strategy for tmpfs large folio allocation. A simple
fix is to always try PMD-sized large folios first and, if that fails, fall
back to smaller large folios. However, this approach differs from the
large folio allocation strategy used by other file systems. Is this
acceptable?

[1] https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/

Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Note: this is just an RFC patch. I would like to hear others' opinions,
or see whether there is a better way to address Hugh's concern.
---
 Documentation/admin-guide/mm/transhuge.rst |  6 ++-
 mm/shmem.c                                 | 47 +++-------------------
 2 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 878796b4d7d3..121cbb3a72f7 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -383,12 +383,16 @@ option: ``huge=``.
 It can have following values:
 
 always
     Attempt to allocate huge pages every time we need a new page;
+    Always try PMD-sized huge pages first, and fall back to smaller-sized
+    huge pages if the PMD-sized huge page allocation fails;
 
 never
     Do not allocate huge pages;
 
 within_size
-    Only allocate huge page if it will be fully within i_size.
+    Only allocate huge page if it will be fully within i_size;
+    Always try PMD-sized huge pages first, and fall back to smaller-sized
+    huge pages if the PMD-sized huge page allocation fails;
     Also respect madvise() hints;
 
 advise
diff --git a/mm/shmem.c b/mm/shmem.c
index 75cc2cb92950..c1040a115f08 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -566,42 +566,6 @@ static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index,
 static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
 
-/**
- * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
- * @mapping: Target address_space.
- * @index: The page index.
- * @write_end: end of a write, could extend inode size.
- *
- * This returns huge orders for folios (when supported) based on the file size
- * which the mapping currently allows at the given index. The index is relevant
- * due to alignment considerations the mapping might have. The returned order
- * may be less than the size passed.
- *
- * Return: The orders.
- */
-static inline unsigned int
-shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
-{
-	unsigned int order;
-	size_t size;
-
-	if (!mapping_large_folio_support(mapping) || !write_end)
-		return 0;
-
-	/* Calculate the write size based on the write_end */
-	size = write_end - (index << PAGE_SHIFT);
-	order = filemap_get_order(size);
-	if (!order)
-		return 0;
-
-	/* If we're not aligned, allocate a smaller folio */
-	if (index & ((1UL << order) - 1))
-		order = __ffs(index);
-
-	order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
-	return order > 0 ? BIT(order + 1) - 1 : 0;
-}
-
 static unsigned int shmem_get_orders_within_size(struct inode *inode,
 				unsigned long within_size_orders, pgoff_t index,
 				loff_t write_end)
@@ -648,22 +612,21 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
 	 * For tmpfs mmap()'s huge order, we still use PMD-sized order to
 	 * allocate huge pages due to lack of a write size hint.
 	 *
-	 * Otherwise, tmpfs will allow getting a highest order hint based on
-	 * the size of write and fallocate paths, then will try each allowable
-	 * huge orders.
+	 * For tmpfs with 'huge=always' or 'huge=within_size' mount option,
+	 * we will always try PMD-sized order first. If that failed, it will
+	 * fall back to small large folios.
 	 */
 	switch (SHMEM_SB(inode->i_sb)->huge) {
 	case SHMEM_HUGE_ALWAYS:
 		if (vma)
 			return maybe_pmd_order;
 
-		return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
+		return THP_ORDERS_ALL_FILE_DEFAULT;
 	case SHMEM_HUGE_WITHIN_SIZE:
 		if (vma)
 			within_size_orders = maybe_pmd_order;
 		else
-			within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
-								       index, write_end);
+			within_size_orders = THP_ORDERS_ALL_FILE_DEFAULT;
 
 		within_size_orders = shmem_get_orders_within_size(inode, within_size_orders,
 								  index, write_end);
-- 
2.43.5