From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, hughd@google.com
Cc: willy@infradead.org, david@redhat.com, lorenzo.stoakes@oracle.com,
	ziy@nvidia.com, Liam.Howlett@oracle.com, npache@redhat.com,
	ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
	baolin.wang@linux.alibaba.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
Date: Wed, 30 Jul 2025 16:14:55 +0800
Message-ID: <701271092af74c2d969b195321c2c22e15e3c694.1753863013.git.baolin.wang@linux.alibaba.com>
X-Mailer: git-send-email 2.43.5

After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
tmpfs was extended to allow large folios of any size, rather than just
PMD-sized large folios. The strategy discussed previously was:

"
Considering that tmpfs already has the 'huge=' option to control
PMD-sized large folio allocation, we can extend the 'huge=' option to
allow large folios of any size. The semantics of the 'huge=' mount
option then become:

huge=never: no large folios of any size
huge=always: large folios of any size
huge=within_size: like 'always', but respect i_size
huge=advise: like 'always' if requested with madvise()

Note: for tmpfs mmap() faults, due to the lack of a write size hint, we
still allocate PMD-sized huge folios if huge=always/within_size/advise
is set.
Moreover, the 'deny' and 'force' testing options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
semantics: 'deny' disables large folios of any size for tmpfs, while
'force' enables PMD-sized large folios for tmpfs.
"

This means that when tmpfs is mounted with 'huge=always' or
'huge=within_size', tmpfs derives a highest-order hint from the size seen
on the write() and fallocate() paths, and then tries each allowable large
order, rather than always attempting to allocate PMD-sized large folios
as before.

However, this can break user scenarios that rely on PMD-sized large
folios, such as the i915 driver, which does not supply a write size hint
when allocating shmem [1]. Moreover, Hugh also complained that this causes
a userspace-visible regression with 'huge=always' or 'huge=within_size'.

So, let's revisit the strategy for tmpfs large folio allocation. A simple
fix is to always try PMD-sized large folios first and, if that fails, fall
back to smaller large folios. However, this approach differs from the
large folio allocation strategy used by other file systems. Is this
acceptable?

[1] https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/

Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Note: this is just an RFC patch. I would like to hear others' opinions,
or see whether there is a better way to address Hugh's concern.
---
 Documentation/admin-guide/mm/transhuge.rst |  6 ++-
 mm/shmem.c                                 | 47 +++-------------------
 2 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 878796b4d7d3..121cbb3a72f7 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -383,12 +383,16 @@ option: ``huge=``.
 It can have following values:
 
 always
     Attempt to allocate huge pages every time we need a new page;
+    Always try PMD-sized huge pages first, and fall back to smaller-sized
+    huge pages if the PMD-sized huge page allocation fails;
 
 never
     Do not allocate huge pages;
 
 within_size
-    Only allocate huge page if it will be fully within i_size.
+    Only allocate huge page if it will be fully within i_size;
+    Always try PMD-sized huge pages first, and fall back to smaller-sized
+    huge pages if the PMD-sized huge page allocation fails;
     Also respect madvise() hints;
 
 advise
diff --git a/mm/shmem.c b/mm/shmem.c
index 75cc2cb92950..c1040a115f08 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -566,42 +566,6 @@ static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index,
 static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
 
-/**
- * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
- * @mapping: Target address_space.
- * @index: The page index.
- * @write_end: end of a write, could extend inode size.
- *
- * This returns huge orders for folios (when supported) based on the file size
- * which the mapping currently allows at the given index. The index is relevant
- * due to alignment considerations the mapping might have. The returned order
- * may be less than the size passed.
- *
- * Return: The orders.
- */
-static inline unsigned int
-shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
-{
-	unsigned int order;
-	size_t size;
-
-	if (!mapping_large_folio_support(mapping) || !write_end)
-		return 0;
-
-	/* Calculate the write size based on the write_end */
-	size = write_end - (index << PAGE_SHIFT);
-	order = filemap_get_order(size);
-	if (!order)
-		return 0;
-
-	/* If we're not aligned, allocate a smaller folio */
-	if (index & ((1UL << order) - 1))
-		order = __ffs(index);
-
-	order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
-	return order > 0 ? BIT(order + 1) - 1 : 0;
-}
-
 static unsigned int shmem_get_orders_within_size(struct inode *inode,
 				unsigned long within_size_orders, pgoff_t index,
 				loff_t write_end)
@@ -648,22 +612,21 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
 	 * For tmpfs mmap()'s huge order, we still use PMD-sized order to
 	 * allocate huge pages due to lack of a write size hint.
 	 *
-	 * Otherwise, tmpfs will allow getting a highest order hint based on
-	 * the size of write and fallocate paths, then will try each allowable
-	 * huge orders.
+	 * For tmpfs with 'huge=always' or 'huge=within_size' mount option,
+	 * we will always try PMD-sized order first. If that failed, it will
+	 * fall back to small large folios.
 	 */
 	switch (SHMEM_SB(inode->i_sb)->huge) {
 	case SHMEM_HUGE_ALWAYS:
 		if (vma)
 			return maybe_pmd_order;
 
-		return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
+		return THP_ORDERS_ALL_FILE_DEFAULT;
 	case SHMEM_HUGE_WITHIN_SIZE:
 		if (vma)
 			within_size_orders = maybe_pmd_order;
 		else
-			within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
-								       index, write_end);
+			within_size_orders = THP_ORDERS_ALL_FILE_DEFAULT;
 
 		within_size_orders = shmem_get_orders_within_size(inode, within_size_orders,
 								  index, write_end);
-- 
2.43.5