[PATCH v3 3/6] mm: shmem: add large folio support for tmpfs

Posted by Baolin Wang 1 year, 2 months ago
Add large folio support for the tmpfs write and fallocate paths, matching
the same high-order preference mechanism used in the iomap buffered IO
path, as implemented in __filemap_get_folio().

Add shmem_mapping_size_orders() to get a hint for the orders of the folio
based on the file size, taking the mapping requirements into account.

Traditionally, tmpfs only supported PMD-sized large folios. However, now
that other file systems support large folios of any size and anonymous
memory has been extended to support mTHP, we should not restrict tmpfs to
allocating only PMD-sized large folios, which makes it a special case.
Instead, we should allow tmpfs to allocate large folios of any size.

Considering that tmpfs already has the 'huge=' option to control PMD-sized
large folio allocation, we can extend the 'huge=' option to allow large
folios of any size. The semantics of the 'huge=' mount option are:

huge=never: no large folios of any size
huge=always: large folios of any size
huge=within_size: like 'always', but respect the i_size
huge=advise: like 'always', if requested with madvise()

Note: for tmpfs mmap() faults, due to the lack of a write size hint,
PMD-sized huge folios are still allocated if huge=always/within_size/advise
is set.

Moreover, the 'deny' and 'force' testing options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled' retain the same
semantics: 'deny' disables large folios of any size for tmpfs, while
'force' enables PMD-sized large folios for tmpfs.

Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 99 ++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 81 insertions(+), 18 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7595c3db4c1c..54eaa724c153 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -554,34 +554,100 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 
 static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 
+/**
+ * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
+ * @mapping: Target address_space.
+ * @index: The page index.
+ * @write_end: end of a write, could extend inode size.
+ *
+ * This returns huge orders for folios (when supported) based on the file size
+ * which the mapping currently allows at the given index. The index is relevant
+ * due to alignment considerations the mapping might have. The returned
+ * orders may be smaller than the order implied by the size passed.
+ *
+ * Return: The orders.
+ */
+static inline unsigned int
+shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
+{
+	unsigned int order;
+	size_t size;
+
+	if (!mapping_large_folio_support(mapping) || !write_end)
+		return 0;
+
+	/* Calculate the write size based on the write_end */
+	size = write_end - (index << PAGE_SHIFT);
+	order = filemap_get_order(size);
+	if (!order)
+		return 0;
+
+	/* If we're not aligned, allocate a smaller folio */
+	if (index & ((1UL << order) - 1))
+		order = __ffs(index);
+
+	order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
+	return order > 0 ? BIT(order + 1) - 1 : 0;
+}
+
 static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
 					      loff_t write_end, bool shmem_huge_force,
+					      struct vm_area_struct *vma,
 					      unsigned long vm_flags)
 {
+	unsigned int maybe_pmd_order = HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ?
+		0 : BIT(HPAGE_PMD_ORDER);
+	unsigned long within_size_orders;
+	unsigned int order;
+	pgoff_t aligned_index;
 	loff_t i_size;
 
-	if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
-		return 0;
 	if (!S_ISREG(inode->i_mode))
 		return 0;
 	if (shmem_huge == SHMEM_HUGE_DENY)
 		return 0;
 	if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
-		return BIT(HPAGE_PMD_ORDER);
+		return maybe_pmd_order;
 
+	/*
+	 * The huge order allocation for anon shmem is controlled through
+	 * the mTHP interface, so we still use PMD-sized huge order to
+	 * check whether global control is enabled.
+	 *
+	 * For tmpfs mmap()'s huge order, we still use PMD-sized order to
+	 * allocate huge pages due to the lack of a write size hint.
+	 *
+	 * Otherwise, tmpfs will get a highest order hint based on the size
+	 * of the write and fallocate paths, and then try each allowable
+	 * huge order.
+	 */
 	switch (SHMEM_SB(inode->i_sb)->huge) {
 	case SHMEM_HUGE_ALWAYS:
-		return BIT(HPAGE_PMD_ORDER);
+		if (vma)
+			return maybe_pmd_order;
+
+		return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
 	case SHMEM_HUGE_WITHIN_SIZE:
-		index = round_up(index + 1, HPAGE_PMD_NR);
-		i_size = max(write_end, i_size_read(inode));
-		i_size = round_up(i_size, PAGE_SIZE);
-		if (i_size >> PAGE_SHIFT >= index)
-			return BIT(HPAGE_PMD_ORDER);
+		if (vma)
+			within_size_orders = maybe_pmd_order;
+		else
+			within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
+								       index, write_end);
+
+		order = highest_order(within_size_orders);
+		while (within_size_orders) {
+			aligned_index = round_up(index + 1, 1 << order);
+			i_size = max(write_end, i_size_read(inode));
+			i_size = round_up(i_size, PAGE_SIZE);
+			if (i_size >> PAGE_SHIFT >= aligned_index)
+				return within_size_orders;
+
+			order = next_order(&within_size_orders, order);
+		}
 		fallthrough;
 	case SHMEM_HUGE_ADVISE:
 		if (vm_flags & VM_HUGEPAGE)
-			return BIT(HPAGE_PMD_ORDER);
+			return maybe_pmd_order;
 		fallthrough;
 	default:
 		return 0;
@@ -781,6 +847,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 
 static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
 					      loff_t write_end, bool shmem_huge_force,
+					      struct vm_area_struct *vma,
 					      unsigned long vm_flags)
 {
 	return 0;
@@ -1176,7 +1243,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
 			STATX_ATTR_NODUMP);
 	generic_fillattr(idmap, request_mask, inode, stat);
 
-	if (shmem_huge_global_enabled(inode, 0, 0, false, 0))
+	if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
 		stat->blksize = HPAGE_PMD_SIZE;
 
 	if (request_mask & STATX_BTIME) {
@@ -1693,14 +1760,10 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
 		return 0;
 
 	global_orders = shmem_huge_global_enabled(inode, index, write_end,
-						  shmem_huge_force, vm_flags);
-	if (!vma || !vma_is_anon_shmem(vma)) {
-		/*
-		 * For tmpfs, we now only support PMD sized THP if huge page
-		 * is enabled, otherwise fallback to order 0.
-		 */
+						  shmem_huge_force, vma, vm_flags);
+	/* Tmpfs huge pages allocation */
+	if (!vma || !vma_is_anon_shmem(vma))
 		return global_orders;
-	}
 
 	/*
 	 * Following the 'deny' semantics of the top level, force the huge
-- 
2.39.3
[REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by Ville Syrjälä 9 months, 1 week ago
On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:

Hi,

This causes a huge regression in Intel iGPU texturing performance.

I haven't had time to look at this in detail, but presumably the
problem is that we're no longer getting huge pages from our
private tmpfs mount (done in i915_gemfs_init()).

Some more details at
https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13845


-- 
Ville Syrjälä
Intel
Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by Baolin Wang 9 months, 1 week ago
Hi,

On 2025/4/30 01:44, Ville Syrjälä wrote:
> On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
> 
> Hi,
> 
> This causes a huge regression in Intel iGPU texturing performance.

Unfortunately, I don't have such platform to test it.

> 
> I haven't had time to look at this in detail, but presumably the
> problem is that we're no longer getting huge pages from our
> private tmpfs mount (done in i915_gemfs_init()).

IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
in shmem_pwrite(), which prevents tmpfs from allocating large folios.
As mentioned in the comments below, tmpfs, like other file systems that
support large folios, will get a highest order hint based on the size of
the write and fallocate paths, and then attempt each allowable huge
order.

Therefore, I think shmem_pwrite() should be changed to remove the
limitation that the write size cannot exceed PAGE_SIZE.

Something like the following code (untested):
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c 
b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index ae3343c81a64..97eefb73c5d2 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -420,6 +420,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
         struct address_space *mapping = obj->base.filp->f_mapping;
         const struct address_space_operations *aops = mapping->a_ops;
         char __user *user_data = u64_to_user_ptr(arg->data_ptr);
+       size_t chunk = mapping_max_folio_size(mapping);
         u64 remain;
         loff_t pos;
         unsigned int pg;
@@ -463,10 +464,10 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
                 void *data, *vaddr;
                 int err;
                 char __maybe_unused c;
+               size_t offset;

-               len = PAGE_SIZE - pg;
-               if (len > remain)
-                       len = remain;
+               offset = pos & (chunk - 1);
+               len = min(chunk - offset, remain);

                 /* Prefault the user page to reduce potential recursion */
                 err = __get_user(c, user_data);
Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by Ville Syrjälä 9 months, 1 week ago
On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
> Hi,
> 
> On 2025/4/30 01:44, Ville Syrjälä wrote:
> > On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
> > 
> > Hi,
> > 
> > This causes a huge regression in Intel iGPU texturing performance.
> 
> Unfortunately, I don't have such platform to test it.
> 
> > 
> > I haven't had time to look at this in detail, but presumably the
> > problem is that we're no longer getting huge pages from our
> > private tmpfs mount (done in i915_gemfs_init()).
> 
> IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE 
> in the shmem_pwrite(),

pwrite is just one random way to write to objects, and probably
not something that's even used by current Mesa.

> which prevents tmpfs from allocating large 
> folios. As mentioned in the comments below, tmpfs like other file 
> systems that support large folios, will allow getting a highest order 
> hint based on the size of the write and fallocate paths, and then will 
> attempt each allowable huge order.
> 
> Therefore, I think the shmem_pwrite() function should be changed to 
> remove the limitation that the write size cannot exceed PAGE_SIZE.
> 
> Something like the following code (untested):
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c 
> b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> index ae3343c81a64..97eefb73c5d2 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> @@ -420,6 +420,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
>          struct address_space *mapping = obj->base.filp->f_mapping;
>          const struct address_space_operations *aops = mapping->a_ops;
>          char __user *user_data = u64_to_user_ptr(arg->data_ptr);
> +       size_t chunk = mapping_max_folio_size(mapping);
>          u64 remain;
>          loff_t pos;
>          unsigned int pg;
> @@ -463,10 +464,10 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
>                  void *data, *vaddr;
>                  int err;
>                  char __maybe_unused c;
> +               size_t offset;
> 
> -               len = PAGE_SIZE - pg;
> -               if (len > remain)
> -                       len = remain;
> +               offset = pos & (chunk - 1);
> +               len = min(chunk - offset, remain);
> 
>                  /* Prefault the user page to reduce potential recursion */
>                  err = __get_user(c, user_data);

-- 
Ville Syrjälä
Intel
Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by Daniel Gomez 9 months, 1 week ago
On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
> On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
> > On 2025/4/30 01:44, Ville Syrjälä wrote:
> > > On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
> > > Hi,
> > > 
> > > This causes a huge regression in Intel iGPU texturing performance.
> > 
> > Unfortunately, I don't have such platform to test it.
> > 
> > > 
> > > I haven't had time to look at this in detail, but presumably the
> > > problem is that we're no longer getting huge pages from our
> > > private tmpfs mount (done in i915_gemfs_init()).
> > 
> > IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE 
> > in the shmem_pwrite(),
> 
> pwrite is just one random way to write to objects, and probably
> not something that's even used by current Mesa.
> 
> > which prevents tmpfs from allocating large 
> > folios. As mentioned in the comments below, tmpfs like other file 
> > systems that support large folios, will allow getting a highest order 
> > hint based on the size of the write and fallocate paths, and then will 
> > attempt each allowable huge order.
> > 
> > Therefore, I think the shmem_pwrite() function should be changed to 
> > remove the limitation that the write size cannot exceed PAGE_SIZE.

To enable mTHP on tmpfs, the necessary knobs must first be enabled in sysfs,
as they are not enabled by default IIRC (only THP, at the PMD level). Ville,
I see that i915_gemfs passes the huge=within_size mount option. Can you
confirm whether /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled are
also marked as 'always' when the regression is found?

Even if these are enabled, the possible difference may be that i915 was
previously always using PMD pages (THP), whereas now mTHP will be used unless
the file size is as big as a PMD page. I think the 'always' mount option
would also try to infer the size, to give a folio of the proper order for
that size. Baolin, is that correct?

And Ville, can you confirm whether what i915 needs is to always enable
PMD-sized allocations?
Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by Baolin Wang 9 months, 1 week ago

On 2025/4/30 21:24, Daniel Gomez wrote:
> On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
>> On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
>>> On 2025/4/30 01:44, Ville Syrjälä wrote:
>>>> On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
>>>> Hi,
>>>>
>>>> This causes a huge regression in Intel iGPU texturing performance.
>>>
>>> Unfortunately, I don't have such platform to test it.
>>>
>>>>
>>>> I haven't had time to look at this in detail, but presumably the
>>>> problem is that we're no longer getting huge pages from our
>>>> private tmpfs mount (done in i915_gemfs_init()).
>>>
>>> IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
>>> in the shmem_pwrite(),
>>
>> pwrite is just one random way to write to objects, and probably
>> not something that's even used by current Mesa.
>>
>>> which prevents tmpfs from allocating large
>>> folios. As mentioned in the comments below, tmpfs like other file
>>> systems that support large folios, will allow getting a highest order
>>> hint based on the size of the write and fallocate paths, and then will
>>> attempt each allowable huge order.
>>>
>>> Therefore, I think the shmem_pwrite() function should be changed to
>>> remove the limitation that the write size cannot exceed PAGE_SIZE.
> 
> To enable mTHP on tmpfs, the necessary knobs must first be enabled in sysfs
> as they are not enabled by default IIRC (only THP, PMD level). Ville, I
> see i915_gemfs the huge=within_size mount option is passed. Can you confirm
> if /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled are also marked as
> 'always' when the regression is found?

The tmpfs mount will not be controlled by 
'/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for 
the debugging options 'deny' and 'force').

> Even if these are enabled, the possible difference may be that before, i915 was
> using PMD pages (THP) always and now mTHP will be used, unless the file size is
> as big as the PMD page. I think the always mount option would also try to infer
> the size to actually give a proper order folio according to that size. Baolin,
> is that correct?

Right.

> And Ville, can you confirm if what i915 needs is to enable PMD-size allocations
> always?
Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by David Hildenbrand 9 months, 1 week ago
On 02.05.25 03:02, Baolin Wang wrote:
> 
> 
> On 2025/4/30 21:24, Daniel Gomez wrote:
>> On Wed, Apr 30, 2025 at 02:20:02PM +0100, Ville Syrjälä wrote:
>>> On Wed, Apr 30, 2025 at 02:32:39PM +0800, Baolin Wang wrote:
>>>> On 2025/4/30 01:44, Ville Syrjälä wrote:
>>>>> On Thu, Nov 28, 2024 at 03:40:41PM +0800, Baolin Wang wrote:
>>>>> Hi,
>>>>>
>>>>> This causes a huge regression in Intel iGPU texturing performance.
>>>>
>>>> Unfortunately, I don't have such platform to test it.
>>>>
>>>>>
>>>>> I haven't had time to look at this in detail, but presumably the
>>>>> problem is that we're no longer getting huge pages from our
>>>>> private tmpfs mount (done in i915_gemfs_init()).
>>>>
>>>> IIUC, the i915 driver still limits the maximum write size to PAGE_SIZE
>>>> in the shmem_pwrite(),
>>>
>>> pwrite is just one random way to write to objects, and probably
>>> not something that's even used by current Mesa.
>>>
>>>> which prevents tmpfs from allocating large
>>>> folios. As mentioned in the comments below, tmpfs like other file
>>>> systems that support large folios, will allow getting a highest order
>>>> hint based on the size of the write and fallocate paths, and then will
>>>> attempt each allowable huge order.
>>>>
>>>> Therefore, I think the shmem_pwrite() function should be changed to
>>>> remove the limitation that the write size cannot exceed PAGE_SIZE.
>>
>> To enable mTHP on tmpfs, the necessary knobs must first be enabled in sysfs
>> as they are not enabled by default IIRC (only THP, PMD level). Ville, I
>> see i915_gemfs the huge=within_size mount option is passed. Can you confirm
>> if /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled are also marked as
>> 'always' when the regression is found?
> 
> The tmpfs mount will not be controlled by
> '/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for
> the debugging options 'deny' and 'force').

Right, IIRC as requested by Willy, it should behave like other FSes 
where there is no control over the folio size to be used.

-- 
Cheers,

David / dhildenb

Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by Daniel Gomez 9 months, 1 week ago
On Fri, May 02, 2025 at 09:18:41AM +0100, David Hildenbrand wrote:
> On 02.05.25 03:02, Baolin Wang wrote:
> > 
> > 
> > On 2025/4/30 21:24, Daniel Gomez wrote:
> > > 
> > > To enable mTHP on tmpfs, the necessary knobs must first be enabled in
> > > sysfs, as they are not enabled by default IIRC (only THP, at PMD level).
> > > Ville, I see i915_gemfs passes the huge=within_size mount option. Can you
> > > confirm whether /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
> > > is also set to 'always' when the regression is seen?
> > 
> > The tmpfs mount will not be controlled by
> > '/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for
> > the debugging options 'deny' and 'force').
> 
> Right, IIRC as requested by Willy, it should behave like other FSes where
> there is no control over the folio size to be used.

Thanks for reminding me. I forgot we finally changed it.

Could the performance drop be due to the driver no longer using PMD-level pages?

I also recall a performance drop when using order-8 and order-9 folios in tmpfs
with the initial per-block implementation. Baolin, did you experience anything
similar in the final implementation?

These were my numbers:

| Block Size (bs) | Linux Kernel v6.9 (GiB/s) | tmpfs with Large Folios v6.9 (GiB/s) |
|-----------------|---------------------------|--------------------------------------|
| 4k   | 20.4 | 20.5 |
| 8k   | 34.3 | 34.3 |
| 16k  | 52.9 | 52.2 |
| 32k  | 70.2 | 76.9 |
| 64k  | 73.9 | 92.5 |
| 128k | 76.7 | 101  |
| 256k | 80.5 | 114  |
| 512k | 80.3 | 132  |
| 1M   | 78.5 | 75.2 |
| 2M   | 65.7 | 47.1 |

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by David Hildenbrand 9 months, 1 week ago
On 02.05.25 15:10, Daniel Gomez wrote:
> On Fri, May 02, 2025 at 09:18:41AM +0100, David Hildenbrand wrote:
>> On 02.05.25 03:02, Baolin Wang wrote:
>>> [...]
>>> The tmpfs mount will not be controlled by
>>> '/sys/kernel/mm/transparent_hugepage/hugepages-*Kb/enabled' (except for
>>> the debugging options 'deny' and 'force').
>>
>> Right, IIRC as requested by Willy, it should behave like other FSes where
>> there is no control over the folio size to be used.
> 
> Thanks for reminding me. I forgot we finally changed it.
> 
> Could the performance drop be due to the driver no longer using PMD-level pages?

I suspect that the faulting logic will now go to a smaller order first, 
indeed.

... trying to digest shmem_allowable_huge_orders() and 
shmem_huge_global_enabled(), I'm having a hard time isolating the 
tmpfs case: especially whether we run into the vma vs. !vma case here.

Without a VMA, I think we should have "tmpfs will allow getting a highest 
order hint based on the size of the write and fallocate paths, then will 
try each allowable order".

With a VMA (no access hint), "we still use PMD-sized order to locate 
huge pages due to lack of a write size hint."

So if we get a fallocate()/write() that is, say, 1 MiB, we'd now 
allocate a 1 MiB folio instead of a 2 MiB one.

-- 
Cheers,

David / dhildenb

Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by Baolin Wang 9 months ago

On 2025/5/2 23:31, David Hildenbrand wrote:
> On 02.05.25 15:10, Daniel Gomez wrote:
>> [...]
>> Thanks for reminding me. I forgot we finally changed it.
>>
>> Could the performance drop be due to the driver no longer using 
>> PMD-level pages?
> 
> I suspect that the faulting logic will now go to a smaller order first, 
> indeed.
> 
> ... trying to digest shmem_allowable_huge_orders() and 
> shmem_huge_global_enabled(), having a hard time trying to isolate the 
> tmpfs case: especially, if we run here into the vma vs. !vma case.
> 
> Without a VMA, I think we should have "tmpfs will allow getting a highest 
> order hint based on the size of the write and fallocate paths, then will 
> try each allowable order".
> 
> With a VMA (no access hint), "we still use PMD-sized order to locate 
> huge pages due to lack of a write size hint."
> 
> So if we get a fallocate()/write() that is, say, 1 MiB, we'd now 
> allocate a 1 MiB folio instead of a 2 MiB one.

Right.

So I've asked Ville how the shmem folios are allocated in the i915 driver, 
to see if we can make some improvements.
Re: [REGRESSION] Re: [PATCH v3 3/6] mm: shmem: add large folio support for tmpfs
Posted by David Hildenbrand 9 months ago
On 06.05.25 05:33, Baolin Wang wrote:
>> [...]
>> With a VMA (no access hint), "we still use PMD-sized order to locate
>> huge pages due to lack of a write size hint."
>>
>> So if we get a fallocate()/write() that is, say, 1 MiB, we'd now
>> allocate a 1 MiB folio instead of a 2 MiB one.
> 
> Right.
> 
> So I've asked Ville how the shmem folios are allocated in the i915 driver,
> to see if we can make some improvements.

Maybe preallocation (using fallocate) would be reasonable for their use 
case, if they know they will consume all that memory anyway. If the object 
is sparse, it's more problematic.

-- 
Cheers,

David / dhildenb