[PATCH] xfs: add tunable threshold parameter for triggering zone GC

Hans Holmberg posted 1 patch 8 months, 4 weeks ago
Documentation/admin-guide/xfs.rst | 21 ++++++++++++++++++++
fs/xfs/xfs_mount.h                |  1 +
fs/xfs/xfs_sysfs.c                | 32 +++++++++++++++++++++++++++++++
fs/xfs/xfs_zone_alloc.c           |  7 +++++++
fs/xfs/xfs_zone_gc.c              | 16 ++++++++++++++--
5 files changed, 75 insertions(+), 2 deletions(-)
[PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Hans Holmberg 8 months, 4 weeks ago
Presently we start garbage collection late - when we start running
out of free zones to backfill max_open_zones. This is a reasonable
default as it minimizes write amplification. The longer we wait,
the more blocks are invalidated and reclaim cost less in terms
of blocks to relocate.

Starting this late however introduces a risk of GC being outcompeted
by user writes. If GC can't keep up, user writes will be forced to
wait for free zones with high tail latencies as a result.

This is not a problem under normal circumstances, but if fragmentation
is bad and user write pressure is high (multiple full-throttle
writers) we will "bottom out" of free zones.

To mitigate this, introduce a zonegc_low_space tunable that lets the
user specify a percentage of how much of the unused space that GC
should keep available for writing. A high value will reclaim more of
the space occupied by unused blocks, creating a larger buffer against
write bursts.

This comes at a cost as write amplification is increased. To
illustrate this using a sample workload, setting zonegc_low_space to
60% avoids high (500ms) max latencies while increasing write
amplification by 15%.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
---
Changes delta RFC:
- Turned the mount option into a sysfs tunable (as Dave suggested)
- Used mult_frac to avoid overflows (thanks Dave)
- Added xfs admin-guide documentation, including max_open_zones
- Documented the default value and set it explicitly
- Link to RFC: https://lore.kernel.org/all/20250319081818.6406-1-hans.holmberg@wdc.com/T/

I just realized that the max_open_zones *mount option* is not documented
in the admin guide, but that should probably be added as a separate patch.

 Documentation/admin-guide/xfs.rst | 21 ++++++++++++++++++++
 fs/xfs/xfs_mount.h                |  1 +
 fs/xfs/xfs_sysfs.c                | 32 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_zone_alloc.c           |  7 +++++++
 fs/xfs/xfs_zone_gc.c              | 16 ++++++++++++++--
 5 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index b67772cf36d6..20746e795477 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -542,3 +542,24 @@ The interesting knobs for XFS workqueues are as follows:
   nice           Relative priority of scheduling the threads.  These are the
                  same nice levels that can be applied to userspace processes.
 ============     ===========
+
+Zoned Filesystems
+=================
+
+For zoned file systems, the following attributes are exposed in:
+
+ /sys/fs/xfs/<dev>/zoned/
+
+ max_open_zones                 (Min:  1  Default:  Varies  Max:  UINTMAX)
+        This read-only attribute exposes the maximum number of open zones
+        available for data placement. The value is determined at mount time and
+        is limited by the capabilities of the backing zoned device, file system
+        size and the max_open_zones mount option.
+
+ zonegc_low_space               (Min:  0  Default:  0  Max:  100)
+        Define a percentage for how much of the unused space that GC should keep
+        available for writing. A high value will reclaim more of the space
+        occupied by unused blocks, creating a larger buffer against write
+        bursts at the cost of increased write amplification.  Regardless
+        of this value, garbage collection will always aim to free a minimum
+        amount of blocks to keep max_open_zones open for data placement purposes.
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 799b84220ebb..e5192c12e7ac 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -229,6 +229,7 @@ typedef struct xfs_mount {
 	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	bool			m_update_sb;	/* sb needs update in mount */
 	unsigned int		m_max_open_zones;
+	unsigned int		m_zonegc_low_space;
 
 	/*
 	 * Bitsets of per-fs metadata that have been checked and/or are sick.
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index b0857e3c1270..40bfe0a7790e 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -718,8 +718,40 @@ max_open_zones_show(
 }
 XFS_SYSFS_ATTR_RO(max_open_zones);
 
+static ssize_t
+zonegc_low_space_store(
+	struct kobject		*kobj,
+	const char		*buf,
+	size_t			count)
+{
+	int			ret;
+	unsigned int		val;
+
+	ret = kstrtouint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val > 100)
+		return -EINVAL;
+
+	zoned_to_mp(kobj)->m_zonegc_low_space = val;
+
+	return count;
+}
+
+static ssize_t
+zonegc_low_space_show(
+	struct kobject		*kobj,
+	char			*buf)
+{
+	return sysfs_emit(buf, "%u\n",
+			zoned_to_mp(kobj)->m_zonegc_low_space);
+}
+XFS_SYSFS_ATTR_RW(zonegc_low_space);
+
 static struct attribute *xfs_zoned_attrs[] = {
 	ATTR_LIST(max_open_zones),
+	ATTR_LIST(zonegc_low_space),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_zoned);
diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
index 734b73470ef9..31c2803bc8c9 100644
--- a/fs/xfs/xfs_zone_alloc.c
+++ b/fs/xfs/xfs_zone_alloc.c
@@ -1195,6 +1195,13 @@ xfs_mount_zones(
 	xfs_set_freecounter(mp, XC_FREE_RTEXTENTS,
 			iz.available + iz.reclaimable);
 
+	/*
+	 * The user may configure GC to free up a percentage of unused blocks.
+	 * By default this is 0. GC will always trigger at the minimum level
+	 * for keeping max_open_zones available for data placement.
+	 */
+	mp->m_zonegc_low_space = 0;
+
 	error = xfs_zone_gc_mount(mp);
 	if (error)
 		goto out_free_zone_info;
diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
index a4abaac0fbc5..dd706635e3c1 100644
--- a/fs/xfs/xfs_zone_gc.c
+++ b/fs/xfs/xfs_zone_gc.c
@@ -162,18 +162,30 @@ struct xfs_zone_gc_data {
 
 /*
  * We aim to keep enough zones free in stock to fully use the open zone limit
- * for data placement purposes.
+ * for data placement purposes. Additionally, the m_zonegc_low_space tunable
+ * can be set to make sure a fraction of the unused blocks are available for
+ * writing.
  */
 bool
 xfs_zoned_need_gc(
 	struct xfs_mount	*mp)
 {
+	s64			available, free;
+
 	if (!xfs_group_marked(mp, XG_TYPE_RTG, XFS_RTG_RECLAIMABLE))
 		return false;
-	if (xfs_estimate_freecounter(mp, XC_FREE_RTAVAILABLE) <
+
+	available = xfs_estimate_freecounter(mp, XC_FREE_RTAVAILABLE);
+
+	if (available <
 	    mp->m_groups[XG_TYPE_RTG].blocks *
 	    (mp->m_max_open_zones - XFS_OPEN_GC_ZONES))
 		return true;
+
+	free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
+	if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
+		return true;
+
 	return false;
 }
 
-- 
2.34.1
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Guenter Roeck 8 months ago
On Tue, Mar 25, 2025 at 09:10:49AM +0000, Hans Holmberg wrote:
> Presently we start garbage collection late - when we start running
> out of free zones to backfill max_open_zones. This is a reasonable
> default as it minimizes write amplification. The longer we wait,
> the more blocks are invalidated and reclaim cost less in terms
> of blocks to relocate.
> 
> Starting this late however introduces a risk of GC being outcompeted
> by user writes. If GC can't keep up, user writes will be forced to
> wait for free zones with high tail latencies as a result.
> 
> This is not a problem under normal circumstances, but if fragmentation
> is bad and user write pressure is high (multiple full-throttle
> writers) we will "bottom out" of free zones.
> 
> To mitigate this, introduce a zonegc_low_space tunable that lets the
> user specify a percentage of how much of the unused space that GC
> should keep available for writing. A high value will reclaim more of
> the space occupied by unused blocks, creating a larger buffer against
> write bursts.
> 
> This comes at a cost as write amplification is increased. To
> illustrate this using a sample workload, setting zonegc_low_space to
> 60% avoids high (500ms) max latencies while increasing write
> amplification by 15%.
> 
...
>  bool
>  xfs_zoned_need_gc(
>  	struct xfs_mount	*mp)
>  {
> +	s64			available, free;
> +
...
> +
> +	free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
> +	if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
> +		return true;
> +

With some 32-bit builds (parisc, openrisc so far):

Error log:
ERROR: modpost: "__divdi3" [fs/xfs/xfs.ko] undefined!
ERROR: modpost: "__umoddi3" [fs/xfs/xfs.ko] undefined!
ERROR: modpost: "__moddi3" [fs/xfs/xfs.ko] undefined!
ERROR: modpost: "__udivdi3" [fs/xfs/xfs.ko] undefined!

Guenter
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Carlos Maiolino 8 months ago
On Sun, Apr 20, 2025 at 02:47:02AM -0700, Guenter Roeck wrote:
> On Tue, Mar 25, 2025 at 09:10:49AM +0000, Hans Holmberg wrote:
> > Presently we start garbage collection late - when we start running
> > out of free zones to backfill max_open_zones. This is a reasonable
> > default as it minimizes write amplification. The longer we wait,
> > the more blocks are invalidated and reclaim cost less in terms
> > of blocks to relocate.
> >
> > Starting this late however introduces a risk of GC being outcompeted
> > by user writes. If GC can't keep up, user writes will be forced to
> > wait for free zones with high tail latencies as a result.
> >
> > This is not a problem under normal circumstances, but if fragmentation
> > is bad and user write pressure is high (multiple full-throttle
> > writers) we will "bottom out" of free zones.
> >
> > To mitigate this, introduce a zonegc_low_space tunable that lets the
> > user specify a percentage of how much of the unused space that GC
> > should keep available for writing. A high value will reclaim more of
> > the space occupied by unused blocks, creating a larger buffer against
> > write bursts.
> >
> > This comes at a cost as write amplification is increased. To
> > illustrate this using a sample workload, setting zonegc_low_space to
> > 60% avoids high (500ms) max latencies while increasing write
> > amplification by 15%.
> >
> ...
> >  bool
> >  xfs_zoned_need_gc(
> >  	struct xfs_mount	*mp)
> >  {
> > +	s64			available, free;
> > +
> ...
> > +
> > +	free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
> > +	if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
> > +		return true;
> > +
> 
> With some 32-bit builds (parisc, openrisc so far):
> 
> Error log:
> ERROR: modpost: "__divdi3" [fs/xfs/xfs.ko] undefined!
> ERROR: modpost: "__umoddi3" [fs/xfs/xfs.ko] undefined!
> ERROR: modpost: "__moddi3" [fs/xfs/xfs.ko] undefined!
> ERROR: modpost: "__udivdi3" [fs/xfs/xfs.ko] undefined!
> 

I opened a discussion about this:

https://lore.kernel.org/lkml/20250419115157.567249-1-cem@kernel.org/
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Guenter Roeck 8 months ago
On 4/20/25 03:53, Carlos Maiolino wrote:
> On Sun, Apr 20, 2025 at 02:47:02AM -0700, Guenter Roeck wrote:
>> On Tue, Mar 25, 2025 at 09:10:49AM +0000, Hans Holmberg wrote:
>>> Presently we start garbage collection late - when we start running
>>> out of free zones to backfill max_open_zones. This is a reasonable
>>> default as it minimizes write amplification. The longer we wait,
>>> the more blocks are invalidated and reclaim cost less in terms
>>> of blocks to relocate.
>>>
>>> Starting this late however introduces a risk of GC being outcompeted
>>> by user writes. If GC can't keep up, user writes will be forced to
>>> wait for free zones with high tail latencies as a result.
>>>
>>> This is not a problem under normal circumstances, but if fragmentation
>>> is bad and user write pressure is high (multiple full-throttle
>>> writers) we will "bottom out" of free zones.
>>>
>>> To mitigate this, introduce a zonegc_low_space tunable that lets the
>>> user specify a percentage of how much of the unused space that GC
>>> should keep available for writing. A high value will reclaim more of
>>> the space occupied by unused blocks, creating a larger buffer against
>>> write bursts.
>>>
>>> This comes at a cost as write amplification is increased. To
>>> illustrate this using a sample workload, setting zonegc_low_space to
>>> 60% avoids high (500ms) max latencies while increasing write
>>> amplification by 15%.
>>>
>> ...
>>>   bool
>>>   xfs_zoned_need_gc(
>>>   	struct xfs_mount	*mp)
>>>   {
>>> +	s64			available, free;
>>> +
>> ...
>>> +
>>> +	free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
>>> +	if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
>>> +		return true;
>>> +
>>
>> With some 32-bit builds (parisc, openrisc so far):
>>
>> Error log:
>> ERROR: modpost: "__divdi3" [fs/xfs/xfs.ko] undefined!
>> ERROR: modpost: "__umoddi3" [fs/xfs/xfs.ko] undefined!
>> ERROR: modpost: "__moddi3" [fs/xfs/xfs.ko] undefined!
>> ERROR: modpost: "__udivdi3" [fs/xfs/xfs.ko] undefined!
>>
> 
> I opened a discussion about this:
> 
> https://lore.kernel.org/lkml/20250419115157.567249-1-cem@kernel.org/

A possible local solution is below. Note the variable type change from s64 to u64.

Guenter
---

diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
index 8c541ca71872..6dde2a680e75 100644
--- a/fs/xfs/xfs_zone_gc.c
+++ b/fs/xfs/xfs_zone_gc.c
@@ -170,7 +170,7 @@ bool
  xfs_zoned_need_gc(
         struct xfs_mount        *mp)
  {
-       s64                     available, free;
+       u64                     available, free, rem;

         if (!xfs_group_marked(mp, XG_TYPE_RTG, XFS_RTG_RECLAIMABLE))
                 return false;
@@ -183,7 +183,12 @@ xfs_zoned_need_gc(
                 return true;

         free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
-       if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
+
+       rem = do_div(free, 100);
+       free = free * mp->m_zonegc_low_space +
+               div_u64(rem * mp->m_zonegc_low_space, 100);
+
+       if (available < free)
                 return true;

         return false;
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Geert Uytterhoeven 7 months, 4 weeks ago
On Sun, 20 Apr 2025 at 19:43, Guenter Roeck <linux@roeck-us.net> wrote:
> On 4/20/25 03:53, Carlos Maiolino wrote:
> > On Sun, Apr 20, 2025 at 02:47:02AM -0700, Guenter Roeck wrote:
> >> On Tue, Mar 25, 2025 at 09:10:49AM +0000, Hans Holmberg wrote:
> >>> Presently we start garbage collection late - when we start running
> >>> out of free zones to backfill max_open_zones. This is a reasonable
> >>> default as it minimizes write amplification. The longer we wait,
> >>> the more blocks are invalidated and reclaim cost less in terms
> >>> of blocks to relocate.
> >>>
> >>> Starting this late however introduces a risk of GC being outcompeted
> >>> by user writes. If GC can't keep up, user writes will be forced to
> >>> wait for free zones with high tail latencies as a result.
> >>>
> >>> This is not a problem under normal circumstances, but if fragmentation
> >>> is bad and user write pressure is high (multiple full-throttle
> >>> writers) we will "bottom out" of free zones.
> >>>
> >>> To mitigate this, introduce a zonegc_low_space tunable that lets the
> >>> user specify a percentage of how much of the unused space that GC
> >>> should keep available for writing. A high value will reclaim more of
> >>> the space occupied by unused blocks, creating a larger buffer against
> >>> write bursts.
> >>>
> >>> This comes at a cost as write amplification is increased. To
> >>> illustrate this using a sample workload, setting zonegc_low_space to
> >>> 60% avoids high (500ms) max latencies while increasing write
> >>> amplification by 15%.
> >>>
> >> ...
> >>>   bool
> >>>   xfs_zoned_need_gc(
> >>>     struct xfs_mount        *mp)
> >>>   {
> >>> +   s64                     available, free;
> >>> +
> >> ...
> >>> +
> >>> +   free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
> >>> +   if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
> >>> +           return true;
> >>> +
> >>
> >> With some 32-bit builds (parisc, openrisc so far):
> >>
> >> Error log:
> >> ERROR: modpost: "__divdi3" [fs/xfs/xfs.ko] undefined!
> >> ERROR: modpost: "__umoddi3" [fs/xfs/xfs.ko] undefined!
> >> ERROR: modpost: "__moddi3" [fs/xfs/xfs.ko] undefined!
> >> ERROR: modpost: "__udivdi3" [fs/xfs/xfs.ko] undefined!
> >>
> >
> > I opened a discussion about this:
> >
> > https://lore.kernel.org/lkml/20250419115157.567249-1-cem@kernel.org/
>
> A possible local solution is below. Note the variable type change from s64 to u64.
>
> Guenter
> ---
>
> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index 8c541ca71872..6dde2a680e75 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -170,7 +170,7 @@ bool
>   xfs_zoned_need_gc(
>          struct xfs_mount        *mp)
>   {
> -       s64                     available, free;
> +       u64                     available, free, rem;
>
>          if (!xfs_group_marked(mp, XG_TYPE_RTG, XFS_RTG_RECLAIMABLE))
>                  return false;
> @@ -183,7 +183,12 @@ xfs_zoned_need_gc(
>                  return true;
>
>          free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
> -       if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
> +
> +       rem = do_div(free, 100);
> +       free = free * mp->m_zonegc_low_space +
> +               div_u64(rem * mp->m_zonegc_low_space, 100);
> +
> +       if (available < free)
>                  return true;
>
>          return false;

There's also mul_u64_u32_div():

    static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
    ...

Unsigned, too.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by hch 8 months ago
On Sun, Apr 20, 2025 at 10:42:56AM -0700, Guenter Roeck wrote:
> A possible local solution is below. Note the variable type change from s64 to u64.

I think that'll need a lower bound of 0 thrown in to be safe as these
counters can occasionally underflow.

Otherwise this is probably the right thing to do for now until mult_frac
gets fixed eventually.  Can you add a comment why this open codes
mult_frac to the code and send a formal patch for it?
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Guenter Roeck 8 months ago
On 4/21/25 01:31, hch wrote:
> On Sun, Apr 20, 2025 at 10:42:56AM -0700, Guenter Roeck wrote:
>> A possible local solution is below. Note the variable type change from s64 to u64.
> 
> I think that'll need a lower bound of 0 thrown in to be safe as these
> counters can occasionally underflow.
> 
> Otherwise this is probably the right thing to do for now until mult_frac
> gets fixed eventually.  Can you add a comment why this open codes
> mult_frac to the code and send a formal patch for it?
> 

Technically only free needs to be u64 for do_div to work. But that makes
me wonder what the function is supposed to return if free < 0.

Note that the proposed 64-bit safe version of mult_frac() also requires
the dividend to be unsigned. That is a quirk of do_div().

Thanks,
Guenter
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by hch 8 months ago
On Mon, Apr 21, 2025 at 06:41:43AM -0700, Guenter Roeck wrote:
> On 4/21/25 01:31, hch wrote:
>> On Sun, Apr 20, 2025 at 10:42:56AM -0700, Guenter Roeck wrote:
>>> A possible local solution is below. Note the variable type change from s64 to u64.
>>
>> I think that'll need a lower bound of 0 thrown in to be safe as these
>> counters can occasionally underflow.
>>
>> Otherwise this is probably the right thing to do for now until mult_frac
>> gets fixed eventually.  Can you add a comment why this open codes
>> mult_frac to the code and send a formal patch for it?
>>
>
> Technically only free needs to be u64 for do_div to work. But that makes
> me wonder what the function is supposed to return if free < 0.

free should be floored to zero, i.e.

	free = min(0, xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS));
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Guenter Roeck 8 months ago
On 4/21/25 22:48, hch wrote:
> On Mon, Apr 21, 2025 at 06:41:43AM -0700, Guenter Roeck wrote:
>> On 4/21/25 01:31, hch wrote:
>>> On Sun, Apr 20, 2025 at 10:42:56AM -0700, Guenter Roeck wrote:
>>>> A possible local solution is below. Note the variable type change from s64 to u64.
>>>
>>> I think that'll need a lower bound of 0 thrown in to be safe as these
>>> counters can occasionally underflow.
>>>
>>> Otherwise this is probably the right thing to do for now until mult_frac
>>> gets fixed eventually.  Can you add a comment why this open codes
>>> mult_frac to the code and send a formal patch for it?
>>>
>>
>> Technically only free needs to be u64 for do_div to work. But that makes
>> me wonder what the function is supposed to return if free < 0.
> 
> free should be floored to zero, i.e.
> 
> 	free = min(0, xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS));
> 

Do you mean max, maybe ?

Guenter
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by hch 8 months ago
On Mon, Apr 21, 2025 at 10:57:31PM -0700, Guenter Roeck wrote:
>> free should be floored to zero, i.e.
>>
>> 	free = min(0, xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS));
>>
>
> Do you mean max, maybe ?

Yes, sorry.

Also if you want the work taken off your hands I can prepare a patch
as well, I just don't want to do that without approval from the
original author.
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Carlos Maiolino 8 months ago
On Tue, Apr 22, 2025 at 08:01:37AM +0200, hch wrote:
> On Mon, Apr 21, 2025 at 10:57:31PM -0700, Guenter Roeck wrote:
> >> free should be floored to zero, i.e.
> >>
> >> 	free = min(0, xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS));
> >>
> >
> > Do you mean max, maybe ?
> 
> Yes, sorry.
> 
> Also if you want the work taken off your hands I can prepare a patch
> as well, I just don't want to do that without approval from the
> original author.

Just noticed it, I shouldn't read emails backwards on time :)
I had the same thoughts as hch though. But giving Guenter's last message...

hch, do you want to write a patch for that or should I?

btw, Guenter, I still plan to work on a generic mult_frac() to properly fix
this. Could you please share a reproducer for your case so I can test it on the
same architectures you're doing so? Hopefully it can be cross compiled?
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by hch 8 months ago
On Tue, Apr 22, 2025 at 09:59:52AM +0200, Carlos Maiolino wrote:
> hch, do you want to write a patch for that or should I?

If you have time today please do it.  Otherwise I'll try to get to it
later today.
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Guenter Roeck 8 months ago
On 4/21/25 23:01, hch wrote:
> On Mon, Apr 21, 2025 at 10:57:31PM -0700, Guenter Roeck wrote:
>>> free should be floored to zero, i.e.
>>>
>>> 	free = min(0, xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS));
>>>
>>
>> Do you mean max, maybe ?
> 
> Yes, sorry.
> 
> Also if you want the work taken off your hands I can prepare a patch
> as well, I just don't want to do that without approval from the
> original author.

I didn't actually plan to write a patch myself. I thought that either
Hans or Carlos would do that. Sorry if my replies caused a misunderstanding.

Guenter
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Carlos Maiolino 8 months ago
On Mon, Apr 21, 2025 at 11:22:47PM -0700, Guenter Roeck wrote:
> On 4/21/25 23:01, hch wrote:
> > On Mon, Apr 21, 2025 at 10:57:31PM -0700, Guenter Roeck wrote:
> >>> free should be floored to zero, i.e.
> >>>
> >>> 	free = min(0, xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS));
> >>>
> >>
> >> Do you mean max, maybe ?
> >
> > Yes, sorry.
> >
> > Also if you want the work taken off your hands I can prepare a patch
> > as well, I just don't want to do that without approval from the
> > original author.
> 
> I didn't actually plan to write a patch myself. I thought that either
> Hans or Carlos would do that. Sorry if my replies caused a misunderstanding.

I can do that, don't worry then. I'll craft something around this week.


> 
> Guenter
> 
>
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Carlos Maiolino 8 months ago
On Sun, Apr 20, 2025 at 10:42:56AM -0700, Guenter Roeck wrote:
> On 4/20/25 03:53, Carlos Maiolino wrote:
> > On Sun, Apr 20, 2025 at 02:47:02AM -0700, Guenter Roeck wrote:
> >> On Tue, Mar 25, 2025 at 09:10:49AM +0000, Hans Holmberg wrote:
> >>> Presently we start garbage collection late - when we start running
> >>> out of free zones to backfill max_open_zones. This is a reasonable
> >>> default as it minimizes write amplification. The longer we wait,
> >>> the more blocks are invalidated and reclaim cost less in terms
> >>> of blocks to relocate.
> >>>
> >>> Starting this late however introduces a risk of GC being outcompeted
> >>> by user writes. If GC can't keep up, user writes will be forced to
> >>> wait for free zones with high tail latencies as a result.
> >>>
> >>> This is not a problem under normal circumstances, but if fragmentation
> >>> is bad and user write pressure is high (multiple full-throttle
> >>> writers) we will "bottom out" of free zones.
> >>>
> >>> To mitigate this, introduce a zonegc_low_space tunable that lets the
> >>> user specify a percentage of how much of the unused space that GC
> >>> should keep available for writing. A high value will reclaim more of
> >>> the space occupied by unused blocks, creating a larger buffer against
> >>> write bursts.
> >>>
> >>> This comes at a cost as write amplification is increased. To
> >>> illustrate this using a sample workload, setting zonegc_low_space to
> >>> 60% avoids high (500ms) max latencies while increasing write
> >>> amplification by 15%.
> >>>
> >> ...
> >>>   bool
> >>>   xfs_zoned_need_gc(
> >>>   	struct xfs_mount	*mp)
> >>>   {
> >>> +	s64			available, free;
> >>> +
> >> ...
> >>> +
> >>> +	free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
> >>> +	if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
> >>> +		return true;
> >>> +
> >>
> >> With some 32-bit builds (parisc, openrisc so far):
> >>
> >> Error log:
> >> ERROR: modpost: "__divdi3" [fs/xfs/xfs.ko] undefined!
> >> ERROR: modpost: "__umoddi3" [fs/xfs/xfs.ko] undefined!
> >> ERROR: modpost: "__moddi3" [fs/xfs/xfs.ko] undefined!
> >> ERROR: modpost: "__udivdi3" [fs/xfs/xfs.ko] undefined!
> >>
> >
> > I opened a discussion about this:
> >
> > https://lore.kernel.org/lkml/20250419115157.567249-1-cem@kernel.org/
> 
> A possible local solution is below. Note the variable type change from s64 to u64.
> 
> Guenter
> ---
> 
> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index 8c541ca71872..6dde2a680e75 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -170,7 +170,7 @@ bool
>   xfs_zoned_need_gc(
>          struct xfs_mount        *mp)
>   {
> -       s64                     available, free;
> +       u64                     available, free, rem;
> 
>          if (!xfs_group_marked(mp, XG_TYPE_RTG, XFS_RTG_RECLAIMABLE))
>                  return false;
> @@ -183,7 +183,12 @@ xfs_zoned_need_gc(
>                  return true;
> 
>          free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
> -       if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
> +
> +       rem = do_div(free, 100);
> +       free = free * mp->m_zonegc_low_space +
> +               div_u64(rem * mp->m_zonegc_low_space, 100);
> +
> +       if (available < free)
>                  return true;

You're essentially open coding mult_frac(), if we can get mult_frac() to work
on 64-bit too (or add a 64-bit version), that seems a better generic solution.


> 
>          return false;
> 
>
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Guenter Roeck 8 months ago
On 4/20/25 11:07, Carlos Maiolino wrote:
...
>>
>> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
>> index 8c541ca71872..6dde2a680e75 100644
>> --- a/fs/xfs/xfs_zone_gc.c
>> +++ b/fs/xfs/xfs_zone_gc.c
>> @@ -170,7 +170,7 @@ bool
>>    xfs_zoned_need_gc(
>>           struct xfs_mount        *mp)
>>    {
>> -       s64                     available, free;
>> +       u64                     available, free, rem;
>>
>>           if (!xfs_group_marked(mp, XG_TYPE_RTG, XFS_RTG_RECLAIMABLE))
>>                   return false;
>> @@ -183,7 +183,12 @@ xfs_zoned_need_gc(
>>                   return true;
>>
>>           free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
>> -       if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
>> +
>> +       rem = do_div(free, 100);
>> +       free = free * mp->m_zonegc_low_space +
>> +               div_u64(rem * mp->m_zonegc_low_space, 100);
>> +
>> +       if (available < free)
>>                   return true;
> 
> You're essentially open coding mult_frac(), if we can get mult_frac() to work
> on 64-bit too (or add a 64-bit version), that seems a better generic solution.
> 

Yes, I know. Problem is that getting more than one maintainer involved tends to make
it exponentially more difficult to get anything accepted. With that in mind, I prefer
open coded solutions like the one I suggested above. A generic solution is then still
possible, but it is disconnected from solving the immediate problem.

Guenter
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Carlos Maiolino 8 months ago
On Sun, Apr 20, 2025 at 12:13:34PM -0700, Guenter Roeck wrote:
> On 4/20/25 11:07, Carlos Maiolino wrote:
> ...
> >>
> >> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> >> index 8c541ca71872..6dde2a680e75 100644
> >> --- a/fs/xfs/xfs_zone_gc.c
> >> +++ b/fs/xfs/xfs_zone_gc.c
> >> @@ -170,7 +170,7 @@ bool
> >>    xfs_zoned_need_gc(
> >>           struct xfs_mount        *mp)
> >>    {
> >> -       s64                     available, free;
> >> +       u64                     available, free, rem;
> >>
> >>           if (!xfs_group_marked(mp, XG_TYPE_RTG, XFS_RTG_RECLAIMABLE))
> >>                   return false;
> >> @@ -183,7 +183,12 @@ xfs_zoned_need_gc(
> >>                   return true;
> >>
> >>           free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
> >> -       if (available < mult_frac(free, mp->m_zonegc_low_space, 100))
> >> +
> >> +       rem = do_div(free, 100);
> >> +       free = free * mp->m_zonegc_low_space +
> >> +               div_u64(rem * mp->m_zonegc_low_space, 100);
> >> +
> >> +       if (available < free)
> >>                   return true;
> >
> > You're essentially open coding mult_frac(), if we can get mult_frac() to work
> > on 64-bit too (or add a 64-bit version), that seems a better generic solution.
> >
> 
> Yes, I know. Problem is that getting more than one maintainer involved tends to make
> it exponentially more difficult to get anything accepted. With that in mind, I prefer
> open coded solutions like the one I suggested above. A generic solution is then still
> possible, but it is disconnected from solving the immediate problem.
> 

I think this is fair for the moment, unless Hans/Christoph have a better idea?!

> Guenter
>
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Carlos Maiolino 8 months ago
On Tue, 25 Mar 2025 09:10:49 +0000, Hans Holmberg wrote:
> Presently we start garbage collection late - when we start running
> out of free zones to backfill max_open_zones. This is a reasonable
> default as it minimizes write amplification. The longer we wait,
> the more blocks are invalidated and reclaim cost less in terms
> of blocks to relocate.
> 
> Starting this late however introduces a risk of GC being outcompeted
> by user writes. If GC can't keep up, user writes will be forced to
> wait for free zones with high tail latencies as a result.
> 
> [...]

Applied to for-next, thanks!

[1/1] xfs: add tunable threshold parameter for triggering zone GC
      commit: 845abeb1f06a8a44e21314460eeb14cddfca52cc

Best regards,
-- 
Carlos Maiolino <cem@kernel.org>
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by hch 8 months, 3 weeks ago
On Tue, Mar 25, 2025 at 09:10:49AM +0000, Hans Holmberg wrote:
> +Zoned Filesystems
> +=================
> +
> +For zoned file systems, the following attributes are exposed in:
> +
> + /sys/fs/xfs/<dev>/zoned/
> +
> + max_open_zones                 (Min:  1  Default:  Varies  Max:  UINTMAX)
> +        This read-only attribute exposes the maximum number of open zones
> +        available for data placement. The value is determined at mount time and
> +        is limited by the capabilities of the backing zoned device, file system
> +        size and the max_open_zones mount option.

This should go into 6.15-rc as a separate patch to fix my mistake of not
adding documentation for this file.  (Thanks for fixing that!)

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Re: [PATCH] xfs: add tunable threshold parameter for triggering zone GC
Posted by Hans Holmberg 8 months, 3 weeks ago
On 28/03/2025 12:01, hch wrote:
> On Tue, Mar 25, 2025 at 09:10:49AM +0000, Hans Holmberg wrote:
>> +Zoned Filesystems
>> +=================
>> +
>> +For zoned file systems, the following attributes are exposed in:
>> +
>> + /sys/fs/xfs/<dev>/zoned/
>> +
>> + max_open_zones                 (Min:  1  Default:  Varies  Max:  UINTMAX)
>> +        This read-only attribute exposes the maximum number of open zones
>> +        available for data placement. The value is determined at mount time and
>> +        is limited by the capabilities of the backing zoned device, file system
>> +        size and the max_open_zones mount option.
> 
> This should go into 6.15-rc as a separate patch to fix my mistake of not
> adding documentation for this file.  (Thanks for fixing that!)

Right, I might as well document the zoned mount options (lifetime, nolifetime,
max_open_zones) as part of that patch to make the doc completely up to date.