With the strictlimit flag, wb_thresh acts as a hard limit in
balance_dirty_pages() and wb_position_ratio(). When device write
operations are inactive, wb_thresh can drop to 0, causing writes to
be blocked. The issue occasionally occurs in fuse filesystems,
particularly with network backends, where the write thread is
blocked frequently for a period of time. To address it, this patch
raises the minimum wb_thresh to a controllable level, similar to the
non-strictlimit case.
Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com>
---
mm/page-writeback.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 72a5d8836425..f21d856c408b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -917,7 +917,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc,
unsigned long thresh)
{
struct wb_domain *dom = dtc_dom(dtc);
+ struct bdi_writeback *wb = dtc->wb;
u64 wb_thresh;
+ u64 wb_max_thresh;
unsigned long numerator, denominator;
unsigned long wb_min_ratio, wb_max_ratio;
@@ -931,11 +933,28 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc,
wb_thresh *= numerator;
wb_thresh = div64_ul(wb_thresh, denominator);
- wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);
+ wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE);
- if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE))
- wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE);
+ wb_max_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE);
+ if (wb_thresh > wb_max_thresh)
+ wb_thresh = wb_max_thresh;
+
+ /*
+ * With strictlimit flag, the wb_thresh is treated as
+ * a hard limit in balance_dirty_pages() and wb_position_ratio().
+ * It's possible that wb_thresh is close to zero, not because
+ * the device is slow, but because it has been inactive.
+ * To prevent occasional writes from being blocked, we raise wb_thresh.
+ */
+ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
+ unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
+ u64 wb_scale_thresh = 0;
+
+ if (limit > dtc->dirty)
+ wb_scale_thresh = (limit - dtc->dirty) / 100;
+ wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh / 4));
+ }
return wb_thresh;
}
--
2.34.1
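For readers who want to see the arithmetic of the v1 bump in isolation, here is a
minimal userspace sketch in plain C, not kernel code; the page counts and the helper
names max_ull()/min_ull() are hypothetical and only stand in for the kernel's
max()/min() macros.

#include <stdio.h>

static unsigned long long max_ull(unsigned long long a, unsigned long long b)
{
	return a > b ? a : b;
}

static unsigned long long min_ull(unsigned long long a, unsigned long long b)
{
	return a < b ? a : b;
}

int main(void)
{
	/* Hypothetical numbers, in pages. */
	unsigned long long limit = 160000;        /* hard_dirty_limit() */
	unsigned long long dirty = 20000;         /* dtc->dirty */
	unsigned long long wb_thresh = 0;         /* inactive wb: ~0 */
	unsigned long long wb_max_thresh = 16000; /* thresh * max_ratio share */

	/* v1: bump to 1% of the remaining dirty headroom, capped at a
	 * quarter of the max_ratio-derived threshold. */
	unsigned long long wb_scale_thresh = 0;

	if (limit > dirty)
		wb_scale_thresh = (limit - dirty) / 100;
	wb_thresh = max_ull(wb_thresh,
			    min_ull(wb_scale_thresh, wb_max_thresh / 4));

	printf("bumped wb_thresh = %llu pages\n", wb_thresh);
	return 0;
}

With these made-up numbers an otherwise idle strictlimit wb is lifted from 0 to
1400 pages, i.e. 1% of the remaining headroom, and never more than a quarter of
its max_ratio share.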
On Wed 23-10-24 18:00:32, Jim Zhao wrote: > With the strictlimit flag, wb_thresh acts as a hard limit in > balance_dirty_pages() and wb_position_ratio(). When device write > operations are inactive, wb_thresh can drop to 0, causing writes to > be blocked. The issue occasionally occurs in fuse fs, particularly > with network backends, the write thread is blocked frequently during > a period. To address it, this patch raises the minimum wb_thresh to a > controllable level, similar to the non-strictlimit case. > > Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> ... > + /* > + * With strictlimit flag, the wb_thresh is treated as > + * a hard limit in balance_dirty_pages() and wb_position_ratio(). > + * It's possible that wb_thresh is close to zero, not because > + * the device is slow, but because it has been inactive. > + * To prevent occasional writes from being blocked, we raise wb_thresh. > + */ > + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > + unsigned long limit = hard_dirty_limit(dom, dtc->thresh); > + u64 wb_scale_thresh = 0; > + > + if (limit > dtc->dirty) > + wb_scale_thresh = (limit - dtc->dirty) / 100; > + wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh / 4)); > + } What you propose makes sense in principle although I'd say this is mostly a userspace setup issue - with strictlimit enabled, you're kind of expected to set min_ratio exactly if you want to avoid these startup issues. But I tend to agree that we can provide a bit of a slack for a bdi without min_ratio configured to ramp up. But I'd rather pick the logic like: /* * If bdi does not have min_ratio configured and it was inactive, * bump its min_ratio to 0.1% to provide it some room to ramp up. */ if (!wb_min_ratio && !numerator) wb_min_ratio = min(BDI_RATIO_SCALE / 10, wb_max_ratio / 2); That would seem like a bit more systematic way than the formula you propose above... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
> On Wed 23-10-24 18:00:32, Jim Zhao wrote: > > With the strictlimit flag, wb_thresh acts as a hard limit in > > balance_dirty_pages() and wb_position_ratio(). When device write > > operations are inactive, wb_thresh can drop to 0, causing writes to > > be blocked. The issue occasionally occurs in fuse fs, particularly > > with network backends, the write thread is blocked frequently during > > a period. To address it, this patch raises the minimum wb_thresh to a > > controllable level, similar to the non-strictlimit case. > > > > Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> > > ... > > > + /* > > + * With strictlimit flag, the wb_thresh is treated as > > + * a hard limit in balance_dirty_pages() and wb_position_ratio(). > > + * It's possible that wb_thresh is close to zero, not because > > + * the device is slow, but because it has been inactive. > > + * To prevent occasional writes from being blocked, we raise wb_thresh. > > + */ > > + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > > + unsigned long limit = hard_dirty_limit(dom, dtc->thresh); > > + u64 wb_scale_thresh = 0; > > + > > + if (limit > dtc->dirty) > > + wb_scale_thresh = (limit - dtc->dirty) / 100; > > + wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh / 4)); > > + } > > What you propose makes sense in principle although I'd say this is mostly a > userspace setup issue - with strictlimit enabled, you're kind of expected > to set min_ratio exactly if you want to avoid these startup issues. But I > tend to agree that we can provide a bit of a slack for a bdi without > min_ratio configured to ramp up. > > But I'd rather pick the logic like: > > /* > * If bdi does not have min_ratio configured and it was inactive, > * bump its min_ratio to 0.1% to provide it some room to ramp up. > */ > if (!wb_min_ratio && !numerator) > wb_min_ratio = min(BDI_RATIO_SCALE / 10, wb_max_ratio / 2); > > That would seem like a bit more systematic way than the formula you propose > above... Thanks for the advice. Here's the explanation of the formula: 1. when writes are small and intermittent,wb_thresh can approach 0, not just 0, making the numerator value difficult to verify. 2. The ramp-up margin, whether 0.1% or another value, needs consideration. I based this on the logic of wb_position_ratio in the non-strictlimit scenario: wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); It seems provides more room and ensures ramping up within a controllable range. --- Jim Zhao Thanks
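To make the magnitudes behind this explanation concrete, the following
userspace-only sketch compares the 0.1% min_ratio bump suggested above with a
headroom-based bump in the style of wb_position_ratio(); all numbers are made up.

#include <stdio.h>

int main(void)
{
	/* Hypothetical values, in pages. */
	unsigned long long thresh = 170000;  /* global dirty threshold */
	unsigned long long dirty = 10000;    /* currently dirty pages */
	unsigned long long limit = 170000;   /* hard_dirty_limit() */

	/* Option 1: bump min_ratio to 0.1% of thresh, as suggested above. */
	unsigned long long bump_min_ratio = thresh / 1000;

	/* Option 2: derive the bump from the remaining headroom, as the
	 * non-strictlimit path in wb_position_ratio() does with /8. */
	unsigned long long bump_headroom = (limit - dirty) / 8;

	printf("0.1%% of thresh      : %llu pages\n", bump_min_ratio);
	printf("(limit - dirty) / 8 : %llu pages\n", bump_headroom);
	return 0;
}

With roughly 170k dirtyable pages the 0.1% bump amounts to about 170 pages,
while the headroom/8 bump is around 20000 pages, which is the "more room" Jim
refers to.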
On Fri 08-11-24 11:19:49, Jim Zhao wrote: > > On Wed 23-10-24 18:00:32, Jim Zhao wrote: > > > With the strictlimit flag, wb_thresh acts as a hard limit in > > > balance_dirty_pages() and wb_position_ratio(). When device write > > > operations are inactive, wb_thresh can drop to 0, causing writes to > > > be blocked. The issue occasionally occurs in fuse fs, particularly > > > with network backends, the write thread is blocked frequently during > > > a period. To address it, this patch raises the minimum wb_thresh to a > > > controllable level, similar to the non-strictlimit case. > > > > > > Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> > > > > ... > > > > > + /* > > > + * With strictlimit flag, the wb_thresh is treated as > > > + * a hard limit in balance_dirty_pages() and wb_position_ratio(). > > > + * It's possible that wb_thresh is close to zero, not because > > > + * the device is slow, but because it has been inactive. > > > + * To prevent occasional writes from being blocked, we raise wb_thresh. > > > + */ > > > + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > > > + unsigned long limit = hard_dirty_limit(dom, dtc->thresh); > > > + u64 wb_scale_thresh = 0; > > > + > > > + if (limit > dtc->dirty) > > > + wb_scale_thresh = (limit - dtc->dirty) / 100; > > > + wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh / 4)); > > > + } > > > > What you propose makes sense in principle although I'd say this is mostly a > > userspace setup issue - with strictlimit enabled, you're kind of expected > > to set min_ratio exactly if you want to avoid these startup issues. But I > > tend to agree that we can provide a bit of a slack for a bdi without > > min_ratio configured to ramp up. > > > > But I'd rather pick the logic like: > > > > /* > > * If bdi does not have min_ratio configured and it was inactive, > > * bump its min_ratio to 0.1% to provide it some room to ramp up. > > */ > > if (!wb_min_ratio && !numerator) > > wb_min_ratio = min(BDI_RATIO_SCALE / 10, wb_max_ratio / 2); > > > > That would seem like a bit more systematic way than the formula you propose > > above... > > Thanks for the advice. > Here's the explanation of the formula: > 1. when writes are small and intermittent,wb_thresh can approach 0, not > just 0, making the numerator value difficult to verify. I see, ok. > 2. The ramp-up margin, whether 0.1% or another value, needs > consideration. > I based this on the logic of wb_position_ratio in the non-strictlimit > scenario: wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); It seems > provides more room and ensures ramping up within a controllable range. I see, thanks for explanation. So I was thinking how to make the code more consistent instead of adding another special constant and workaround. What I'd suggest is: 1) There's already code that's supposed to handle ramping up with strictlimit in wb_update_dirty_ratelimit(): /* * For strictlimit case, calculations above were based on wb counters * and limits (starting from pos_ratio = wb_position_ratio() and up to * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). * Hence, to calculate "step" properly, we have to use wb_dirty as * "dirty" and wb_setpoint as "setpoint". * * We rampup dirty_ratelimit forcibly if wb_dirty is low because * it's possible that wb_thresh is close to zero due to inactivity * of backing device. 
*/ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { dirty = dtc->wb_dirty; if (dtc->wb_dirty < 8) setpoint = dtc->wb_dirty + 1; else setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; } Now I agree that increasing wb_thresh directly is more understandable and transparent so I'd just drop this special case. 2) I'd just handle all the bumping of wb_thresh in a single place instead of having is spread over multiple places. So __wb_calc_thresh() could have a code like: wb_thresh = (thresh * (100 * BDI_RATIO_SCALE - bdi_min_ratio)) / (100 * BDI_RATIO_SCALE) wb_thresh *= numerator; wb_thresh = div64_ul(wb_thresh, denominator); wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio); wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE); limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh); /* * It's very possible that wb_thresh is close to 0 not because the * device is slow, but that it has remained inactive for long time. * Honour such devices a reasonable good (hopefully IO efficient) * threshold, so that the occasional writes won't be blocked and active * writes can rampup the threshold quickly. */ if (limit > dtc->dirty) wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE)) wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); and we can drop the bumping from wb_position)_ratio(). This way have the wb_thresh bumping in a single logical place. Since we still limit wb_tresh with max_ratio, untrusted bdis for which max_ratio should be configured (otherwise they can grow amount of dirty pages upto global treshold anyway) are still under control. If we really wanted, we could introduce a different bumping in case of strictlimit, but at this point I don't think it is warranted so I'd leave that as an option if someone comes with a situation where this bumping proves to be too aggressive. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
> On Fri 08-11-24 11:19:49, Jim Zhao wrote: > > > On Wed 23-10-24 18:00:32, Jim Zhao wrote: > > > > With the strictlimit flag, wb_thresh acts as a hard limit in > > > > balance_dirty_pages() and wb_position_ratio(). When device write > > > > operations are inactive, wb_thresh can drop to 0, causing writes to > > > > be blocked. The issue occasionally occurs in fuse fs, particularly > > > > with network backends, the write thread is blocked frequently during > > > > a period. To address it, this patch raises the minimum wb_thresh to a > > > > controllable level, similar to the non-strictlimit case. > > > > > > > > Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> > > > > > > ... > > > > > > > + /* > > > > + * With strictlimit flag, the wb_thresh is treated as > > > > + * a hard limit in balance_dirty_pages() and wb_position_ratio(). > > > > + * It's possible that wb_thresh is close to zero, not because > > > > + * the device is slow, but because it has been inactive. > > > > + * To prevent occasional writes from being blocked, we raise wb_thresh. > > > > + */ > > > > + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > > > > + unsigned long limit = hard_dirty_limit(dom, dtc->thresh); > > > > + u64 wb_scale_thresh = 0; > > > > + > > > > + if (limit > dtc->dirty) > > > > + wb_scale_thresh = (limit - dtc->dirty) / 100; > > > > + wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh / 4)); > > > > + } > > > > > > What you propose makes sense in principle although I'd say this is mostly a > > > userspace setup issue - with strictlimit enabled, you're kind of expected > > > to set min_ratio exactly if you want to avoid these startup issues. But I > > > tend to agree that we can provide a bit of a slack for a bdi without > > > min_ratio configured to ramp up. > > > > > > But I'd rather pick the logic like: > > > > > > /* > > > * If bdi does not have min_ratio configured and it was inactive, > > > * bump its min_ratio to 0.1% to provide it some room to ramp up. > > > */ > > > if (!wb_min_ratio && !numerator) > > > wb_min_ratio = min(BDI_RATIO_SCALE / 10, wb_max_ratio / 2); > > > > > > That would seem like a bit more systematic way than the formula you propose > > > above... > > > > Thanks for the advice. > > Here's the explanation of the formula: > > 1. when writes are small and intermittent,wb_thresh can approach 0, not > > just 0, making the numerator value difficult to verify. > > I see, ok. > > > 2. The ramp-up margin, whether 0.1% or another value, needs > > consideration. > > I based this on the logic of wb_position_ratio in the non-strictlimit > > scenario: wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); It seems > > provides more room and ensures ramping up within a controllable range. > > I see, thanks for explanation. So I was thinking how to make the code more > consistent instead of adding another special constant and workaround. What > I'd suggest is: > > 1) There's already code that's supposed to handle ramping up with > strictlimit in wb_update_dirty_ratelimit(): > > /* > * For strictlimit case, calculations above were based on wb counters > * and limits (starting from pos_ratio = wb_position_ratio() and up to > * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). > * Hence, to calculate "step" properly, we have to use wb_dirty as > * "dirty" and wb_setpoint as "setpoint". > * > * We rampup dirty_ratelimit forcibly if wb_dirty is low because > * it's possible that wb_thresh is close to zero due to inactivity > * of backing device. 
> */ > if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > dirty = dtc->wb_dirty; > if (dtc->wb_dirty < 8) > setpoint = dtc->wb_dirty + 1; > else > setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; > } > > Now I agree that increasing wb_thresh directly is more understandable and > transparent so I'd just drop this special case. yes, I agree. > 2) I'd just handle all the bumping of wb_thresh in a single place instead > of having is spread over multiple places. So __wb_calc_thresh() could have > a code like: > > wb_thresh = (thresh * (100 * BDI_RATIO_SCALE - bdi_min_ratio)) / (100 * BDI_RATIO_SCALE) > wb_thresh *= numerator; > wb_thresh = div64_ul(wb_thresh, denominator); > > wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio); > > wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE); > limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh); > /* > * It's very possible that wb_thresh is close to 0 not because the > * device is slow, but that it has remained inactive for long time. > * Honour such devices a reasonable good (hopefully IO efficient) > * threshold, so that the occasional writes won't be blocked and active > * writes can rampup the threshold quickly. > */ > if (limit > dtc->dirty) > wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); > if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE)) > wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); > > and we can drop the bumping from wb_position)_ratio(). This way have the > wb_thresh bumping in a single logical place. Since we still limit wb_tresh > with max_ratio, untrusted bdis for which max_ratio should be configured > (otherwise they can grow amount of dirty pages upto global treshold anyway) > are still under control. > > If we really wanted, we could introduce a different bumping in case of > strictlimit, but at this point I don't think it is warranted so I'd leave > that as an option if someone comes with a situation where this bumping > proves to be too aggressive. Thank you, this is very helpful. And I have 2 concerns: 1. In the current non-strictlimit logic, wb_thresh is only bumped within wb_position_ratio() for calculating pos_ratio, and this bump isn’t restricted by max_ratio. I’m unsure if moving this adjustment to __wb_calc_thresh() would effect existing behavior. Would it be possible to keep the current logic for non-strictlimit case? 2. Regarding the formula: wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); Consider a case: With 100 fuse devices(with high max_ratio) experiencing high writeback delays, the pages being written back are accounted in NR_WRITEBACK_TEMP, not dtc->dirty. As a result, the bumped wb_thresh may remain high. While individual devices are under control, the total could exceed expectations. Although lowering the max_ratio can avoid this issue, how about reducing the bumped wb_thresh? The formula in my patch: wb_scale_thresh = (limit - dtc->dirty) / 100; The intention is to use the default fuse max_ratio(1%) as the multiplier. Thanks Jim Zhao
On Tue 12-11-24 16:45:39, Jim Zhao wrote: > > On Fri 08-11-24 11:19:49, Jim Zhao wrote: > > > > On Wed 23-10-24 18:00:32, Jim Zhao wrote: > > > > > With the strictlimit flag, wb_thresh acts as a hard limit in > > > > > balance_dirty_pages() and wb_position_ratio(). When device write > > > > > operations are inactive, wb_thresh can drop to 0, causing writes to > > > > > be blocked. The issue occasionally occurs in fuse fs, particularly > > > > > with network backends, the write thread is blocked frequently during > > > > > a period. To address it, this patch raises the minimum wb_thresh to a > > > > > controllable level, similar to the non-strictlimit case. > > > > > > > > > > Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> > > > > > > > > ... > > > > > > > > > + /* > > > > > + * With strictlimit flag, the wb_thresh is treated as > > > > > + * a hard limit in balance_dirty_pages() and wb_position_ratio(). > > > > > + * It's possible that wb_thresh is close to zero, not because > > > > > + * the device is slow, but because it has been inactive. > > > > > + * To prevent occasional writes from being blocked, we raise wb_thresh. > > > > > + */ > > > > > + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > > > > > + unsigned long limit = hard_dirty_limit(dom, dtc->thresh); > > > > > + u64 wb_scale_thresh = 0; > > > > > + > > > > > + if (limit > dtc->dirty) > > > > > + wb_scale_thresh = (limit - dtc->dirty) / 100; > > > > > + wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh / 4)); > > > > > + } > > > > > > > > What you propose makes sense in principle although I'd say this is mostly a > > > > userspace setup issue - with strictlimit enabled, you're kind of expected > > > > to set min_ratio exactly if you want to avoid these startup issues. But I > > > > tend to agree that we can provide a bit of a slack for a bdi without > > > > min_ratio configured to ramp up. > > > > > > > > But I'd rather pick the logic like: > > > > > > > > /* > > > > * If bdi does not have min_ratio configured and it was inactive, > > > > * bump its min_ratio to 0.1% to provide it some room to ramp up. > > > > */ > > > > if (!wb_min_ratio && !numerator) > > > > wb_min_ratio = min(BDI_RATIO_SCALE / 10, wb_max_ratio / 2); > > > > > > > > That would seem like a bit more systematic way than the formula you propose > > > > above... > > > > > > Thanks for the advice. > > > Here's the explanation of the formula: > > > 1. when writes are small and intermittent,wb_thresh can approach 0, not > > > just 0, making the numerator value difficult to verify. > > > > I see, ok. > > > > > 2. The ramp-up margin, whether 0.1% or another value, needs > > > consideration. > > > I based this on the logic of wb_position_ratio in the non-strictlimit > > > scenario: wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); It seems > > > provides more room and ensures ramping up within a controllable range. > > > > I see, thanks for explanation. So I was thinking how to make the code more > > consistent instead of adding another special constant and workaround. What > > I'd suggest is: > > > > 1) There's already code that's supposed to handle ramping up with > > strictlimit in wb_update_dirty_ratelimit(): > > > > /* > > * For strictlimit case, calculations above were based on wb counters > > * and limits (starting from pos_ratio = wb_position_ratio() and up to > > * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). 
> > * Hence, to calculate "step" properly, we have to use wb_dirty as > > * "dirty" and wb_setpoint as "setpoint". > > * > > * We rampup dirty_ratelimit forcibly if wb_dirty is low because > > * it's possible that wb_thresh is close to zero due to inactivity > > * of backing device. > > */ > > if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > > dirty = dtc->wb_dirty; > > if (dtc->wb_dirty < 8) > > setpoint = dtc->wb_dirty + 1; > > else > > setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; > > } > > > > Now I agree that increasing wb_thresh directly is more understandable and > > transparent so I'd just drop this special case. > > yes, I agree. > > > 2) I'd just handle all the bumping of wb_thresh in a single place instead > > of having is spread over multiple places. So __wb_calc_thresh() could have > > a code like: > > > > wb_thresh = (thresh * (100 * BDI_RATIO_SCALE - bdi_min_ratio)) / (100 * BDI_RATIO_SCALE) > > wb_thresh *= numerator; > > wb_thresh = div64_ul(wb_thresh, denominator); > > > > wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio); > > > > wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE); > > limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh); > > /* > > * It's very possible that wb_thresh is close to 0 not because the > > * device is slow, but that it has remained inactive for long time. > > * Honour such devices a reasonable good (hopefully IO efficient) > > * threshold, so that the occasional writes won't be blocked and active > > * writes can rampup the threshold quickly. > > */ > > if (limit > dtc->dirty) > > wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); > > if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE)) > > wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); > > > > and we can drop the bumping from wb_position)_ratio(). This way have the > > wb_thresh bumping in a single logical place. Since we still limit wb_tresh > > with max_ratio, untrusted bdis for which max_ratio should be configured > > (otherwise they can grow amount of dirty pages upto global treshold anyway) > > are still under control. > > > > If we really wanted, we could introduce a different bumping in case of > > strictlimit, but at this point I don't think it is warranted so I'd leave > > that as an option if someone comes with a situation where this bumping > > proves to be too aggressive. > > Thank you, this is very helpful. And I have 2 concerns: > > 1. > In the current non-strictlimit logic, wb_thresh is only bumped within > wb_position_ratio() for calculating pos_ratio, and this bump isn’t > restricted by max_ratio. I’m unsure if moving this adjustment to > __wb_calc_thresh() would effect existing behavior. Would it be possible > to keep the current logic for non-strictlimit case? You are correct that current bumping is not affected by max_ratio and that is actually a bug. wb_thresh should never exceed what is corresponding to the configured max_ratio. Furthermore in practical configurations I don't think the max_ratio limiting will actually make a big difference because bumping should happen when wb_thresh is really low. So for consistency I would apply it also to the non-strictlimit case. > 2. Regarding the formula: > wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); > > Consider a case: > With 100 fuse devices(with high max_ratio) experiencing high writeback > delays, the pages being written back are accounted in NR_WRITEBACK_TEMP, > not dtc->dirty. As a result, the bumped wb_thresh may remain high. 
While > individual devices are under control, the total could exceed > expectations. I agree but this is a potential problem with any kind of bumping based on 'limit - dtc->dirty'. It is just a matter of how many fuse devices you have and how exactly you have max_ratio configured. > Although lowering the max_ratio can avoid this issue, how about reducing > the bumped wb_thresh? > > The formula in my patch: > wb_scale_thresh = (limit - dtc->dirty) / 100; > The intention is to use the default fuse max_ratio(1%) as the multiplier. So basically you propose to use the "/ 8" factor for the normal case and "/ 100" factor for the strictlimit case. My position is that I would not complicate the logic unless somebody comes with a real world setup where the simpler logic is causing real problems. But if you feel strongly about this, I'm fine with that option. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
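To put rough numbers on the many-devices concern and on the choice between the
/8 and /100 divisors discussed above, here is another hypothetical,
userspace-only sketch; the device count and page counts are invented.

#include <stdio.h>

int main(void)
{
	/* Hypothetical values, in pages. */
	unsigned long long limit = 800000; /* hard_dirty_limit() */
	unsigned long long dirty = 50000;  /* pages accounted in dtc->dirty */
	int nr_devices = 100;              /* fuse bdis with high max_ratio */

	unsigned long long per_dev_8 = (limit - dirty) / 8;
	unsigned long long per_dev_100 = (limit - dirty) / 100;

	printf("/8  : %llu pages per device, %llu total\n",
	       per_dev_8, per_dev_8 * nr_devices);
	printf("/100: %llu pages per device, %llu total\n",
	       per_dev_100, per_dev_100 * nr_devices);
	return 0;
}

With 100 bumped devices the aggregate floor differs by a factor of 12.5 between
the two divisors, which is the trade-off Jan leaves open above.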
With the strictlimit flag, wb_thresh acts as a hard limit in
balance_dirty_pages() and wb_position_ratio(). When device write
operations are inactive, wb_thresh can drop to 0, causing writes to be
blocked. The issue occasionally occurs in fuse filesystems, particularly with
network backends, where the write thread is blocked frequently for a period
of time. To address it, this patch raises the minimum wb_thresh to a
controllable level, similar to the non-strictlimit case.
Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com>
---
Changes in v2:
1. Consolidate all wb_thresh bumping logic in __wb_calc_thresh() for consistency;
2. Replace the limit variable with thresh when calculating the bump value,
as __wb_calc_thresh() is also used to calculate the background threshold;
3. Call domain_dirty_avail() in wb_calc_thresh() to obtain dtc->dirty.
---
mm/page-writeback.c | 48 ++++++++++++++++++++++-----------------------
1 file changed, 23 insertions(+), 25 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e5a9eb795f99..8b13bcb42de3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -917,7 +917,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc,
unsigned long thresh)
{
struct wb_domain *dom = dtc_dom(dtc);
+ struct bdi_writeback *wb = dtc->wb;
u64 wb_thresh;
+ u64 wb_max_thresh;
unsigned long numerator, denominator;
unsigned long wb_min_ratio, wb_max_ratio;
@@ -931,11 +933,27 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc,
wb_thresh *= numerator;
wb_thresh = div64_ul(wb_thresh, denominator);
- wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);
+ wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE);
- if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE))
- wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE);
+
+ /*
+ * It's very possible that wb_thresh is close to 0 not because the
+ * device is slow, but that it has remained inactive for long time.
+ * Honour such devices a reasonable good (hopefully IO efficient)
+ * threshold, so that the occasional writes won't be blocked and active
+ * writes can rampup the threshold quickly.
+ */
+ if (thresh > dtc->dirty) {
+ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT))
+ wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 100);
+ else
+ wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 8);
+ }
+
+ wb_max_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE);
+ if (wb_thresh > wb_max_thresh)
+ wb_thresh = wb_max_thresh;
return wb_thresh;
}
@@ -944,6 +962,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
{
struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
+ domain_dirty_avail(&gdtc, true);
return __wb_calc_thresh(&gdtc, thresh);
}
@@ -1120,12 +1139,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
long long wb_pos_ratio;
- if (dtc->wb_dirty < 8) {
- dtc->pos_ratio = min_t(long long, pos_ratio * 2,
- 2 << RATELIMIT_CALC_SHIFT);
- return;
- }
-
if (dtc->wb_dirty >= wb_thresh)
return;
@@ -1196,14 +1209,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
*/
if (unlikely(wb_thresh > dtc->thresh))
wb_thresh = dtc->thresh;
- /*
- * It's very possible that wb_thresh is close to 0 not because the
- * device is slow, but that it has remained inactive for long time.
- * Honour such devices a reasonable good (hopefully IO efficient)
- * threshold, so that the occasional writes won't be blocked and active
- * writes can rampup the threshold quickly.
- */
- wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);
/*
* scale global setpoint to wb's:
* wb_setpoint = setpoint * wb_thresh / thresh
@@ -1459,17 +1464,10 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
* balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
* Hence, to calculate "step" properly, we have to use wb_dirty as
* "dirty" and wb_setpoint as "setpoint".
- *
- * We rampup dirty_ratelimit forcibly if wb_dirty is low because
- * it's possible that wb_thresh is close to zero due to inactivity
- * of backing device.
*/
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
dirty = dtc->wb_dirty;
- if (dtc->wb_dirty < 8)
- setpoint = dtc->wb_dirty + 1;
- else
- setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
+ setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
}
if (dirty < setpoint) {
--
2.20.1
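The consolidated v2 logic can be read as a single helper. The sketch below is a
userspace-only restatement of the __wb_calc_thresh() bump above, with a
hypothetical bump_wb_thresh() name and made-up inputs; it is not the kernel
implementation.

#include <stdbool.h>
#include <stdio.h>

static unsigned long long bump_wb_thresh(unsigned long long wb_thresh,
					 unsigned long long thresh,
					 unsigned long long dirty,
					 unsigned long long wb_max_thresh,
					 bool strictlimit)
{
	/* Give an inactive device some headroom to ramp up... */
	if (thresh > dirty) {
		unsigned long long headroom = thresh - dirty;
		unsigned long long floor = strictlimit ? headroom / 100
						       : headroom / 8;
		if (wb_thresh < floor)
			wb_thresh = floor;
	}
	/* ...but never beyond what max_ratio allows. */
	if (wb_thresh > wb_max_thresh)
		wb_thresh = wb_max_thresh;
	return wb_thresh;
}

int main(void)
{
	/* Hypothetical page counts: an inactive strictlimit (fuse-like) wb
	 * with a 1% max_ratio share, and an inactive regular wb. */
	printf("strictlimit: %llu pages\n",
	       bump_wb_thresh(0, 160000, 8000, 1600, true));
	printf("regular    : %llu pages\n",
	       bump_wb_thresh(0, 160000, 8000, 160000, false));
	return 0;
}

With these made-up inputs an idle strictlimit wb is lifted to 1520 pages (capped
by its 1% max_ratio share of 1600) and an idle regular wb to 19000 pages.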
On Tue 19-11-24 19:44:42, Jim Zhao wrote: > With the strictlimit flag, wb_thresh acts as a hard limit in > balance_dirty_pages() and wb_position_ratio(). When device write > operations are inactive, wb_thresh can drop to 0, causing writes to be > blocked. The issue occasionally occurs in fuse fs, particularly with > network backends, the write thread is blocked frequently during a period. > To address it, this patch raises the minimum wb_thresh to a controllable > level, similar to the non-strictlimit case. > > Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> > --- > Changes in v2: > 1. Consolidate all wb_thresh bumping logic in __wb_calc_thresh for consistency; > 2. Replace the limit variable with thresh for calculating the bump value, > as __wb_calc_thresh is also used to calculate the background threshold; > 3. Add domain_dirty_avail in wb_calc_thresh to get dtc->dirty. Since the odd value of BdiDirryThresh got explained (independent cosmetic bug), feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > mm/page-writeback.c | 48 ++++++++++++++++++++++----------------------- > 1 file changed, 23 insertions(+), 25 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index e5a9eb795f99..8b13bcb42de3 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -917,7 +917,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc, > unsigned long thresh) > { > struct wb_domain *dom = dtc_dom(dtc); > + struct bdi_writeback *wb = dtc->wb; > u64 wb_thresh; > + u64 wb_max_thresh; > unsigned long numerator, denominator; > unsigned long wb_min_ratio, wb_max_ratio; > > @@ -931,11 +933,27 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc, > wb_thresh *= numerator; > wb_thresh = div64_ul(wb_thresh, denominator); > > - wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio); > + wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio); > > wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE); > - if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE)) > - wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); > + > + /* > + * It's very possible that wb_thresh is close to 0 not because the > + * device is slow, but that it has remained inactive for long time. > + * Honour such devices a reasonable good (hopefully IO efficient) > + * threshold, so that the occasional writes won't be blocked and active > + * writes can rampup the threshold quickly. 
> + */ > + if (thresh > dtc->dirty) { > + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) > + wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 100); > + else > + wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 8); > + } > + > + wb_max_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); > + if (wb_thresh > wb_max_thresh) > + wb_thresh = wb_max_thresh; > > return wb_thresh; > } > @@ -944,6 +962,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) > { > struct dirty_throttle_control gdtc = { GDTC_INIT(wb) }; > > + domain_dirty_avail(&gdtc, true); > return __wb_calc_thresh(&gdtc, thresh); > } > > @@ -1120,12 +1139,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) > if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > long long wb_pos_ratio; > > - if (dtc->wb_dirty < 8) { > - dtc->pos_ratio = min_t(long long, pos_ratio * 2, > - 2 << RATELIMIT_CALC_SHIFT); > - return; > - } > - > if (dtc->wb_dirty >= wb_thresh) > return; > > @@ -1196,14 +1209,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) > */ > if (unlikely(wb_thresh > dtc->thresh)) > wb_thresh = dtc->thresh; > - /* > - * It's very possible that wb_thresh is close to 0 not because the > - * device is slow, but that it has remained inactive for long time. > - * Honour such devices a reasonable good (hopefully IO efficient) > - * threshold, so that the occasional writes won't be blocked and active > - * writes can rampup the threshold quickly. > - */ > - wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); > /* > * scale global setpoint to wb's: > * wb_setpoint = setpoint * wb_thresh / thresh > @@ -1459,17 +1464,10 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, > * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). > * Hence, to calculate "step" properly, we have to use wb_dirty as > * "dirty" and wb_setpoint as "setpoint". > - * > - * We rampup dirty_ratelimit forcibly if wb_dirty is low because > - * it's possible that wb_thresh is close to zero due to inactivity > - * of backing device. > */ > if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > dirty = dtc->wb_dirty; > - if (dtc->wb_dirty < 8) > - setpoint = dtc->wb_dirty + 1; > - else > - setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; > + setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; > } > > if (dirty < setpoint) { > -- > 2.20.1 > -- Jan Kara <jack@suse.com> SUSE Labs, CR
Thanks, Jan, I just sent patch v2, could you please review it ? And I found the debug info in the bdi stats. The BdiDirtyThresh value may be greater than DirtyThresh, and after applying this patch, the value of BdiDirtyThresh could become even larger. without patch: --- root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats BdiWriteback: 0 kB BdiReclaimable: 96 kB BdiDirtyThresh: 1346824 kB DirtyThresh: 673412 kB BackgroundThresh: 336292 kB BdiDirtied: 19872 kB BdiWritten: 19776 kB BdiWriteBandwidth: 0 kBps b_dirty: 0 b_io: 0 b_more_io: 0 b_dirty_time: 0 bdi_list: 1 state: 1 with patch: --- root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats BdiWriteback: 96 kB BdiReclaimable: 192 kB BdiDirtyThresh: 3090736 kB DirtyThresh: 650716 kB BackgroundThresh: 324960 kB BdiDirtied: 472512 kB BdiWritten: 470592 kB BdiWriteBandwidth: 106268 kBps b_dirty: 2 b_io: 0 b_more_io: 0 b_dirty_time: 0 bdi_list: 1 state: 1 @kemeng, is this a normal behavior or an issue ? Thanks, Jim Zhao > With the strictlimit flag, wb_thresh acts as a hard limit in > balance_dirty_pages() and wb_position_ratio(). When device write > operations are inactive, wb_thresh can drop to 0, causing writes to be > blocked. The issue occasionally occurs in fuse fs, particularly with > network backends, the write thread is blocked frequently during a period. > To address it, this patch raises the minimum wb_thresh to a controllable > level, similar to the non-strictlimit case. > > Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> > --- > Changes in v2: > 1. Consolidate all wb_thresh bumping logic in __wb_calc_thresh for consistency; > 2. Replace the limit variable with thresh for calculating the bump value, > as __wb_calc_thresh is also used to calculate the background threshold; > 3. Add domain_dirty_avail in wb_calc_thresh to get dtc->dirty. > --- > mm/page-writeback.c | 48 ++++++++++++++++++++++----------------------- > 1 file changed, 23 insertions(+), 25 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index e5a9eb795f99..8b13bcb42de3 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -917,7 +917,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc, > unsigned long thresh) > { > struct wb_domain *dom = dtc_dom(dtc); > + struct bdi_writeback *wb = dtc->wb; > u64 wb_thresh; > + u64 wb_max_thresh; > unsigned long numerator, denominator; > unsigned long wb_min_ratio, wb_max_ratio; > > @@ -931,11 +933,27 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc, > wb_thresh *= numerator; > wb_thresh = div64_ul(wb_thresh, denominator); > > - wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio); > + wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio); > > wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE); > - if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE)) > - wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); > + > + /* > + * It's very possible that wb_thresh is close to 0 not because the > + * device is slow, but that it has remained inactive for long time. > + * Honour such devices a reasonable good (hopefully IO efficient) > + * threshold, so that the occasional writes won't be blocked and active > + * writes can rampup the threshold quickly. 
> + */ > + if (thresh > dtc->dirty) { > + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) > + wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 100); > + else > + wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 8); > + } > + > + wb_max_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); > + if (wb_thresh > wb_max_thresh) > + wb_thresh = wb_max_thresh; > > return wb_thresh; > } > @@ -944,6 +962,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) > { > struct dirty_throttle_control gdtc = { GDTC_INIT(wb) }; > > + domain_dirty_avail(&gdtc, true); > return __wb_calc_thresh(&gdtc, thresh); > } > > @@ -1120,12 +1139,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) > if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > long long wb_pos_ratio; > > - if (dtc->wb_dirty < 8) { > - dtc->pos_ratio = min_t(long long, pos_ratio * 2, > - 2 << RATELIMIT_CALC_SHIFT); > - return; > - } > - > if (dtc->wb_dirty >= wb_thresh) > return; > > @@ -1196,14 +1209,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) > */ > if (unlikely(wb_thresh > dtc->thresh)) > wb_thresh = dtc->thresh; > - /* > - * It's very possible that wb_thresh is close to 0 not because the > - * device is slow, but that it has remained inactive for long time. > - * Honour such devices a reasonable good (hopefully IO efficient) > - * threshold, so that the occasional writes won't be blocked and active > - * writes can rampup the threshold quickly. > - */ > - wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); > /* > * scale global setpoint to wb's: > * wb_setpoint = setpoint * wb_thresh / thresh > @@ -1459,17 +1464,10 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, > * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). > * Hence, to calculate "step" properly, we have to use wb_dirty as > * "dirty" and wb_setpoint as "setpoint". > - * > - * We rampup dirty_ratelimit forcibly if wb_dirty is low because > - * it's possible that wb_thresh is close to zero due to inactivity > - * of backing device. > */ > if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > dirty = dtc->wb_dirty; > - if (dtc->wb_dirty < 8) > - setpoint = dtc->wb_dirty + 1; > - else > - setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; > + setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; > } > > if (dirty < setpoint) { > -- > 2.20.1
Hello! On Tue 19-11-24 20:29:22, Jim Zhao wrote: > Thanks, Jan, I just sent patch v2, could you please review it ? Yes, the patch looks good to me. > > And I found the debug info in the bdi stats. > The BdiDirtyThresh value may be greater than DirtyThresh, and after > applying this patch, the value of BdiDirtyThresh could become even > larger. > > without patch: > --- > root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats > BdiWriteback: 0 kB > BdiReclaimable: 96 kB > BdiDirtyThresh: 1346824 kB But this is odd. The machine appears to have around 3GB of memory, doesn't it? I suspect this is caused by multiple cgroup-writeback contexts contributing to BdiDirtyThresh - in fact I think the math in bdi_collect_stats() is wrong as it is adding wb_thresh() calculated based on global dirty_thresh for each cgwb whereas it should be adding wb_thresh() calculated based on per-memcg dirty_thresh... You can have a look at /sys/kernel/debug/bdi/8:0/wb_stats file which should have correct limits as far as I'm reading the code. Honza > DirtyThresh: 673412 kB > BackgroundThresh: 336292 kB > BdiDirtied: 19872 kB > BdiWritten: 19776 kB > BdiWriteBandwidth: 0 kBps > b_dirty: 0 > b_io: 0 > b_more_io: 0 > b_dirty_time: 0 > bdi_list: 1 > state: 1 > > with patch: > --- > root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats > BdiWriteback: 96 kB > BdiReclaimable: 192 kB > BdiDirtyThresh: 3090736 kB > DirtyThresh: 650716 kB > BackgroundThresh: 324960 kB > BdiDirtied: 472512 kB > BdiWritten: 470592 kB > BdiWriteBandwidth: 106268 kBps > b_dirty: 2 > b_io: 0 > b_more_io: 0 > b_dirty_time: 0 > bdi_list: 1 > state: 1 > > > @kemeng, is this a normal behavior or an issue ? > > Thanks, > Jim Zhao > > > > With the strictlimit flag, wb_thresh acts as a hard limit in > > balance_dirty_pages() and wb_position_ratio(). When device write > > operations are inactive, wb_thresh can drop to 0, causing writes to be > > blocked. The issue occasionally occurs in fuse fs, particularly with > > network backends, the write thread is blocked frequently during a period. > > To address it, this patch raises the minimum wb_thresh to a controllable > > level, similar to the non-strictlimit case. > > > > Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> > > --- > > Changes in v2: > > 1. Consolidate all wb_thresh bumping logic in __wb_calc_thresh for consistency; > > 2. Replace the limit variable with thresh for calculating the bump value, > > as __wb_calc_thresh is also used to calculate the background threshold; > > 3. Add domain_dirty_avail in wb_calc_thresh to get dtc->dirty. 
> > --- > > mm/page-writeback.c | 48 ++++++++++++++++++++++----------------------- > > 1 file changed, 23 insertions(+), 25 deletions(-) > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > index e5a9eb795f99..8b13bcb42de3 100644 > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -917,7 +917,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc, > > unsigned long thresh) > > { > > struct wb_domain *dom = dtc_dom(dtc); > > + struct bdi_writeback *wb = dtc->wb; > > u64 wb_thresh; > > + u64 wb_max_thresh; > > unsigned long numerator, denominator; > > unsigned long wb_min_ratio, wb_max_ratio; > > > > @@ -931,11 +933,27 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc, > > wb_thresh *= numerator; > > wb_thresh = div64_ul(wb_thresh, denominator); > > > > - wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio); > > + wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio); > > > > wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE); > > - if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE)) > > - wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); > > + > > + /* > > + * It's very possible that wb_thresh is close to 0 not because the > > + * device is slow, but that it has remained inactive for long time. > > + * Honour such devices a reasonable good (hopefully IO efficient) > > + * threshold, so that the occasional writes won't be blocked and active > > + * writes can rampup the threshold quickly. > > + */ > > + if (thresh > dtc->dirty) { > > + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) > > + wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 100); > > + else > > + wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 8); > > + } > > + > > + wb_max_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); > > + if (wb_thresh > wb_max_thresh) > > + wb_thresh = wb_max_thresh; > > > > return wb_thresh; > > } > > @@ -944,6 +962,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) > > { > > struct dirty_throttle_control gdtc = { GDTC_INIT(wb) }; > > > > + domain_dirty_avail(&gdtc, true); > > return __wb_calc_thresh(&gdtc, thresh); > > } > > > > @@ -1120,12 +1139,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) > > if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > > long long wb_pos_ratio; > > > > - if (dtc->wb_dirty < 8) { > > - dtc->pos_ratio = min_t(long long, pos_ratio * 2, > > - 2 << RATELIMIT_CALC_SHIFT); > > - return; > > - } > > - > > if (dtc->wb_dirty >= wb_thresh) > > return; > > > > @@ -1196,14 +1209,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) > > */ > > if (unlikely(wb_thresh > dtc->thresh)) > > wb_thresh = dtc->thresh; > > - /* > > - * It's very possible that wb_thresh is close to 0 not because the > > - * device is slow, but that it has remained inactive for long time. > > - * Honour such devices a reasonable good (hopefully IO efficient) > > - * threshold, so that the occasional writes won't be blocked and active > > - * writes can rampup the threshold quickly. > > - */ > > - wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); > > /* > > * scale global setpoint to wb's: > > * wb_setpoint = setpoint * wb_thresh / thresh > > @@ -1459,17 +1464,10 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, > > * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). 
> > * Hence, to calculate "step" properly, we have to use wb_dirty as > > * "dirty" and wb_setpoint as "setpoint". > > - * > > - * We rampup dirty_ratelimit forcibly if wb_dirty is low because > > - * it's possible that wb_thresh is close to zero due to inactivity > > - * of backing device. > > */ > > if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { > > dirty = dtc->wb_dirty; > > - if (dtc->wb_dirty < 8) > > - setpoint = dtc->wb_dirty + 1; > > - else > > - setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; > > + setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; > > } > > > > if (dirty < setpoint) { > > -- > > 2.20.1 -- Jan Kara <jack@suse.com> SUSE Labs, CR
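To make Jan's point about bdi_collect_stats() concrete, the following
illustrative-only snippet reuses the "without patch" numbers quoted above; the
assumption that exactly two cgroup writeback contexts each contribute close to
the full global threshold is mine, not something stated in the thread.

#include <stdio.h>

int main(void)
{
	/* Numbers from the "without patch" stats quoted above, in kB. */
	unsigned long long dirty_thresh = 673412;            /* DirtyThresh */
	/* Assumption: two cgwbs, each evaluated against the full global
	 * threshold because max_ratio is left at its default. */
	unsigned long long wb_thresh[] = { 673412, 673412 };
	unsigned long long bdi_dirty_thresh = 0;

	for (int i = 0; i < 2; i++)
		bdi_dirty_thresh += wb_thresh[i];

	printf("BdiDirtyThresh %llu kB vs DirtyThresh %llu kB\n",
	       bdi_dirty_thresh, dirty_thresh);
	return 0;
}

Summing 2 x ~673412 kB reproduces the reported BdiDirtyThresh of 1346824 kB even
though each individual wb_thresh is still bounded by DirtyThresh.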
> On Tue 19-11-24 20:29:22, Jim Zhao wrote:
> > Thanks, Jan, I just sent patch v2, could you please review it ?
>
> Yes, the patch looks good to me.
>
> > And I found the debug info in the bdi stats.
> > The BdiDirtyThresh value may be greater than DirtyThresh, and after
> > applying this patch, the value of BdiDirtyThresh could become even
> > larger.
> >
> > without patch:
> > ---
> > root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats
> > BdiWriteback: 0 kB
> > BdiReclaimable: 96 kB
> > BdiDirtyThresh: 1346824 kB
>
> But this is odd. The machine appears to have around 3GB of memory, doesn't
> it? I suspect this is caused by multiple cgroup-writeback contexts
> contributing to BdiDirtyThresh - in fact I think the math in
> bdi_collect_stats() is wrong as it is adding wb_thresh() calculated based
> on global dirty_thresh for each cgwb whereas it should be adding
> wb_thresh() calculated based on per-memcg dirty_thresh... You can have a
> look at /sys/kernel/debug/bdi/8:0/wb_stats file which should have correct
> limits as far as I'm reading the code.

Thanks for the review! Yes, it should be caused by multiple cgroup-writeback
contexts together with the bdi_collect_stats() issue.

@Andrew, I sent patch v2 according to Jan's suggestion. Since patch v1 is
already in the tree, I sent out the diff of v1 -> v2:
https://lore.kernel.org/all/20241121100539.605818-1-jimzhao.ai@gmail.com/
Could you please review it, thanks!

Jim Zhao
on 11/19/2024 8:29 PM, Jim Zhao wrote: > Thanks, Jan, I just sent patch v2, could you please review it ? > > And I found the debug info in the bdi stats. > The BdiDirtyThresh value may be greater than DirtyThresh, and after applying this patch, the value of BdiDirtyThresh could become even larger. > > without patch: > --- > root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats > BdiWriteback: 0 kB > BdiReclaimable: 96 kB > BdiDirtyThresh: 1346824 kB > DirtyThresh: 673412 kB > BackgroundThresh: 336292 kB > BdiDirtied: 19872 kB > BdiWritten: 19776 kB > BdiWriteBandwidth: 0 kBps > b_dirty: 0 > b_io: 0 > b_more_io: 0 > b_dirty_time: 0 > bdi_list: 1 > state: 1 > > with patch: > --- > root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats > BdiWriteback: 96 kB > BdiReclaimable: 192 kB > BdiDirtyThresh: 3090736 kB > DirtyThresh: 650716 kB > BackgroundThresh: 324960 kB > BdiDirtied: 472512 kB > BdiWritten: 470592 kB > BdiWriteBandwidth: 106268 kBps > b_dirty: 2 > b_io: 0 > b_more_io: 0 > b_dirty_time: 0 > bdi_list: 1 > state: 1 > > > @kemeng, is this a normal behavior or an issue ? Hello, this is not a normal behavior, could you aslo send the content in wb_stats and configuired bdi_min_ratio. I think the improper use of bdi_min_ratio may cause the issue. Thanks, Kemeng > > Thanks, > Jim Zhao > > >> With the strictlimit flag, wb_thresh acts as a hard limit in >> balance_dirty_pages() and wb_position_ratio(). When device write >> operations are inactive, wb_thresh can drop to 0, causing writes to be >> blocked. The issue occasionally occurs in fuse fs, particularly with >> network backends, the write thread is blocked frequently during a period. >> To address it, this patch raises the minimum wb_thresh to a controllable >> level, similar to the non-strictlimit case. >> >> Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> >> --- >> Changes in v2: >> 1. Consolidate all wb_thresh bumping logic in __wb_calc_thresh for consistency; >> 2. Replace the limit variable with thresh for calculating the bump value, >> as __wb_calc_thresh is also used to calculate the background threshold; >> 3. Add domain_dirty_avail in wb_calc_thresh to get dtc->dirty. >> --- >> mm/page-writeback.c | 48 ++++++++++++++++++++++----------------------- >> 1 file changed, 23 insertions(+), 25 deletions(-) >> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c >> index e5a9eb795f99..8b13bcb42de3 100644 >> --- a/mm/page-writeback.c >> +++ b/mm/page-writeback.c >> @@ -917,7 +917,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc, >> unsigned long thresh) >> { >> struct wb_domain *dom = dtc_dom(dtc); >> + struct bdi_writeback *wb = dtc->wb; >> u64 wb_thresh; >> + u64 wb_max_thresh; >> unsigned long numerator, denominator; >> unsigned long wb_min_ratio, wb_max_ratio; >> >> @@ -931,11 +933,27 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc, >> wb_thresh *= numerator; >> wb_thresh = div64_ul(wb_thresh, denominator); >> >> - wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio); >> + wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio); >> >> wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE); >> - if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE)) >> - wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); >> + >> + /* >> + * It's very possible that wb_thresh is close to 0 not because the >> + * device is slow, but that it has remained inactive for long time. 
>> + * Honour such devices a reasonable good (hopefully IO efficient) >> + * threshold, so that the occasional writes won't be blocked and active >> + * writes can rampup the threshold quickly. >> + */ >> + if (thresh > dtc->dirty) { >> + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) >> + wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 100); >> + else >> + wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 8); >> + } >> + >> + wb_max_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE); >> + if (wb_thresh > wb_max_thresh) >> + wb_thresh = wb_max_thresh; >> >> return wb_thresh; >> } >> @@ -944,6 +962,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) >> { >> struct dirty_throttle_control gdtc = { GDTC_INIT(wb) }; >> >> + domain_dirty_avail(&gdtc, true); >> return __wb_calc_thresh(&gdtc, thresh); >> } >> >> @@ -1120,12 +1139,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) >> if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { >> long long wb_pos_ratio; >> >> - if (dtc->wb_dirty < 8) { >> - dtc->pos_ratio = min_t(long long, pos_ratio * 2, >> - 2 << RATELIMIT_CALC_SHIFT); >> - return; >> - } >> - >> if (dtc->wb_dirty >= wb_thresh) >> return; >> >> @@ -1196,14 +1209,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) >> */ >> if (unlikely(wb_thresh > dtc->thresh)) >> wb_thresh = dtc->thresh; >> - /* >> - * It's very possible that wb_thresh is close to 0 not because the >> - * device is slow, but that it has remained inactive for long time. >> - * Honour such devices a reasonable good (hopefully IO efficient) >> - * threshold, so that the occasional writes won't be blocked and active >> - * writes can rampup the threshold quickly. >> - */ >> - wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); >> /* >> * scale global setpoint to wb's: >> * wb_setpoint = setpoint * wb_thresh / thresh >> @@ -1459,17 +1464,10 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, >> * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). >> * Hence, to calculate "step" properly, we have to use wb_dirty as >> * "dirty" and wb_setpoint as "setpoint". >> - * >> - * We rampup dirty_ratelimit forcibly if wb_dirty is low because >> - * it's possible that wb_thresh is close to zero due to inactivity >> - * of backing device. >> */ >> if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { >> dirty = dtc->wb_dirty; >> - if (dtc->wb_dirty < 8) >> - setpoint = dtc->wb_dirty + 1; >> - else >> - setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; >> + setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; >> } >> >> if (dirty < setpoint) { >> -- >> 2.20.1 >
> on 11/19/2024 8:29 PM, Jim Zhao wrote:
> > Thanks, Jan, I just sent patch v2, could you please review it?
> >
> > And I found the debug info in the bdi stats.
> > The BdiDirtyThresh value may be greater than DirtyThresh, and after
> > applying this patch, the value of BdiDirtyThresh could become even larger.
> >
> > without patch:
> > ---
> > root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats
> > BdiWriteback:             0 kB
> > BdiReclaimable:          96 kB
> > BdiDirtyThresh:     1346824 kB
> > DirtyThresh:         673412 kB
> > BackgroundThresh:    336292 kB
> > BdiDirtied:           19872 kB
> > BdiWritten:           19776 kB
> > BdiWriteBandwidth:        0 kBps
> > b_dirty:                  0
> > b_io:                     0
> > b_more_io:                0
> > b_dirty_time:             0
> > bdi_list:                 1
> > state:                    1
> >
> > with patch:
> > ---
> > root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats
> > BdiWriteback:            96 kB
> > BdiReclaimable:         192 kB
> > BdiDirtyThresh:     3090736 kB
> > DirtyThresh:         650716 kB
> > BackgroundThresh:    324960 kB
> > BdiDirtied:          472512 kB
> > BdiWritten:          470592 kB
> > BdiWriteBandwidth:   106268 kBps
> > b_dirty:                  2
> > b_io:                     0
> > b_more_io:                0
> > b_dirty_time:             0
> > bdi_list:                 1
> > state:                    1
> >
> >
> > @kemeng, is this a normal behavior or an issue?
> Hello, this is not a normal behavior, could you also send the content in
> wb_stats and the configured bdi_min_ratio.
> I think the improper use of bdi_min_ratio may cause the issue.

the min_ratio is 0
---
root@ubuntu:/sys/class/bdi/8:0# cat min_bytes
0
root@ubuntu:/sys/class/bdi/8:0# cat min_ratio
0
root@ubuntu:/sys/class/bdi/8:0# cat min_ratio_fine
0

wb_stats:
---
root@ubuntu:/sys/kernel/debug/bdi/8:0# cat stats
BdiWriteback:             0 kB
BdiReclaimable:         480 kB
BdiDirtyThresh:     1664700 kB
DirtyThresh:         554900 kB
BackgroundThresh:    277108 kB
BdiDirtied:           82752 kB
BdiWritten:           82752 kB
BdiWriteBandwidth:   205116 kBps
b_dirty:                  6
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1

root@ubuntu:/sys/kernel/debug/bdi/8:0# cat wb_stats
WbCgIno: 1      WbWriteback: 0 kB  WbReclaimable: 96 kB   WbDirtyThresh: 0 kB       WbDirtied: 33600 kB  WbWritten: 33600 kB  WbWriteBandwidth: 148 kBps     b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 416    WbWriteback: 0 kB  WbReclaimable: 288 kB  WbDirtyThresh: 554836 kB  WbDirtied: 47616 kB  WbWritten: 47424 kB  WbWriteBandwidth: 168 kBps     b_dirty: 1  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 5
WbCgIno: 1319   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 1835   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 29     WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 101752 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 158    WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 101756 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 2498   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 3358   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 3573   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 3659   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 3186   WbWriteback: 0 kB  WbReclaimable: 96 kB   WbDirtyThresh: 554788 kB  WbDirtied: 1056 kB   WbWritten: 1152 kB   WbWriteBandwidth: 152 kBps     b_dirty: 1  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 5
WbCgIno: 3315   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 384 kB    WbWritten: 384 kB    WbWriteBandwidth: 98876 kBps   b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 72     WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 554836 kB  WbDirtied: 96 kB     WbWritten: 192 kB    WbWriteBandwidth: 4 kBps       b_dirty: 1  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 5
WbCgIno: 3616   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 100308 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 4132   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 3401   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 4517   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 4846   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 4982   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 100468 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 5369   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 96 kB     WbWriteBandwidth: 75104 kBps   b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 5627   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 6235   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 6192   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 6500   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 4617   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 6958   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 5670   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 1  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 5
WbCgIno: 5870   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 5025   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 7990   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 91500 kBps   b_dirty: 1  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 5
WbCgIno: 8033   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 2842   WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1
WbCgIno: 11129  WbWriteback: 0 kB  WbReclaimable: 0 kB    WbDirtyThresh: 0 kB       WbDirtied: 0 kB      WbWritten: 0 kB      WbWriteBandwidth: 102400 kBps  b_dirty: 0  b_io: 0  b_more_io: 0  b_dirty_time: 0  state: 1

ubuntu 24.04 desktop + kernel 6.12.0
default cgroups, not configured manually.

---
Thanks
Jim Zhao
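One way to read the numbers above (an observation from the figures themselves, not something stated in the thread): the bdi-level BdiDirtyThresh is roughly the sum of the non-zero per-wb WbDirtyThresh values, which would be consistent with the bdi figure aggregating the per-cgroup writeback contexts rather than being capped by DirtyThresh; the dumps are not taken atomically, so a small drift is expected.

  554836 kB + 554788 kB + 554836 kB = 1664460 kB, versus BdiDirtyThresh = 1664700 kB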
On Wed, 23 Oct 2024 18:00:32 +0800 Jim Zhao <jimzhao.ai@gmail.com> wrote:

> With the strictlimit flag, wb_thresh acts as a hard limit in
> balance_dirty_pages() and wb_position_ratio(). When device write
> operations are inactive, wb_thresh can drop to 0, causing writes to
> be blocked. The issue occasionally occurs in fuse fs, particularly
> with network backends, the write thread is blocked frequently during
> a period. To address it, this patch raises the minimum wb_thresh to a
> controllable level, similar to the non-strictlimit case.

Please tell us more about the userspace-visible effects of this. It
*sounds* like a serious (but occasional) problem, but that is unclear.

And, very much relatedly, do you feel this fix is needed in earlier
(-stable) kernels?
> On Wed, 23 Oct 2024 18:00:32 +0800 Jim Zhao <jimzhao.ai@gmail.com> wrote:
>
> > With the strictlimit flag, wb_thresh acts as a hard limit in
> > balance_dirty_pages() and wb_position_ratio(). When device write
> > operations are inactive, wb_thresh can drop to 0, causing writes to
> > be blocked. The issue occasionally occurs in fuse fs, particularly
> > with network backends, the write thread is blocked frequently during
> > a period. To address it, this patch raises the minimum wb_thresh to a
> > controllable level, similar to the non-strictlimit case.
>
> Please tell us more about the userspace-visible effects of this. It
> *sounds* like a serious (but occasional) problem, but that is unclear.
>
> And, very much relatedly, do you feel this fix is needed in earlier
> (-stable) kernels?

The problem exists in two scenarios:

1. FUSE Write Transition from Inactive to Active

Sometimes, active writes require several pauses to ramp up to the
appropriate wb_thresh. As shown in the trace below, both bdi_setpoint and
task_ratelimit are 0, which means wb_thresh is 0. The dd process pauses
multiple times before reaching a normal state.

dd-1206590 [003] .... 62988.324049: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259360 dirty=454 bdi_setpoint=0 bdi_dirty=32 dirty_ratelimit=18716 task_ratelimit=0 dirtied=32 dirtied_pause=32 paused=0 pause=4 period=4 think=0 cgroup_ino=1
dd-1206590 [003] .... 62988.332063: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259453 dirty=454 bdi_setpoint=0 bdi_dirty=33 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
dd-1206590 [003] .... 62988.340064: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259526 dirty=454 bdi_setpoint=0 bdi_dirty=34 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
dd-1206590 [003] .... 62988.348061: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259531 dirty=489 bdi_setpoint=0 bdi_dirty=35 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
dd-1206590 [003] .... 62988.356063: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259531 dirty=490 bdi_setpoint=0 bdi_dirty=36 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
...

2. FUSE with Unstable Network Backends and Occasional Writes

Not easy to reproduce, but when it occurs in this scenario, it causes the
write thread to experience more pauses and longer durations.

Currently, some code is in place to improve this situation, but it seems
insufficient:

	if (dtc->wb_dirty < 8) {
		...
	}

So the patch raises the minimum wb_thresh so that occasional writes won't
be blocked and active writes can ramp up the threshold quickly.

--
Thanks,
Jim Zhao
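A rough, self-contained model of the stall the trace shows (the helper name, the percentage-based position ratio and the 4 ms cap are simplifications of mine, not kernel code): with wb_thresh at 0, wb_dirty is already at or above the threshold, so the position ratio collapses to 0, the per-task ratelimit follows, and each batch of newly dirtied pages is answered with a full pause.

/*
 * Rough model, not kernel code: a zero position ratio turns every batch
 * of dirtied pages into a maximum-length pause.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_PAUSE_MS	4	/* matches the ~4 ms pauses in the trace */

static long pause_ms(uint64_t dirty_ratelimit, unsigned int pos_ratio_pct,
		     uint64_t pages_dirtied)
{
	/* the per-task ratelimit scales the base ratelimit by the position ratio */
	uint64_t task_ratelimit = dirty_ratelimit * pos_ratio_pct / 100;
	long ms;

	if (task_ratelimit == 0)
		return MAX_PAUSE_MS;	/* nothing is allowed through */

	ms = (long)(pages_dirtied * 1000 / task_ratelimit);
	return ms > MAX_PAUSE_MS ? MAX_PAUSE_MS : ms;
}

int main(void)
{
	/* bdi_setpoint=0 in the trace, i.e. the position ratio is effectively 0 */
	printf("pause with wb_thresh=0: %ld ms\n", pause_ms(18716, 0, 32));
	/* once wb_thresh has ramped up, the same write barely pauses */
	printf("pause after ramp-up:    %ld ms\n", pause_ms(18716, 100, 32));
	return 0;
}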
On Thu, 24 Oct 2024 14:09:54 +0800 Jim Zhao <jimzhao.ai@gmail.com> wrote:

> > On Wed, 23 Oct 2024 18:00:32 +0800 Jim Zhao <jimzhao.ai@gmail.com> wrote:
> >
> > > With the strictlimit flag, wb_thresh acts as a hard limit in
> > > balance_dirty_pages() and wb_position_ratio(). When device write
> > > operations are inactive, wb_thresh can drop to 0, causing writes to
> > > be blocked. The issue occasionally occurs in fuse fs, particularly
> > > with network backends, the write thread is blocked frequently during
> > > a period. To address it, this patch raises the minimum wb_thresh to a
> > > controllable level, similar to the non-strictlimit case.
> >
> > Please tell us more about the userspace-visible effects of this. It
> > *sounds* like a serious (but occasional) problem, but that is unclear.
> >
> > And, very much relatedly, do you feel this fix is needed in earlier
> > (-stable) kernels?
>
> The problem exists in two scenarios:
>
> 1. FUSE Write Transition from Inactive to Active
>
> Sometimes, active writes require several pauses to ramp up to the
> appropriate wb_thresh. As shown in the trace below, both bdi_setpoint and
> task_ratelimit are 0, which means wb_thresh is 0. The dd process pauses
> multiple times before reaching a normal state.
>
> dd-1206590 [003] .... 62988.324049: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259360 dirty=454 bdi_setpoint=0 bdi_dirty=32 dirty_ratelimit=18716 task_ratelimit=0 dirtied=32 dirtied_pause=32 paused=0 pause=4 period=4 think=0 cgroup_ino=1
> dd-1206590 [003] .... 62988.332063: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259453 dirty=454 bdi_setpoint=0 bdi_dirty=33 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
> dd-1206590 [003] .... 62988.340064: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259526 dirty=454 bdi_setpoint=0 bdi_dirty=34 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
> dd-1206590 [003] .... 62988.348061: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259531 dirty=489 bdi_setpoint=0 bdi_dirty=35 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
> dd-1206590 [003] .... 62988.356063: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259531 dirty=490 bdi_setpoint=0 bdi_dirty=36 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
> ...
>
> 2. FUSE with Unstable Network Backends and Occasional Writes
>
> Not easy to reproduce, but when it occurs in this scenario, it causes the
> write thread to experience more pauses and longer durations.

Thanks, but it's still unclear how this impacts our users. How lengthy
are these pauses?
> On Thu, 24 Oct 2024 14:09:54 +0800 Jim Zhao <jimzhao.ai@gmail.com> wrote:
>
> > > On Wed, 23 Oct 2024 18:00:32 +0800 Jim Zhao <jimzhao.ai@gmail.com> wrote:
> > >
> > > > With the strictlimit flag, wb_thresh acts as a hard limit in
> > > > balance_dirty_pages() and wb_position_ratio(). When device write
> > > > operations are inactive, wb_thresh can drop to 0, causing writes to
> > > > be blocked. The issue occasionally occurs in fuse fs, particularly
> > > > with network backends, the write thread is blocked frequently during
> > > > a period. To address it, this patch raises the minimum wb_thresh to a
> > > > controllable level, similar to the non-strictlimit case.
> > >
> > > Please tell us more about the userspace-visible effects of this. It
> > > *sounds* like a serious (but occasional) problem, but that is unclear.
> > >
> > > And, very much relatedly, do you feel this fix is needed in earlier
> > > (-stable) kernels?
> >
> > The problem exists in two scenarios:
> >
> > 1. FUSE Write Transition from Inactive to Active
> >
> > Sometimes, active writes require several pauses to ramp up to the
> > appropriate wb_thresh. As shown in the trace below, both bdi_setpoint and
> > task_ratelimit are 0, which means wb_thresh is 0. The dd process pauses
> > multiple times before reaching a normal state.
> >
> > dd-1206590 [003] .... 62988.324049: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259360 dirty=454 bdi_setpoint=0 bdi_dirty=32 dirty_ratelimit=18716 task_ratelimit=0 dirtied=32 dirtied_pause=32 paused=0 pause=4 period=4 think=0 cgroup_ino=1
> > dd-1206590 [003] .... 62988.332063: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259453 dirty=454 bdi_setpoint=0 bdi_dirty=33 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
> > dd-1206590 [003] .... 62988.340064: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259526 dirty=454 bdi_setpoint=0 bdi_dirty=34 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
> > dd-1206590 [003] .... 62988.348061: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259531 dirty=489 bdi_setpoint=0 bdi_dirty=35 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
> > dd-1206590 [003] .... 62988.356063: balance_dirty_pages: bdi 0:51: limit=295073 setpoint=259531 dirty=490 bdi_setpoint=0 bdi_dirty=36 dirty_ratelimit=18716 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=4 period=4 think=4 cgroup_ino=1
> > ...
> >
> > 2. FUSE with Unstable Network Backends and Occasional Writes
> >
> > Not easy to reproduce, but when it occurs in this scenario, it causes the
> > write thread to experience more pauses and longer durations.
>
> Thanks, but it's still unclear how this impacts our users. How lengthy
> are these pauses?

The length is related to the device's writeback bandwidth. Under normal
bandwidth, each pause may last around 4 ms, repeated several times as
shown in the trace above (5 times). In extreme cases, such as fuse with
unstable network backends, if pauses occur frequently and bandwidth is
low, each pause can exceed 10 ms and the total pause time can accumulate
to seconds.

Thanks,
Jim Zhao
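To put rough numbers on that (an illustrative calculation; the pause count for the degraded case is an assumption, not a measurement from the thread): a normal ramp-up of about 5 pauses at 4 ms costs roughly 20 ms, while a slow backend forcing on the order of 100 pauses of 10 ms or more already accumulates to a second of stall.

  5 x 4 ms = 20 ms        versus        100 x 10 ms = 1000 ms = 1 s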
On Thu, 24 Oct 2024 15:29:19 +0800 Jim Zhao <jimzhao.ai@gmail.com> wrote:

> > > 2. FUSE with Unstable Network Backends and Occasional Writes
> > >
> > > Not easy to reproduce, but when it occurs in this scenario, it causes the
> > > write thread to experience more pauses and longer durations.
> >
> > Thanks, but it's still unclear how this impacts our users. How lengthy
> > are these pauses?
>
> The length is related to the device's writeback bandwidth. Under normal
> bandwidth, each pause may last around 4 ms, repeated several times as
> shown in the trace above (5 times). In extreme cases, such as fuse with
> unstable network backends, if pauses occur frequently and bandwidth is
> low, each pause can exceed 10 ms and the total pause time can accumulate
> to seconds.

Thanks. I'll assume that the userspace impact isn't serious enough to
warrant a backport into -stable kernels.

If you disagree with this, please let me know and send along additional
changelog text which helps others understand why we think our users
will significantly benefit from this change.
> On Thu, 24 Oct 2024 15:29:19 +0800 Jim Zhao <jimzhao.ai@gmail.com> wrote:
>
> > > > 2. FUSE with Unstable Network Backends and Occasional Writes
> > > >
> > > > Not easy to reproduce, but when it occurs in this scenario, it causes
> > > > the write thread to experience more pauses and longer durations.
> > >
> > > Thanks, but it's still unclear how this impacts our users. How lengthy
> > > are these pauses?
> >
> > The length is related to the device's writeback bandwidth. Under normal
> > bandwidth, each pause may last around 4 ms, repeated several times as
> > shown in the trace above (5 times). In extreme cases, such as fuse with
> > unstable network backends, if pauses occur frequently and bandwidth is
> > low, each pause can exceed 10 ms and the total pause time can accumulate
> > to seconds.
>
> Thanks. I'll assume that the userspace impact isn't serious enough to
> warrant a backport into -stable kernels.
>
> If you disagree with this, please let me know and send along additional
> changelog text which helps others understand why we think our users
> will significantly benefit from this change.

It’s acceptable not to backport this to earlier kernels. After additional
testing, under normal conditions the impact on userspace is limited, with
blocking times generally in the millisecond range.

However, I recommend including this patch in the next kernel version. In
cases of low writeback bandwidth and high writeback delay, blocking times
can increase significantly, and this patch helps eliminate unnecessary
blocking in those scenarios.

Thanks.