Hi,

This series by Shahar implements a grace period delay in the devlink
health reporter, and uses it in the mlx5e driver.

See detailed feature description by Shahar below [1].

Regards,
Tariq

[1]
Currently, the devlink health reporter initiates the grace period
immediately after recovering an error, which blocks further recovery
attempts until the grace period concludes. Since additional errors
are not generally expected during this short interval, any new error
reported during the grace period is not only rejected but also causes
the reporter to enter an error state that requires manual intervention.

This approach poses a problem in scenarios where a single root cause
triggers multiple related errors in quick succession - for example,
a PCI issue affecting multiple hardware queues. Because these errors
are closely related and occur rapidly, it is more effective to handle
them together rather than handling only the first one reported and
blocking any subsequent recovery attempts. Furthermore, setting the
reporter to an error state in this context can be misleading, as these
multiple errors are manifestations of a single underlying issue, making
it unlike the general case where additional errors are not expected
during the grace period.

To resolve this, introduce a configurable grace period delay attribute
to the devlink health reporter. This delay starts when the first error
is recovered and lasts for a user-defined duration. Once this grace
period delay expires, the actual grace period begins. After the grace
period ends, a new reported error will start the same flow again.

Timeline summary:

----|--------|------------------------------/----------------------/--
 error is  error is    grace period delay        grace period
 reported  recovered  (recoveries allowed)   (recoveries blocked)

With grace period delay, create a time window during which recovery
attempts are permitted, allowing all reported errors to be handled
sequentially before the grace period starts. Once the grace period
begins, it prevents any further error recoveries until it ends.

When grace period delay is set to 0, current behavior is preserved.

Shahar Shitrit (5):
  devlink: Move graceful period parameter to reporter ops
  devlink: Move health reporter recovery abort logic to a separate
    function
  devlink: Introduce grace period delay for health reporter
  devlink: Make health reporter grace period delay configurable
  net/mlx5e: Set default grace period delay for TX and RX reporters

 Documentation/netlink/specs/devlink.yaml      |   7 ++
 .../networking/devlink/devlink-health.rst     |   2 +-
 drivers/net/ethernet/amd/pds_core/main.c      |   2 +-
 .../net/ethernet/broadcom/bnxt/bnxt_devlink.c |   2 +-
 .../net/ethernet/huawei/hinic/hinic_devlink.c |  10 +-
 .../net/ethernet/intel/ice/devlink/health.c   |   3 +-
 .../marvell/octeontx2/af/rvu_devlink.c        |  32 +++--
 .../mellanox/mlx5/core/diag/reporter_vnic.c   |   2 +-
 .../mellanox/mlx5/core/en/reporter_rx.c       |  13 ++-
 .../mellanox/mlx5/core/en/reporter_tx.c       |  13 ++-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/health.c  |  41 ++++---
 drivers/net/ethernet/mellanox/mlxsw/core.c    |   2 +-
 drivers/net/ethernet/qlogic/qed/qed_devlink.c |  10 +-
 drivers/net/netdevsim/health.c                |   4 +-
 include/net/devlink.h                         |  15 ++-
 include/uapi/linux/devlink.h                  |   2 +
 net/devlink/health.c                          | 109 +++++++++++++-----
 net/devlink/netlink_gen.c                     |   5 +-
 19 files changed, 191 insertions(+), 85 deletions(-)

base-commit: a96cee9b369ee47b5309311d0d71cb6663b123fc
--
2.31.1
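The timing described above boils down to a simple time comparison. The
following is a minimal, self-contained C sketch of that decision logic;
it is not the code from the series, and all names (grace_period_delay,
window_start, the helper names) and the millisecond bookkeeping are
illustrative assumptions only:

    /* Illustrative sketch only, not the patch: models the timeline above. */
    #include <stdbool.h>
    #include <stdint.h>

    struct reporter_sketch {
            uint64_t grace_period_delay;    /* msec: recoveries still allowed */
            uint64_t grace_period;          /* msec: recoveries blocked */
            uint64_t window_start;          /* msec: when the first error was recovered */
            bool window_active;             /* a delay + grace period window is running */
    };

    /* Called when an error was successfully recovered. */
    static void sketch_note_recovery(struct reporter_sketch *r, uint64_t now_ms)
    {
            if (!r->window_active) {
                    /* The first recovery anchors the delay; later ones do not restart it. */
                    r->window_active = true;
                    r->window_start = now_ms;
            }
    }

    /* Called when a new error is reported: may its recovery proceed? */
    static bool sketch_recovery_allowed(struct reporter_sketch *r, uint64_t now_ms)
    {
            uint64_t elapsed;

            if (!r->window_active)
                    return true;            /* no recent recovery */

            elapsed = now_ms - r->window_start;

            if (elapsed < r->grace_period_delay)
                    return true;            /* delay: related errors may still recover */

            if (elapsed < r->grace_period_delay + r->grace_period)
                    return false;           /* grace period proper: blocked */

            r->window_active = false;       /* both windows elapsed, start over */
            return true;
    }

With grace_period_delay set to 0 the first time check never triggers and
the sketch collapses to the current behavior, matching the last sentence
of the description above.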
On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote:
> Currently, the devlink health reporter initiates the grace period
> immediately after recovering an error, which blocks further recovery
> attempts until the grace period concludes. Since additional errors
> are not generally expected during this short interval, any new error
> reported during the grace period is not only rejected but also causes
> the reporter to enter an error state that requires manual intervention.
>
> This approach poses a problem in scenarios where a single root cause
> triggers multiple related errors in quick succession - for example,
> a PCI issue affecting multiple hardware queues. Because these errors
> are closely related and occur rapidly, it is more effective to handle
> them together rather than handling only the first one reported and
> blocking any subsequent recovery attempts. Furthermore, setting the
> reporter to an error state in this context can be misleading, as these
> multiple errors are manifestations of a single underlying issue, making
> it unlike the general case where additional errors are not expected
> during the grace period.
>
> To resolve this, introduce a configurable grace period delay attribute
> to the devlink health reporter. This delay starts when the first error
> is recovered and lasts for a user-defined duration. Once this grace
> period delay expires, the actual grace period begins. After the grace
> period ends, a new reported error will start the same flow again.
>
> Timeline summary:
>
> ----|--------|------------------------------/----------------------/--
>  error is  error is    grace period delay        grace period
>  reported  recovered  (recoveries allowed)   (recoveries blocked)
>
> With grace period delay, create a time window during which recovery
> attempts are permitted, allowing all reported errors to be handled
> sequentially before the grace period starts. Once the grace period
> begins, it prevents any further error recoveries until it ends.

We are rate limiting recoveries, the "networking solution" to the
problem you're describing would be to introduce a burst size.
Some kind of poor man's token bucket filter.

Could you say more about what designs were considered and why this
one was chosen?
On 19/07/2025 3:47, Jakub Kicinski wrote:
> On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote:
>> Currently, the devlink health reporter initiates the grace period
>> immediately after recovering an error, which blocks further recovery
>> attempts until the grace period concludes. Since additional errors
>> are not generally expected during this short interval, any new error
>> reported during the grace period is not only rejected but also causes
>> the reporter to enter an error state that requires manual intervention.
>>
>> This approach poses a problem in scenarios where a single root cause
>> triggers multiple related errors in quick succession - for example,
>> a PCI issue affecting multiple hardware queues. Because these errors
>> are closely related and occur rapidly, it is more effective to handle
>> them together rather than handling only the first one reported and
>> blocking any subsequent recovery attempts. Furthermore, setting the
>> reporter to an error state in this context can be misleading, as these
>> multiple errors are manifestations of a single underlying issue, making
>> it unlike the general case where additional errors are not expected
>> during the grace period.
>>
>> To resolve this, introduce a configurable grace period delay attribute
>> to the devlink health reporter. This delay starts when the first error
>> is recovered and lasts for a user-defined duration. Once this grace
>> period delay expires, the actual grace period begins. After the grace
>> period ends, a new reported error will start the same flow again.
>>
>> Timeline summary:
>>
>> ----|--------|------------------------------/----------------------/--
>>  error is  error is    grace period delay        grace period
>>  reported  recovered  (recoveries allowed)   (recoveries blocked)
>>
>> With grace period delay, create a time window during which recovery
>> attempts are permitted, allowing all reported errors to be handled
>> sequentially before the grace period starts. Once the grace period
>> begins, it prevents any further error recoveries until it ends.
>
> We are rate limiting recoveries, the "networking solution" to the
> problem you're describing would be to introduce a burst size.
> Some kind of poor man's token bucket filter.
>
> Could you say more about what designs were considered and why this
> one was chosen?
>

Please see below.

If no more comments, I'll add the below to the cover letter and re-spin.

Regards,
Tariq

Design alternatives considered:

1. Recover all queues upon any error:
A brute-force approach that recovers all queues on any error.
While simple, it is overly aggressive and disrupts unaffected queues
unnecessarily. Also, because this is handled entirely within the
driver, it leads to a driver-specific implementation rather than a
generic one.

2. Per-queue reporter:
This design would isolate recovery handling per SQ or RQ, effectively
removing interdependencies between queues. While conceptually clean,
it introduces significant scalability challenges as the number of
queues grows, as well as synchronization challenges across multiple
reporters.

3. Error aggregation with delayed handling:
Errors arriving during the grace period are saved and processed after
it ends. While addressing the issue of related errors whose recovery
is aborted as grace period started, this adds complexity due to
synchronization needs and contradicts the assumption that no errors
should occur during a healthy system’s grace period. Also, this
breaks the important role of grace period in preventing an infinite
loop of immediate error detection following recovery. In such cases
we want to stop.

4. Allowing a fixed burst of errors before starting grace period:
Allows a set number of recoveries before the grace period begins.
However, it also requires limiting the error reporting window.
To keep the design simple, the burst threshold becomes redundant.

The grace period delay design was chosen for its simplicity and
precision in addressing the problem at hand. It effectively captures
the temporal correlation of related errors and aligns with the original
intent of the grace period as a stabilization window where further
errors are unexpected, and if they do occur, they indicate an abnormal
system state.
On Thu, 24 Jul 2025 13:46:08 +0300 Tariq Toukan wrote:
> Design alternatives considered:
>
> 1. Recover all queues upon any error:
> A brute-force approach that recovers all queues on any error.
> While simple, it is overly aggressive and disrupts unaffected queues
> unnecessarily. Also, because this is handled entirely within the
> driver, it leads to a driver-specific implementation rather than a
> generic one.
>
> 2. Per-queue reporter:
> This design would isolate recovery handling per SQ or RQ, effectively
> removing interdependencies between queues. While conceptually clean,
> it introduces significant scalability challenges as the number of
> queues grows, as well as synchronization challenges across multiple
> reporters.
>
> 3. Error aggregation with delayed handling:
> Errors arriving during the grace period are saved and processed after
> it ends. While addressing the issue of related errors whose recovery
> is aborted as grace period started, this adds complexity due to
> synchronization needs and contradicts the assumption that no errors
> should occur during a healthy system’s grace period. Also, this
> breaks the important role of grace period in preventing an infinite
> loop of immediate error detection following recovery. In such cases
> we want to stop.
>
> 4. Allowing a fixed burst of errors before starting grace period:
> Allows a set number of recoveries before the grace period begins.
> However, it also requires limiting the error reporting window.
> To keep the design simple, the burst threshold becomes redundant.

We're talking about burst on order of 100s, right? The implementation
is quite simple, store an array the size of burst in which you can
save recovery timestamps (in a circular fashion). On error, count
how many entries are in the past N msecs.

It's a clear generalization of current scheme which can be thought of
as having an array of size 1 (only one most recent recovery time is
saved).

> The grace period delay design was chosen for its simplicity and
> precision in addressing the problem at hand. It effectively captures
> the temporal correlation of related errors and aligns with the original
> intent of the grace period as a stabilization window where further
> errors are unexpected, and if they do occur, they indicate an abnormal
> system state.

Admittedly part of what I find extremely confusing when thinking about
this API is that the period when recovery is **not** allowed is called
"grace period". Now we add something called "grace period delay", in
some places in the code referred to as "reporter_delay"..

It may be more palatable if we named the first period "error burst
period" and, well, the latter I suppose it's too late to rename..
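For concreteness, the circular timestamp array sketched in the message
above could look roughly like the following C fragment; the structure,
the names and the fixed array size are assumptions made for
illustration, not code from any patch:

    /* Rough illustration of the circular recovery-timestamp array idea. */
    #include <stdbool.h>
    #include <stdint.h>

    #define RECOVERY_HISTORY_MAX 128        /* assumed upper bound on burst size */

    struct recovery_history {
            uint64_t ts_ms[RECOVERY_HISTORY_MAX];   /* recovery timestamps, circular */
            unsigned int head;                      /* next slot to overwrite */
            unsigned int burst;                     /* allowed recoveries per window */
            uint64_t window_ms;                     /* the "past N msecs" window */
    };

    /* On error: count recoveries inside the window; allow if under the burst. */
    static bool burst_recovery_allowed(const struct recovery_history *h,
                                       uint64_t now_ms)
    {
            unsigned int i, recent = 0;

            for (i = 0; i < h->burst; i++)
                    /* slots left at 0 are treated as empty (struct assumed zeroed) */
                    if (h->ts_ms[i] && now_ms - h->ts_ms[i] < h->window_ms)
                            recent++;

            return recent < h->burst;
    }

    /* After a successful recovery: record its timestamp, overwriting the oldest. */
    static void burst_record_recovery(struct recovery_history *h, uint64_t now_ms)
    {
            h->ts_ms[h->head] = now_ms;
            h->head = (h->head + 1) % h->burst;
    }

With burst set to 1 this degenerates to remembering only the most
recent recovery time, i.e. the existing grace_period check, which is
the generalization pointed out above.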
On 25/07/2025 3:10, Jakub Kicinski wrote:
> On Thu, 24 Jul 2025 13:46:08 +0300 Tariq Toukan wrote:
>> Design alternatives considered:
>>
>> 1. Recover all queues upon any error:
>> A brute-force approach that recovers all queues on any error.
>> While simple, it is overly aggressive and disrupts unaffected queues
>> unnecessarily. Also, because this is handled entirely within the
>> driver, it leads to a driver-specific implementation rather than a
>> generic one.
>>
>> 2. Per-queue reporter:
>> This design would isolate recovery handling per SQ or RQ, effectively
>> removing interdependencies between queues. While conceptually clean,
>> it introduces significant scalability challenges as the number of
>> queues grows, as well as synchronization challenges across multiple
>> reporters.
>>
>> 3. Error aggregation with delayed handling:
>> Errors arriving during the grace period are saved and processed after
>> it ends. While addressing the issue of related errors whose recovery
>> is aborted as grace period started, this adds complexity due to
>> synchronization needs and contradicts the assumption that no errors
>> should occur during a healthy system’s grace period. Also, this
>> breaks the important role of grace period in preventing an infinite
>> loop of immediate error detection following recovery. In such cases
>> we want to stop.
>>
>> 4. Allowing a fixed burst of errors before starting grace period:
>> Allows a set number of recoveries before the grace period begins.
>> However, it also requires limiting the error reporting window.
>> To keep the design simple, the burst threshold becomes redundant.
>
> We're talking about burst on order of 100s, right?

It can be, typically up to O(num_cpus).

> The implementation
> is quite simple, store an array the size of burst in which you can
> save recovery timestamps (in a circular fashion). On error, count
> how many entries are in the past N msecs.
>

I get your suggestion. I agree that it's also pretty simple to
implement, and that it tolerates bursts.

However, I think it softens the grace period role too much. It has an
important disadvantage, as it tolerates non-bursts as well. It lacks
the "burstness" distinguishability.

IMO current grace_period has multiple goals, among them:

a. let the auto-recovery mechanism handle errors as long as they are
followed by some long-enough "healthy" intervals.

b. break infinite loop of auto-recoveries, when the "healthy" interval
is not long enough. Raise a flag to mark the need for admin
intervention.

In your proposal, the above doesn't hold.
It won't prevent the infinite auto-recovery loop for a buggy system
that has a constant rate of up to X failures in N msecs.

One can argue that this can be addressed by increasing the
grace_period, i.e. a current system with grace_period=N is intuitively
moved to burst_size=X and grace_period=X*N.

But increasing the grace_period by such a large factor has
over-enforcement and hurts legitimate auto-recoveries.

Again, the main point is, it lacks the ability to properly distinguish
between 1. a "burst" followed by a healthy interval, and 2. a buggy
system with a rate of repeated errors.

> It's a clear generalization of current scheme which can be thought of
> as having an array of size 1 (only one most recent recovery time is
> saved).
>

It is a simple generalization indeed. But I don't agree it's a better
filter.

>> The grace period delay design was chosen for its simplicity and
>> precision in addressing the problem at hand. It effectively captures
>> the temporal correlation of related errors and aligns with the original
>> intent of the grace period as a stabilization window where further
>> errors are unexpected, and if they do occur, they indicate an abnormal
>> system state.
>
> Admittedly part of what I find extremely confusing when thinking about
> this API is that the period when recovery is **not** allowed is called
> "grace period".

Absolutely. We discussed this exact same insight internally.
The existing name is confusing, but we won't propose modifying it at
this point.

> Now we add something called "grace period delay", in
> some places in the code referred to as "reporter_delay"..
>
> It may be more palatable if we named the first period "error burst
> period" and, well, the latter I suppose it's too late to rename..

It can be named after what it achieves (allows handling of more
errors) or what it is (a shift of the grace_period). I'm fine with
both, I don't have a strong preference.

I'd call it grace_period if we didn't have one already :)

Please let me know what name you prefer.
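To put assumed numbers on the distinction drawn above (all figures are
illustrative, not taken from the thread): with the current scheme and
grace_period=5000 msec, a device that hits an error every 1000 msec
recovers once and then trips the grace period, flagging the loop for
the admin. Under a burst scheme with burst_size=8 over the same
5000 msec window, that device stays below the threshold (at most 5
errors per window) and keeps auto-recovering indefinitely; stretching
the window to 8 * 5000 = 40000 msec restores the loop protection, but
then also blocks legitimate recoveries for 40 seconds.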
On Sun, 27 Jul 2025 14:00:11 +0300 Tariq Toukan wrote:
> I get your suggestion. I agree that it's also pretty simple to
> implement, and that it tolerates bursts.
>
> However, I think it softens the grace period role too much. It has an
> important disadvantage, as it tolerates non-bursts as well. It lacks
> the "burstness" distinguishability.
>
> IMO current grace_period has multiple goals, among them:
>
> a. let the auto-recovery mechanism handle errors as long as they are
> followed by some long-enough "healthy" intervals.
>
> b. break infinite loop of auto-recoveries, when the "healthy" interval
> is not long enough. Raise a flag to mark the need for admin
> intervention.
>
> In your proposal, the above doesn't hold.
> It won't prevent the infinite auto-recovery loop for a buggy system
> that has a constant rate of up to X failures in N msecs.
>
> One can argue that this can be addressed by increasing the
> grace_period, i.e. a current system with grace_period=N is intuitively
> moved to burst_size=X and grace_period=X*N.
>
> But increasing the grace_period by such a large factor has
> over-enforcement and hurts legitimate auto-recoveries.
>
> Again, the main point is, it lacks the ability to properly distinguish
> between 1. a "burst" followed by a healthy interval, and 2. a buggy
> system with a rate of repeated errors.

I suspect this is catching some very mlx5-specific recovery loop,
so I defer to your judgment on what's better.

As a user I do not know how to configure this health recovery stuff.
My intuition would be that we just need to lower the recovery rate
to prevent filling up logs etc. and the action of taking the machine
out is really the responsibility of some fleet health monitoring
daemon. I can't think of any other error reporting facility in the
kernel where we'd shut down the recovery completely if the rate is
high..

> > Now we add something called "grace period delay", in
> > some places in the code referred to as "reporter_delay"..
> >
> > It may be more palatable if we named the first period "error burst
> > period" and, well, the latter I suppose it's too late to rename..
>
> It can be named after what it achieves (allows handling of more
> errors) or what it is (a shift of the grace_period). I'm fine with
> both, I don't have a strong preference.

Let's rename to "error burst period", reporter_in_error_burst etc.

> I'd call it grace_period if we didn't have one already :)

Exactly :)