[PATCH platform-next v2 2/2] [PATCH platform-next 2/2] platform/mellanox: mlxreg-hotplug: Add support for handling interrupt storm

Ciju Rajan K posted 2 patches 1 week, 1 day ago
Posted by Ciju Rajan K 1 week, 1 day ago
Broken hardware can flood the interrupt handler with false events. For
example, a fan or power supply with a damaged presence pin generates
plugged-in / plugged-out events continuously. As a result, the interrupt
handler consumes a lot of CPU time and keeps raising "UDEV" events to
user space.

This patch adds a mechanism to detect the device causing the interrupt
flood and to mask the interrupt for that specific device, isolating it
from the interrupt handling flow. The criterion is: if a specific
interrupt fires 'N' times within 'T' seconds, the device is considered
broken and its interrupt is masked. The user is notified through an
error message in the log and instructed to replace the broken device.

Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ciju Rajan K <crajank@nvidia.com>
---
 drivers/platform/mellanox/mlxreg-hotplug.c | 32 ++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/mellanox/mlxreg-hotplug.c b/drivers/platform/mellanox/mlxreg-hotplug.c
index d246772aafd6..ae0115ea1fd1 100644
--- a/drivers/platform/mellanox/mlxreg-hotplug.c
+++ b/drivers/platform/mellanox/mlxreg-hotplug.c
@@ -11,6 +11,7 @@
 #include <linux/hwmon-sysfs.h>
 #include <linux/i2c.h>
 #include <linux/interrupt.h>
+#include <linux/jiffies.h>
 #include <linux/module.h>
 #include <linux/platform_data/mlxreg.h>
 #include <linux/platform_device.h>
@@ -30,6 +31,11 @@
 #define MLXREG_HOTPLUG_ATTRS_MAX	128
 #define MLXREG_HOTPLUG_NOT_ASSERT	3
 
+/* Interrupt storm definitions */
+#define MLXREG_HOTPLUG_WM_COUNTER	100
+/* Time window in milliseconds */
+#define MLXREG_HOTPLUG_WM_WINDOW_MS	3000
+
 /**
  * struct mlxreg_hotplug_priv_data - platform private data:
  * @irq: platform device interrupt number;
@@ -366,11 +372,33 @@ mlxreg_hotplug_work_helper(struct mlxreg_hotplug_priv_data *priv,
 	for_each_set_bit(bit, &asserted, 8) {
 		int pos;
 
+		/* Skip already marked storming bit. */
+		if (item->storming_bits & BIT(bit))
+			continue;
+
 		pos = mlxreg_hotplug_item_label_index_get(item->mask, bit);
 		if (pos < 0)
 			goto out;
 
 		data = item->data + pos;
+
+		/* Interrupt storm handling logic. */
+		if (data->wmark_cntr == 0)
+			data->wmark_window = jiffies +
+				msecs_to_jiffies(MLXREG_HOTPLUG_WM_WINDOW_MS);
+
+		if (data->wmark_cntr >= MLXREG_HOTPLUG_WM_COUNTER - 1) {
+			if (time_after(data->wmark_window, jiffies)) {
+				dev_err(priv->dev,
+					"Storming bit %d (label: %s) - interrupt masked permanently. Replace broken HW.",
+					bit, data->label);
+				/* Mark bit as storming. */
+				item->storming_bits |= BIT(bit);
+				continue;
+			}
+			data->wmark_cntr = 0;
+		}
+		data->wmark_cntr++;
 		if (regval & BIT(bit)) {
 			if (item->inversed)
 				mlxreg_hotplug_device_destroy(priv, data, item->kind);
@@ -390,9 +418,9 @@ mlxreg_hotplug_work_helper(struct mlxreg_hotplug_priv_data *priv,
 	if (ret)
 		goto out;
 
-	/* Unmask event. */
+	/* Unmask event, exclude storming bits. */
 	ret = regmap_write(priv->regmap, item->reg + MLXREG_HOTPLUG_MASK_OFF,
-			   item->mask);
+			   item->mask & ~item->storming_bits);
 
  out:
 	if (ret)
-- 
2.47.2
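
The last hunk and the time_after() check can be illustrated outside the kernel. The following is a minimal user-space sketch, not driver code: `unmask_value()` and `tick_after()` are hypothetical helper names, and `tick_after()` only mirrors the idea behind the kernel's time_after() (a signed difference that stays correct across counter wraparound on common platforms).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1UL << (n))

/* Value written back to the mask register: every enabled bit except
 * those already marked as storming. */
static uint32_t unmask_value(uint32_t mask, uint32_t storming_bits)
{
	return mask & ~storming_bits;
}

/* Wrap-safe "a is later than b" for a free-running tick counter
 * (hypothetical stand-in for the kernel's time_after()). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}
```

In the patch, `time_after(data->wmark_window, jiffies)` is true while the end of the window still lies in the future, so reaching the event threshold at that point means the events arrived within the window.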
Re: [PATCH platform-next v2 2/2] [PATCH platform-next 2/2] platform/mellanox: mlxreg-hotplug: Add support for handling interrupt storm
Posted by Ilpo Järvinen 1 week, 1 day ago
On Tue, 23 Sep 2025, Ciju Rajan K wrote:

> Broken hardware can flood the interrupt handler with false events. For
> example, a fan or power supply with a damaged presence pin generates
> plugged-in / plugged-out events continuously. As a result, the interrupt
> handler consumes a lot of CPU time and keeps raising "UDEV" events to
> user space.
> 
> This patch adds a mechanism to detect the device causing the interrupt
> flood and to mask the interrupt for that specific device, isolating it
> from the interrupt handling flow. The criterion is: if a specific
> interrupt fires 'N' times within 'T' seconds, the device is considered
> broken and its interrupt is masked. The user is notified through an
> error message in the log and instructed to replace the broken device.
> 
> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
> Signed-off-by: Ciju Rajan K <crajank@nvidia.com>
> ---
>  drivers/platform/mellanox/mlxreg-hotplug.c | 32 ++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/platform/mellanox/mlxreg-hotplug.c b/drivers/platform/mellanox/mlxreg-hotplug.c
> index d246772aafd6..ae0115ea1fd1 100644
> --- a/drivers/platform/mellanox/mlxreg-hotplug.c
> +++ b/drivers/platform/mellanox/mlxreg-hotplug.c
> @@ -11,6 +11,7 @@
>  #include <linux/hwmon-sysfs.h>
>  #include <linux/i2c.h>
>  #include <linux/interrupt.h>
> +#include <linux/jiffies.h>
>  #include <linux/module.h>
>  #include <linux/platform_data/mlxreg.h>
>  #include <linux/platform_device.h>
> @@ -30,6 +31,11 @@
>  #define MLXREG_HOTPLUG_ATTRS_MAX	128
>  #define MLXREG_HOTPLUG_NOT_ASSERT	3
>  
> +/* Interrupt storm definitions */
> +#define MLXREG_HOTPLUG_WM_COUNTER	100
> +/* Time window in milliseconds */
> +#define MLXREG_HOTPLUG_WM_WINDOW_MS	3000
> +
>  /**
>   * struct mlxreg_hotplug_priv_data - platform private data:
>   * @irq: platform device interrupt number;
> @@ -366,11 +372,33 @@ mlxreg_hotplug_work_helper(struct mlxreg_hotplug_priv_data *priv,
>  	for_each_set_bit(bit, &asserted, 8) {
>  		int pos;
>  
> +		/* Skip already marked storming bit. */
> +		if (item->storming_bits & BIT(bit))
> +			continue;
> +
>  		pos = mlxreg_hotplug_item_label_index_get(item->mask, bit);
>  		if (pos < 0)
>  			goto out;
>  
>  		data = item->data + pos;
> +
> +		/* Interrupt storm handling logic. */
> +		if (data->wmark_cntr == 0)
> +			data->wmark_window = jiffies +
> +				msecs_to_jiffies(MLXREG_HOTPLUG_WM_WINDOW_MS);

Please use braces for multi-line if blocks.
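
For illustration, the braced form could look roughly like the sketch below. It is a user-space stand-in, not the driver code: `fake_jiffies`, `ms_to_ticks()`, `struct wm_state` and `wm_arm_window()` are hypothetical names, and one tick per millisecond is assumed.

```c
#include <assert.h>

#define WM_WINDOW_MS 3000UL

/* Assumption: a fake tick counter advancing at 1 tick per millisecond. */
static unsigned long fake_jiffies;

static unsigned long ms_to_ticks(unsigned long ms)
{
	return ms;
}

struct wm_state {
	unsigned int cntr;	/* events seen in the current window */
	unsigned long window;	/* tick at which the current window ends */
};

/* Start a new window only when the counter is zero. */
static void wm_arm_window(struct wm_state *s)
{
	if (s->cntr == 0) {
		s->window = fake_jiffies +
			    ms_to_ticks(WM_WINDOW_MS);
	}
}
```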

> +
> +		if (data->wmark_cntr >= MLXREG_HOTPLUG_WM_COUNTER - 1) {
> +			if (time_after(data->wmark_window, jiffies)) {
> +				dev_err(priv->dev,
> +					"Storming bit %d (label: %s) - interrupt masked permanently. Replace broken HW.",
> +					bit, data->label);
> +				/* Mark bit as storming. */
> +				item->storming_bits |= BIT(bit);
> +				continue;
> +			}
> +			data->wmark_cntr = 0;
> +		}
> +		data->wmark_cntr++;

I think this should be in an else block to allow recalculation of the time
window when the counter wraps.
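
One reading of this comment, as a user-space simulation rather than driver code: with an unconditional increment the counter goes from 0 straight to 1 after a reset, so the `cntr == 0` branch that starts a new window is never taken again. The names below (`storm_state`, `storm_event()`, `feed()`, `tick_after()`) and the tick-based clock are assumptions made for the sketch; `use_else` switches between the posted flow and the suggested one.

```c
#include <assert.h>
#include <stdbool.h>

#define WM_COUNTER 100		/* events per window treated as a storm */
#define WM_WINDOW  3000UL	/* window length in ticks (assumed 1 tick = 1 ms) */

/* Wrap-safe "a is later than b", mirroring the idea of time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct storm_state {
	unsigned int cntr;	/* events counted in the current window */
	unsigned long window;	/* tick at which the current window ends */
};

/* Process one event at tick 'now'; return true if it is flagged as a storm. */
static bool storm_event(struct storm_state *s, unsigned long now, bool use_else)
{
	if (s->cntr == 0) {
		s->window = now + WM_WINDOW;
	}

	if (s->cntr >= WM_COUNTER - 1) {
		if (tick_after(s->window, now))
			return true;	/* threshold hit inside the window */
		s->cntr = 0;
		if (!use_else)
			s->cntr++;	/* posted flow: increment is unconditional */
	} else {
		s->cntr++;
	}
	return false;
}

/* Feed 'count' events spaced 'step' ticks apart; return the index of the
 * first event flagged as a storm, or -1 if none is flagged. */
static int feed(struct storm_state *s, unsigned long start,
		unsigned long step, int count, bool use_else)
{
	for (int i = 0; i < count; i++) {
		if (storm_event(s, start + i * step, use_else))
			return i;
	}
	return -1;
}
```

Under these assumptions, feeding 100 slow events (100 ticks apart) and then a burst at one event per tick flags the 100th burst event when `use_else` is true. With `use_else` false the burst is not flagged, because the stale window is never recomputed.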

>  		if (regval & BIT(bit)) {
>  			if (item->inversed)
>  				mlxreg_hotplug_device_destroy(priv, data, item->kind);
> @@ -390,9 +418,9 @@ mlxreg_hotplug_work_helper(struct mlxreg_hotplug_priv_data *priv,
>  	if (ret)
>  		goto out;
>  
> -	/* Unmask event. */
> +	/* Unmask event, exclude storming bits. */
>  	ret = regmap_write(priv->regmap, item->reg + MLXREG_HOTPLUG_MASK_OFF,
> -			   item->mask);
> +			   item->mask & ~item->storming_bits);
>  
>   out:
>  	if (ret)
> 

-- 
 i.