[PATCH] genirq: Fix software resend lockup and nested resend

Johan Hovold posted 1 patch 2 years, 3 months ago
There is a newer version of this series
kernel/irq/resend.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
[PATCH] genirq: Fix software resend lockup and nested resend
Posted by Johan Hovold 2 years, 3 months ago
The switch to using hlist for managing software resend of interrupts
broke resend in at least two ways:

First, unconditionally adding interrupt descriptors to the resend list
can corrupt the list when the descriptor in question has already been
added. This causes the resend tasklet to loop indefinitely with
interrupts disabled as was recently reported with the Lenovo ThinkPad
X13s after threaded NAPI was disabled in the ath11k WiFi driver. [1]

This bug is easily fixed by restoring the old semantics of
irq_sw_resend() so that it can be called also for descriptors that have
already been marked for resend.

Second, the offending commit also broke software resend of nested
interrupts by simply discarding the code that made sure that such
interrupts are retriggered using the parent interrupt.

Add back the corresponding code that adds the parent descriptor to the
resend list. Note that this bit is untested, but I decided to include it
to avoid having to revert the offending commit and the maple tree
conversion that depends on it.

[1] https://lore.kernel.org/lkml/20230809073432.4193-1-johan+linaro@kernel.org/

Fixes: bc06a9e08742 ("genirq: Use hlist for managing resend handlers")
Cc: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---

Hi Thomas and Marc,

This patch fixes a severe regression in the resend code in 6.5-rc1 that
breaks machines like the Lenovo X13s and which ideally should be
addressed before 6.5 is released tomorrow.

I hesitated about including the fix for nested interrupts as I've not
had time to test this bit, but I ultimately decided to include it to
avoid having to suggest a revert of the maple tree conversion. Let me
know if you prefer to go this route and I'll post a (prepared) revert
series instead.

Johan


 kernel/irq/resend.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
index edec335c0a7a..5f2c66860ac6 100644
--- a/kernel/irq/resend.c
+++ b/kernel/irq/resend.c
@@ -68,11 +68,16 @@ static int irq_sw_resend(struct irq_desc *desc)
 		 */
 		if (!desc->parent_irq)
 			return -EINVAL;
+
+		desc = irq_to_desc(desc->parent_irq);
+		if (!desc)
+			return -EINVAL;
 	}
 
 	/* Add to resend_list and activate the softirq: */
 	raw_spin_lock(&irq_resend_lock);
-	hlist_add_head(&desc->resend_node, &irq_resend_list);
+	if (hlist_unhashed(&desc->resend_node))
+		hlist_add_head(&desc->resend_node, &irq_resend_list);
 	raw_spin_unlock(&irq_resend_lock);
 	tasklet_schedule(&resend_tasklet);
 	return 0;
-- 
2.41.0
Re: [PATCH] genirq: Fix software resend lockup and nested resend
Posted by Marc Zyngier 2 years, 3 months ago
Hi Johan,

On Sat, 26 Aug 2023 16:40:04 +0100,
Johan Hovold <johan+linaro@kernel.org> wrote:
> 
> The switch to using hlist for managing software resend of interrupts
> broke resend in at least two ways:
> 
> First, unconditionally adding interrupt descriptors to the resend list
> can corrupt the list when the descriptor in question has already been
> added. This causes the resend tasklet to loop indefinitely with
> interrupts disabled as was recently reported with the Lenovo ThinkPad
> X13s after threaded NAPI was disabled in the ath11k WiFi driver. [1]

Gah, of course. Making the descriptor pending again isn't an
idempotent operation anymore (while setting an already set bit in the
bitmap was). You'll need the right timing, but it looks like you've
managed to achieve that reliably.

>
> This bug is easily fixed by restoring the old semantics of
> irq_sw_resend() so that it can be called also for descriptors that have
> already been marked for resend.
> 
> Second, the offending commit also broke software resend of nested
> interrupts by simply discarding the code that made sure that such
> interrupts are retriggered using the parent interrupt.
> 
> Add back the corresponding code that adds the parent descriptor to the
> resend list. Note that this bit is untested, but I decided to include it
> to avoid having to revert the offending commit and the maple tree
> conversion that depends on it.

Indeed, and that one is not timing related, but 100% broken.

> 
> [1] https://lore.kernel.org/lkml/20230809073432.4193-1-johan+linaro@kernel.org/
> 
> Fixes: bc06a9e08742 ("genirq: Use hlist for managing resend handlers")
> Cc: Shanker Donthineni <sdonthineni@nvidia.com>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
> 
> Hi Thomas and Marc,
> 
> This patch fixes a severe regression in the resend code in 6.5-rc1 that
> breaks machines like the Lenovo X13s and which ideally should be
> addressed before 6.5 is released tomorrow.

This relies on Thomas seeing this email before tomorrow. Unless you
directly reach out to Linus to try and intercept the release.

>
> I hesitated about including the fix for nested interrupts as I've not
> had time to test this bit, but I ultimately decided to include it to
> avoid having to suggest a revert of the maple tree conversion. Let me
> know if you prefer to go this route and I'll post a (prepared) revert
> series instead.

I think it is too late for that revert to happen, and we might as well
plough along and take the fix.

> 
> Johan
> 
> 
>  kernel/irq/resend.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
> index edec335c0a7a..5f2c66860ac6 100644
> --- a/kernel/irq/resend.c
> +++ b/kernel/irq/resend.c
> @@ -68,11 +68,16 @@ static int irq_sw_resend(struct irq_desc *desc)
>  		 */
>  		if (!desc->parent_irq)
>  			return -EINVAL;
> +
> +		desc = irq_to_desc(desc->parent_irq);
> +		if (!desc)
> +			return -EINVAL;
>  	}
>  
>  	/* Add to resend_list and activate the softirq: */
>  	raw_spin_lock(&irq_resend_lock);
> -	hlist_add_head(&desc->resend_node, &irq_resend_list);
> +	if (hlist_unhashed(&desc->resend_node))
> +		hlist_add_head(&desc->resend_node, &irq_resend_list);
>  	raw_spin_unlock(&irq_resend_lock);
>  	tasklet_schedule(&resend_tasklet);
>  	return 0;

FWIW:

Reviewed-by: Marc Zyngier <maz@kernel.org>

	M.

-- 
Without deviation from the norm, progress is not possible.
[tip: irq/urgent] genirq: Fix software resend lockup and nested resend
Posted by tip-bot2 for Johan Hovold 2 years, 3 months ago
The following commit has been merged into the irq/urgent branch of tip:

Commit-ID:     9f5deb551655a4cff04b21ecffdcdab75112da3a
Gitweb:        https://git.kernel.org/tip/9f5deb551655a4cff04b21ecffdcdab75112da3a
Author:        Johan Hovold <johan+linaro@kernel.org>
AuthorDate:    Sat, 26 Aug 2023 17:40:04 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 26 Aug 2023 19:14:31 +02:00

genirq: Fix software resend lockup and nested resend

The switch to using hlist for managing software resend of interrupts
broke resend in at least two ways:

First, unconditionally adding interrupt descriptors to the resend list can
corrupt the list when the descriptor in question has already been
added. This causes the resend tasklet to loop indefinitely with interrupts
disabled as was recently reported with the Lenovo ThinkPad X13s after
threaded NAPI was disabled in the ath11k WiFi driver.

This bug is easily fixed by restoring the old semantics of irq_sw_resend()
so that it can be called also for descriptors that have already been marked
for resend.

Second, the offending commit also broke software resend of nested
interrupts by simply discarding the code that made sure that such
interrupts are retriggered using the parent interrupt.

Add back the corresponding code that adds the parent descriptor to the
resend list.

Fixes: bc06a9e08742 ("genirq: Use hlist for managing resend handlers")
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/lkml/20230809073432.4193-1-johan+linaro@kernel.org/
Link: https://lore.kernel.org/r/20230826154004.1417-1-johan+linaro@kernel.org

---
 kernel/irq/resend.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
index edec335..5f2c668 100644
--- a/kernel/irq/resend.c
+++ b/kernel/irq/resend.c
@@ -68,11 +68,16 @@ static int irq_sw_resend(struct irq_desc *desc)
 		 */
 		if (!desc->parent_irq)
 			return -EINVAL;
+
+		desc = irq_to_desc(desc->parent_irq);
+		if (!desc)
+			return -EINVAL;
 	}
 
 	/* Add to resend_list and activate the softirq: */
 	raw_spin_lock(&irq_resend_lock);
-	hlist_add_head(&desc->resend_node, &irq_resend_list);
+	if (hlist_unhashed(&desc->resend_node))
+		hlist_add_head(&desc->resend_node, &irq_resend_list);
 	raw_spin_unlock(&irq_resend_lock);
 	tasklet_schedule(&resend_tasklet);
 	return 0;