[RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY

Leonardo Bras posted 4 patches 1 year, 11 months ago
[RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Leonardo Bras 1 year, 11 months ago
In threaded IRQs, some IRQ handlers are able to handle many requests in a
single run, but this is only accounted as a single IRQ_HANDLED when
increasing threads_handled.

In order to fix this, introduce IRQ_HANDLED_MANY, so the returned value of
those IRQ handlers can signal that many IRQs got handled in that run.

In scenarios where there is no need to keep track of IRQs handled, convert
it back to IRQ_HANDLED.
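
For illustration, a threaded handler that drains a device queue could then
report the whole batch, e.g. (a sketch with made-up foo_* helpers, not part
of this series):

        /* Hypothetical threaded handler: drains the whole queue in one run */
        static irqreturn_t foo_thread_fn(int irq, void *dev_id)
        {
                struct foo_device *foo = dev_id;
                unsigned int handled = 0;

                while (foo_queue_pending(foo)) {        /* made-up helper */
                        foo_handle_one(foo);            /* made-up helper */
                        handled++;
                }

                /* 0 -> IRQ_NONE, 1 -> IRQ_HANDLED, n > 1 -> IRQ_HANDLED_MANY */
                return IRQ_RETVAL_MANY(handled);
        }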

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/linux/irqreturn.h | 23 ++++++++++++++++++++++-
 kernel/irq/chip.c         | 10 ++++++++--
 kernel/irq/handle.c       |  3 +++
 kernel/irq/manage.c       |  4 ++++
 4 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/include/linux/irqreturn.h b/include/linux/irqreturn.h
index d426c7ad92bfd..204030696676c 100644
--- a/include/linux/irqreturn.h
+++ b/include/linux/irqreturn.h
@@ -7,14 +7,35 @@
  * @IRQ_NONE:		interrupt was not from this device or was not handled
  * @IRQ_HANDLED:	interrupt was handled by this device
  * @IRQ_WAKE_THREAD:	handler requests to wake the handler thread
+ * @IRQ_HANDLED_MANY:	interrupt was handled by this device multiple times;
+ *			must be the only bit set in the low 3 bits, with a
+ *			count > 1 carried in the remaining upper bits.
  */
 enum irqreturn {
 	IRQ_NONE		= (0 << 0),
 	IRQ_HANDLED		= (1 << 0),
 	IRQ_WAKE_THREAD		= (1 << 1),
+	IRQ_HANDLED_MANY	= (1 << 2),
+	IRQ_RETMASK		= IRQ_HANDLED | IRQ_WAKE_THREAD | IRQ_HANDLED_MANY,
 };
 
-typedef enum irqreturn irqreturn_t;
+#define IRQ_HANDLED_MANY_SHIFT (3)
+
+typedef int irqreturn_t;
 #define IRQ_RETVAL(x)	((x) ? IRQ_HANDLED : IRQ_NONE)
+#define IRQ_RETVAL_MANY(x)							\
+({										\
+	__typeof__(x) __x = (x);						\
+	irqreturn_t __ret;							\
+	if (__x == 0)								\
+		__ret = IRQ_NONE;						\
+	else if (__x == 1)							\
+		__ret = IRQ_HANDLED;						\
+	else									\
+		__ret = IRQ_HANDLED_MANY | (__x << IRQ_HANDLED_MANY_SHIFT);	\
+	__ret;									\
+})
+
+#define IRQ_HANDLED_MANY_GET(x)	((x) >> IRQ_HANDLED_MANY_SHIFT)
 
 #endif
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index dc94e0bf2c940..233cf22a5b771 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -482,8 +482,14 @@ void handle_nested_irq(unsigned int irq)
 	raw_spin_unlock_irq(&desc->lock);
 
 	action_ret = IRQ_NONE;
-	for_each_action_of_desc(desc, action)
-		action_ret |= action->thread_fn(action->irq, action->dev_id);
+	for_each_action_of_desc(desc, action) {
+		irqreturn_t ret = action->thread_fn(action->irq, action->dev_id);
+
+		if ((ret & IRQ_RETMASK) == IRQ_HANDLED_MANY)
+			ret = IRQ_HANDLED;
+
+		action_ret |= ret;
+	}
 
 	if (!irq_settings_no_debug(desc))
 		note_interrupt(desc, action_ret);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 9489f93b3db34..bfc6e3e43045a 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -162,6 +162,9 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc)
 			      irq, action->handler))
 			local_irq_disable();
 
+		if ((res & IRQ_RETMASK) == IRQ_HANDLED_MANY)
+			res = IRQ_HANDLED;
+
 		switch (res) {
 		case IRQ_WAKE_THREAD:
 			/*
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5bc609c7b728c..e684c7107ff90 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1192,6 +1192,8 @@ irq_forced_thread_fn(struct irq_desc *desc, struct irqaction *action)
 	ret = action->thread_fn(action->irq, action->dev_id);
 	if (ret == IRQ_HANDLED)
 		irq_add_handled(desc, 1);
+	else if ((ret & IRQ_RETMASK) == IRQ_HANDLED_MANY)
+		irq_add_handled(desc, IRQ_HANDLED_MANY_GET(ret));
 
 	irq_finalize_oneshot(desc, action);
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
@@ -1213,6 +1215,8 @@ static irqreturn_t irq_thread_fn(struct irq_desc *desc,
 	ret = action->thread_fn(action->irq, action->dev_id);
 	if (ret == IRQ_HANDLED)
 		irq_add_handled(desc, 1);
+	else if ((ret & IRQ_RETMASK) == IRQ_HANDLED_MANY)
+		irq_add_handled(desc, IRQ_HANDLED_MANY_GET(ret));
 
 	irq_finalize_oneshot(desc, action);
 	return ret;
-- 
2.43.2
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Thomas Gleixner 1 year, 11 months ago
On Fri, Feb 16 2024 at 04:59, Leonardo Bras wrote:

> In threaded IRQs, some IRQ handlers are able to handle many requests in a
> single run, but this is only accounted as a single IRQ_HANDLED when
> increasing threads_handled.
>
> In order to fix this, introduce IRQ_HANDLED_MANY, so the returned value of
> those IRQ handlers can signal that many IRQs got handled in that run.
>
> In scenarios where there is no need to keep track of IRQs handled, convert
> it back to IRQ_HANDLED.

That's not really workable as you'd have to update tons of drivers just
to deal with that corner case. That's error prone and just extra
complexity all over the place.

This really needs to be solved in the core code.

Thanks,

        tglx
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Thomas Gleixner 1 year, 11 months ago
On Mon, Feb 19 2024 at 10:59, Thomas Gleixner wrote:
> On Fri, Feb 16 2024 at 04:59, Leonardo Bras wrote:
>
>> In threaded IRQs, some IRQ handlers are able to handle many requests in a
>> single run, but this is only accounted as a single IRQ_HANDLED when
>> increasing threads_handled.
>>
>> In order to fix this, introduce IRQ_HANDLED_MANY, so the returned value of
>> those IRQ handlers can signal that many IRQs got handled in that run.
>>
>> In scenarios where there is no need to keep track of IRQs handled, convert
>> it back to IRQ_HANDLED.
>
> That's not really workable as you'd have to update tons of drivers just
> to deal with that corner case. That's error prone and just extra
> complexity all over the place.
>
> This really needs to be solved in the core code.

Something like the uncompiled below should do the trick.

Thanks,

        tglx
---
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -38,7 +38,8 @@ struct pt_regs;
  * @affinity_notify:	context for notification of affinity changes
  * @pending_mask:	pending rebalanced interrupts
  * @threads_oneshot:	bitfield to handle shared oneshot threads
- * @threads_active:	number of irqaction threads currently running
+ * @threads_active:	number of irqaction threads currently activated
+ * @threads_running:	number of irqaction threads currently running
  * @wait_for_threads:	wait queue for sync_irq to wait for threaded handlers
  * @nr_actions:		number of installed actions on this descriptor
  * @no_suspend_depth:	number of irqactions on a irq descriptor with
@@ -80,6 +81,7 @@ struct irq_desc {
 #endif
 	unsigned long		threads_oneshot;
 	atomic_t		threads_active;
+	atomic_t		threads_running;
 	wait_queue_head_t       wait_for_threads;
 #ifdef CONFIG_PM_SLEEP
 	unsigned int		nr_actions;
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1194,9 +1194,11 @@ irq_forced_thread_fn(struct irq_desc *de
 	local_bh_disable();
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
 		local_irq_disable();
+	atomic_inc(&desc->threads_running);
 	ret = action->thread_fn(action->irq, action->dev_id);
 	if (ret == IRQ_HANDLED)
 		atomic_inc(&desc->threads_handled);
+	atomic_dec(&desc->threads_running);
 
 	irq_finalize_oneshot(desc, action);
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -350,6 +350,12 @@ void note_interrupt(struct irq_desc *des
 				desc->threads_handled_last = handled;
 			} else {
 				/*
+				 * Avoid false positives when there is
+				 * actually a thread running.
+				 */
+				if (atomic_read(&desc->threads_running))
+					return;
+				/*
 				 * None of the threaded handlers felt
 				 * responsible for the last interrupt
 				 *
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Leonardo Bras 1 year, 11 months ago
On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
> On Mon, Feb 19 2024 at 10:59, Thomas Gleixner wrote:

Hi Thomas, thanks for reviewing!

> > On Fri, Feb 16 2024 at 04:59, Leonardo Bras wrote:
> >
> >> In threaded IRQs, some IRQ handlers are able to handle many requests in a
> >> single run, but this is only accounted as a single IRQ_HANDLED when
> >> increasing threads_handled.
> >>
> >> In order to fix this, introduce IRQ_HANDLED_MANY, so the returned value of
> >> those IRQ handlers can signal that many IRQs got handled in that run.
> >>
> >> In scenarios where there is no need to keep track of IRQs handled, convert
> >> it back to IRQ_HANDLED.
> >
> > That's not really workable as you'd have to update tons of drivers just
> > to deal with that corner case. That's error prone and just extra
> > complexity all over the place.

I agree, that's a downside of this implementation. 


> >
> > This really needs to be solved in the core code.
> 
> Something like the uncompiled below should do the trick.
> 
> Thanks,
> 
>         tglx
> ---
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -38,7 +38,8 @@ struct pt_regs;
>   * @affinity_notify:	context for notification of affinity changes
>   * @pending_mask:	pending rebalanced interrupts
>   * @threads_oneshot:	bitfield to handle shared oneshot threads
> - * @threads_active:	number of irqaction threads currently running
> + * @threads_active:	number of irqaction threads currently activated
> + * @threads_running:	number of irqaction threads currently running
>   * @wait_for_threads:	wait queue for sync_irq to wait for threaded handlers
>   * @nr_actions:		number of installed actions on this descriptor
>   * @no_suspend_depth:	number of irqactions on a irq descriptor with
> @@ -80,6 +81,7 @@ struct irq_desc {
>  #endif
>  	unsigned long		threads_oneshot;
>  	atomic_t		threads_active;
> +	atomic_t		threads_running;
>  	wait_queue_head_t       wait_for_threads;
>  #ifdef CONFIG_PM_SLEEP
>  	unsigned int		nr_actions;
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1194,9 +1194,11 @@ irq_forced_thread_fn(struct irq_desc *de
>  	local_bh_disable();
>  	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>  		local_irq_disable();
> +	atomic_inc(&desc->threads_running);
>  	ret = action->thread_fn(action->irq, action->dev_id);
>  	if (ret == IRQ_HANDLED)
>  		atomic_inc(&desc->threads_handled);
> +	atomic_dec(&desc->threads_running);
>  
>  	irq_finalize_oneshot(desc, action);
>  	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -350,6 +350,12 @@ void note_interrupt(struct irq_desc *des
>  				desc->threads_handled_last = handled;
>  			} else {
>  				/*
> +				 * Avoid false positives when there is
> +				 * actually a thread running.
> +				 */
> +				if (atomic_read(&desc->threads_running))
> +					return;
> +				/*
>  				 * None of the threaded handlers felt
>  				 * responsible for the last interrupt
>  				 *
> 

I agree the above may be able to solve the issue, but it would make 2 extra 
atomic ops necessary in the thread handling the IRQ, as well as one extra 
atomic operation in note_interrupt(), which could increase latency on this 
IRQ deferring the handler to a thread.

I mean, yes, the cpu running note_interrupt() would probably already have 
exclusiveness for this cacheline, but it further increases cacheline 
bouncing and also adds the mem barriers that incur on atomic operations, 
even if we use an extra bit from threads_handled instead of allocate a new 
field for threads_running. 


On top of that, let's think on a scenario where the threaded handler will 
solve a lot of requests, but not necessarily spend a lot of time doing so.
This allows the thread to run for little time while solving a lot of 
requests.

In this scenario, note_interrupt() could return without incrementing 
irqs_unhandled for those IRQ that happen while the brief thread is running, 
but every other IRQ would cause note_interrupt() to increase 
irqs_unhandled, which would cause the bug to still reproduce.


I understand my suggested change increases irq_return complexity, but it
should not increase much of the overhead in both IRQ deferring and IRQ 
thread handling. Also, since it accounts for every IRQ actually handled, it 
does not matter how long the handler thread runs, it still avoids the bug.

As you previously noted, the main issue in my suggestion is about changing 
drivers' code. But while this change is optional, I wonder how many 
drivers handle IRQs that:
- use edge type interrupts, and
- can trigger really fast, many many times, and
- can run in force-thread mode, and
- have handlers that are able to deal with multiple IRQs per call.

If there aren't many that met this requirement, I could try to tackle them 
as they become a problem.


About solving it directly in generic code, I agree it would be the ideal 
scenario. That's why I suggested that in RFCv1: if the thread handles a 
single IRQ, we reset irqs_unhandled to zero. That's a good pointer on the 
thread being stuck, and TBH I don't think this is too far away from the 
current 100/100k if we consider some of those handlers can handle multiple 
IRQs at once.


What are your thoughts on that?

Thanks!
Leo
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Thomas Gleixner 1 year, 11 months ago
On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
>> >> In scenarios where there is no need to keep track of IRQs handled, convert
>> >> it back to IRQ_HANDLED.
>> >
>> > That's not really workable as you'd have to update tons of drivers just
>> > to deal with that corner case. That's error prone and just extra
>> > complexity all over the place.
>
> I agree, that's a downside of this implementation. 

A serious one which is not really workable. See below.

> I agree the above may be able to solve the issue, but it would make 2 extra 
> atomic ops necessary in the thread handling the IRQ, as well as one extra 
> atomic operation in note_interrupt(), which could increase latency on this 
> IRQ deferring the handler to a thread.
>
> I mean, yes, the cpu running note_interrupt() would probably already have 
> exclusiveness for this cacheline, but it further increases cacheline 
> bouncing and also adds the mem barriers that incur on atomic operations, 
> even if we use an extra bit from threads_handled instead of allocate a new 
> field for threads_running.

I think that's a strawman. Atomic operations can of course be more
expensive than non-atomic ones, but they only start to make a difference
when the cache line is contended. That's not the case here for the
normal operations.

Interrupts and their threads are strictly targeted to a single CPU and
the cache line is already hot and had to be made exclusive because of
other write operations to it.

There is usually no concurrency at all, except for administrative
operations like enable/disable or affinity changes. Those administrative
operations are not high frequency and the resulting cache line bouncing
is unavoidable even without that change. But does it matter in the
larger picture? I don't think so.

> On top of that, let's think on a scenario where the threaded handler will 
> solve a lot of requests, but not necessarily spend a lot of time doing so.
> This allows the thread to run for little time while solving a lot of 
> requests.
>
> In this scenario, note_interrupt() could return without incrementing 
> irqs_unhandled for those IRQ that happen while the brief thread is running, 
> but every other IRQ would cause note_interrupt() to increase 
> irqs_unhandled, which would cause the bug to still reproduce.

In theory yes. Does it happen in practice?

But that exposes a flaw in the actual detection code. The code is
unconditionally accumulating if there is an unhandled interrupt within
100ms after the last unhandled one. IOW, if there is a periodic
unhandled one every 50ms, the interrupt will be shut down after 100000 *
50ms = 5000s ~= 83.3m ~= 1.4h. And it neither cares about the number of
actually handled interrupts.

The spurious detector is really about runaway interrupts which hog a CPU
completely, but the above is not what we want to protect against.

> I understand my suggested change increases irq_return complexity, but it
> should not increase much of the overhead in both IRQ deferring and IRQ 
> thread handling. Also, since it accounts for every IRQ actually handled, it 
> does not matter how long the handler thread runs, it still avoids the
> bug.

I'm not worried about the extra complexity in note_interrupt(). That's
fine and I don't have a strong opinion whether using your patch 2/4 or
the simpler one I suggested.

> As you previously noted, the main issue in my suggestion is about changing 
> drivers' code. But while this change is optional, I wonder how many 
> drivers handle IRQs that:
> - use edge type interrupts, and

It's not up to the driver to decide that. That's a platform property and
the driver does not even know and they should not care because all what
matters for the driver is that it gets the interrupts and does not lose
anything.

> - can trigger really fast, many many times, and

That's a hardware or emulation property and I don't know how many
devices would expose this behaviour.

> - can run in force-thread mode, and

Almost all drivers.

> - have handlers that are able to deal with multiple IRQs per call.

Pretty much every driver which has to deal with queues, FIFOs
etc. That's a lot of them.

> About solving it directly in generic code, I agree it would be the ideal 
> scenario. That's why I suggested that in RFCv1: if the thread handles a 
> single IRQ, we reset irqs_unhandled to zero. That's a good pointer on the 
> thread being stuck, and TBH I don't think this is too far away from the 
> current 100/100k if we consider some of those handlers can handle multiple 
> IRQs at once.

It has the downside that a scenario where there is a spurious interrupt
flood with an occasional one inbetween which is handled by the thread
will not be detected anymore. The accumulation and time period tracking
are there for a reason.

But as I pointed out above the detection logic is flawed due to the
unconditional accumulation. Can you give the uncompiled below a test
ride with your scenario?

Thanks,

        tglx
---
 include/linux/irqdesc.h |    2 ++
 kernel/irq/spurious.c   |   33 +++++++++++++++++++++++++++++----
 2 files changed, 31 insertions(+), 4 deletions(-)

--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -29,6 +29,7 @@ struct pt_regs;
  * @wake_depth:		enable depth, for multiple irq_set_irq_wake() callers
  * @tot_count:		stats field for non-percpu irqs
  * @irq_count:		stats field to detect stalled irqs
+ * @first_unhandled:	Timestamp of the first unhandled interrupt
  * @last_unhandled:	aging timer for unhandled count
  * @irqs_unhandled:	stats field for spurious unhandled interrupts
  * @threads_handled:	stats field for deferred spurious detection of threaded handlers
@@ -64,6 +65,7 @@ struct irq_desc {
 	unsigned int		wake_depth;	/* nested wake enables */
 	unsigned int		tot_count;
 	unsigned int		irq_count;	/* For detecting broken IRQs */
+	unsigned long		first_unhandled;
 	unsigned long		last_unhandled;	/* Aging timer for unhandled count */
 	unsigned int		irqs_unhandled;
 	atomic_t		threads_handled;
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -390,9 +390,14 @@ void note_interrupt(struct irq_desc *des
 		 * working systems
 		 */
 		if (time_after(jiffies, desc->last_unhandled + HZ/10))
-			desc->irqs_unhandled = 1;
-		else
-			desc->irqs_unhandled++;
+			desc->irqs_unhandled = 0;
+
+		if (!desc->irqs_unhandled) {
+			desc->irq_count = 0;
+			desc->first_unhandled = jiffies;
+		}
+
+		desc->irqs_unhandled++;
 		desc->last_unhandled = jiffies;
 	}
 
@@ -411,9 +416,27 @@ void note_interrupt(struct irq_desc *des
 	if (likely(desc->irq_count < 100000))
 		return;
 
-	desc->irq_count = 0;
 	if (unlikely(desc->irqs_unhandled > 99900)) {
 		/*
+		 * Validate that this is actually an interrupt storm, which
+		 * happens to:
+		 *
+		 *  - increment the unhandled count to ~100k within 5 seconds,
+		 *    which means there is a spurious interrupt every 50us
+		 *  - have an unhandled to handled ratio > 2
+		 *
+		 * Otherwise reset the detector and start over. This also
+		 * covers the case where a threaded handler consumes
+		 * several hard interrupts in one go which would otherwise
+		 * accumulate to 99900 over time and trigger a false positive.
+		 */
+		unsigned long td = desc->last_unhandled - desc->first_unhandled;
+		unsigned int handled = desc->irq_count - desc->irqs_unhandled;
+
+		if (td > 5 * HZ || (handled && desc->irqs_unhandled / handled < 3))
+			goto out;
+
+		/*
 		 * The interrupt is stuck
 		 */
 		__report_bad_irq(desc, action_ret);
@@ -428,7 +451,9 @@ void note_interrupt(struct irq_desc *des
 		mod_timer(&poll_spurious_irq_timer,
 			  jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
 	}
+out:
 	desc->irqs_unhandled = 0;
+	desc->irq_count = 0;
 }
 
 bool noirqdebug __read_mostly;
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Leonardo Bras 1 year, 11 months ago
On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote:
> On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> > On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
> >> >> In scenarios where there is no need to keep track of IRQs handled, convert
> >> >> it back to IRQ_HANDLED.
> >> >
> >> > That's not really workable as you'd have to update tons of drivers just
> >> > to deal with that corner case. That's error prone and just extra
> >> > complexity all over the place.
> >
> > I agree, that's a downside of this implementation. 
> 
> A serious one which is not really workable. See below.
> 
> > I agree the above may be able to solve the issue, but it would make 2 extra 
> > atomic ops necessary in the thread handling the IRQ, as well as one extra 
> > atomic operation in note_interrupt(), which could increase latency on this 
> > IRQ deferring the handler to a thread.
> >
> > I mean, yes, the cpu running note_interrupt() would probably already have 
> > exclusiveness for this cacheline, but it further increases cacheline 
> > bouncing and also adds the mem barriers that incur on atomic operations, 
> > even if we use an extra bit from threads_handled instead of allocate a new 
> > field for threads_running.
> 
> I think that's a strawman. Atomic operations can of course be more
> expensive than non-atomic ones, but they only start to make a difference
> when the cache line is contended. That's not the case here for the
> normal operations.
> 
> Interrupts and their threads are strictly targeted to a single CPU and
> the cache line is already hot and had to be made exclusive because of
> other write operations to it.
> 
> There is usually no concurrency at all, except for administrative
> operations like enable/disable or affinity changes. Those administrative
> operations are not high frequency and the resulting cache line bouncing
> is unavoidable even without that change. But does it matter in the
> larger picture? I don't think so.

That's a fair point, but there are some use cases that use CPU Isolation on 
top of PREEMPT_RT in order to reduce interference on a CPU running an RT 
workload.

For those cases, IIRC the handler will run on a different (housekeeping) 
CPU when those IRQs originate on an Isolated CPU, meaning the above 
described cacheline bouncing is expected.


> 
> > On top of that, let's think on a scenario where the threaded handler will 
> > solve a lot of requests, but not necessarily spend a lot of time doing so.
> > This allows the thread to run for little time while solving a lot of 
> > requests.
> >
> > In this scenario, note_interrupt() could return without incrementing 
> > irqs_unhandled for those IRQ that happen while the brief thread is running, 
> > but every other IRQ would cause note_interrupt() to increase 
> > irqs_unhandled, which would cause the bug to still reproduce.
> 
> In theory yes. Does it happen in practice?
> 
> But that exposes a flaw in the actual detection code. The code is
> unconditionally accumulating if there is an unhandled interrupt within
> 100ms after the last unhandled one. IOW, if there is a periodic
> unhandled one every 50ms, the interrupt will be shut down after 100000 *
> 50ms = 5000s ~= 83.3m ~= 1.4h. And it neither cares about the number of
> actually handled interrupts.
> 
> The spurious detector is really about runaway interrupts which hog a CPU
> completely, but the above is not what we want to protect against.

Now it makes a lot more sense to me.
Thanks!

> 
> > I understand my suggested change increases irq_return complexity, but it
> > should not increase much of the overhead in both IRQ deferring and IRQ 
> > thread handling. Also, since it accounts for every IRQ actually handled, it 
> > does not matter how long the handler thread runs, it still avoids the
> > bug.
> 
> I'm not worried about the extra complexity in note_interrupt(). That's
> fine and I don't have a strong opinion whether using your patch 2/4 or
> the simpler one I suggested.
> 
> > As you previously noted, the main issue in my suggestion is about changing 
> > drivers' code. But while this change is optional, I wonder how many 
> > drivers handle IRQs that:
> > - use edge type interrupts, and
> 
> It's not up to the driver to decide that. That's a platform property and
> the driver does not even know and they should not care because all what
> matters for the driver is that it gets the interrupts and does not lose
> anything.
> 
> > - can trigger really fast, many many times, and
> 
> That's a hardware or emulation property and I don't know how many
> devices would expose this behaviour.
> 
> > - can run in force-thread mode, and
> 
> Almost all drivers.
> 
> > - have handlers that are able to deal with multiple IRQs per call.
> 
> Pretty much every driver which has to deal with queues, FIFOs
> etc. That's a lot of them.
> 
> > About solving it directly in generic code, I agree it would be the ideal 
> > scenario. That's why I suggested that in RFCv1: if the thread handles a 
> > single IRQ, we reset irqs_unhandled to zero. That's a good pointer on the 
> > thread being stuck, and TBH I don't think this is too far away from the 
> > current 100/100k if we consider some of those handlers can handle multiple 
> > IRQs at once.
> 
> It has the downside that a scenario where there is a spurious interrupt
> flood with an occasional one inbetween which is handled by the thread
> will not be detected anymore. The accumulation and time period tracking
> are there for a reason.

Ok, fair point.
But if we disable it, we end up not having even those few successes.

But now that I understand the point is preventing the CPU from being hogged,
it makes sense: we prefer disabling the interrupt to compromising the system.
It also produces a warning that may help to track down the driver/device
issue.

> 
> But as I pointed out above the detection logic is flawed due to the
> unconditional accumulation. Can you give the uncompiled below a test
> ride with your scenario?
> 
> Thanks,
> 
>         tglx
> ---
>  include/linux/irqdesc.h |    2 ++
>  kernel/irq/spurious.c   |   33 +++++++++++++++++++++++++++++----
>  2 files changed, 31 insertions(+), 4 deletions(-)
> 
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -29,6 +29,7 @@ struct pt_regs;
>   * @wake_depth:		enable depth, for multiple irq_set_irq_wake() callers
>   * @tot_count:		stats field for non-percpu irqs
>   * @irq_count:		stats field to detect stalled irqs
> + * @first_unhandled:	Timestamp of the first unhandled interrupt
>   * @last_unhandled:	aging timer for unhandled count
>   * @irqs_unhandled:	stats field for spurious unhandled interrupts
>   * @threads_handled:	stats field for deferred spurious detection of threaded handlers
> @@ -64,6 +65,7 @@ struct irq_desc {
>  	unsigned int		wake_depth;	/* nested wake enables */
>  	unsigned int		tot_count;
>  	unsigned int		irq_count;	/* For detecting broken IRQs */
> +	unsigned long		first_unhandled;
>  	unsigned long		last_unhandled;	/* Aging timer for unhandled count */
>  	unsigned int		irqs_unhandled;
>  	atomic_t		threads_handled;
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -390,9 +390,14 @@ void note_interrupt(struct irq_desc *des
>  		 * working systems
>  		 */
>  		if (time_after(jiffies, desc->last_unhandled + HZ/10))
> -			desc->irqs_unhandled = 1;
> -		else
> -			desc->irqs_unhandled++;
> +			desc->irqs_unhandled = 0;
> +
> +		if (!desc->irqs_unhandled) {
> +			desc->irq_count = 0;
> +			desc->first_unhandled = jiffies;
> +		}
> +
> +		desc->irqs_unhandled++;
>  		desc->last_unhandled = jiffies;
>  	}
>  
> @@ -411,9 +416,27 @@ void note_interrupt(struct irq_desc *des
>  	if (likely(desc->irq_count < 100000))
>  		return;
>  
> -	desc->irq_count = 0;
>  	if (unlikely(desc->irqs_unhandled > 99900)) {
>  		/*
> +		 * Validate that this is actually an interrupt storm, which
> +		 * happens to:
> +		 *
> +		 *  - increment the unhandled count to ~100k within 5 seconds,
> +		 *    which means there is a spurious interrupt every 50us
> +		 *  - have an unhandled to handled ratio > 2
> +		 *
> +		 * Otherwise reset the detector and start over. This also
> +		 * covers the case where a threaded handler consumes
> +		 * several hard interrupts in one go which would otherwise
> +		 * accumulate to 99900 over time and trigger a false positive.
> +		 */
> +		unsigned long td = desc->last_unhandled - desc->first_unhandled;
> +		unsigned int handled = desc->irq_count - desc->irqs_unhandled;
> +
> +		if (td > 5 * HZ || (handled && desc->irqs_unhandled / handled < 3))
> +			goto out;
> +
> +		/*
>  		 * The interrupt is stuck
>  		 */
>  		__report_bad_irq(desc, action_ret);
> @@ -428,7 +451,9 @@ void note_interrupt(struct irq_desc *des
>  		mod_timer(&poll_spurious_irq_timer,
>  			  jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
>  	}
> +out:
>  	desc->irqs_unhandled = 0;
> +	desc->irq_count = 0;
>  }
>  
>  bool noirqdebug __read_mostly;
> 

Sure, I will give it a try.

Now having a better understanding of what is expected here, I will also try 
to experiment a little here.

Thank you!
Leo
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Leonardo Bras 1 year, 2 months ago
On Fri, Feb 23, 2024 at 01:37:39AM -0300, Leonardo Bras wrote:
> On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote:
> > On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> > > On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
> > >> >> In scenarios where there is no need to keep track of IRQs handled, convert
> > >> >> it back to IRQ_HANDLED.
> > >> >
> > >> > That's not really workable as you'd have to update tons of drivers just
> > >> > to deal with that corner case. That's error prone and just extra
> > >> > complexity all over the place.
> > >
> > > I agree, that's a downside of this implementation. 
> > 
> > A serious one which is not really workable. See below.
> > 
> > > I agree the above may be able to solve the issue, but it would make 2 extra 
> > > atomic ops necessary in the thread handling the IRQ, as well as one extra 
> > > atomic operation in note_interrupt(), which could increase latency on this 
> > > IRQ deferring the handler to a thread.
> > >
> > > I mean, yes, the cpu running note_interrupt() would probably already have 
> > > exclusiveness for this cacheline, but it further increases cacheline 
> > > bouncing and also adds the mem barriers that incur on atomic operations, 
> > > even if we use an extra bit from threads_handled instead of allocate a new 
> > > field for threads_running.
> > 
> > I think that's a strawman. Atomic operations can of course be more
> > expensive than non-atomic ones, but they only start to make a difference
> > when the cache line is contended. That's not the case here for the
> > normal operations.
> > 
> > Interrupts and their threads are strictly targeted to a single CPU and
> > the cache line is already hot and had to be made exclusive because of
> > other write operations to it.
> > 
> > There is usually no concurrency at all, except for administrative
> > operations like enable/disable or affinity changes. Those administrative
> > operations are not high frequency and the resulting cache line bouncing
> > is unavoidable even without that change. But does it matter in the
> > larger picture? I don't think so.
> 
> That's a fair point, but there are some use cases that use CPU Isolation on 
> top of PREEMPT_RT in order to reduce interference on a CPU running an RT 
> workload.
> 
> For those cases, IIRC the handler will run on a different (housekeeping) 
> CPU when those IRQs originate on an Isolated CPU, meaning the above 
> described cacheline bouncing is expected.
> 
> 
> > 
> > > On top of that, let's think on a scenario where the threaded handler will 
> > > solve a lot of requests, but not necessarily spend a lot of time doing so.
> > > This allows the thread to run for little time while solving a lot of 
> > > requests.
> > >
> > > In this scenario, note_interrupt() could return without incrementing 
> > > irqs_unhandled for those IRQ that happen while the brief thread is running, 
> > > but every other IRQ would cause note_interrupt() to increase 
> > > irqs_unhandled, which would cause the bug to still reproduce.
> > 
> > In theory yes. Does it happen in practice?
> > 
> > But that exposes a flaw in the actual detection code. The code is
> > unconditionally accumulating if there is an unhandled interrupt within
> > 100ms after the last unhandled one. IOW, if there is a periodic
> > unhandled one every 50ms, the interrupt will be shut down after 100000 *
> > 50ms = 5000s ~= 83.3m ~= 1.4h. And it neither cares about the number of
> > actually handled interrupts.
> > 
> > The spurious detector is really about runaway interrupts which hog a CPU
> > completely, but the above is not what we want to protect against.
> 
> Now it makes a lot more sense to me.
> Thanks!

Hi Thomas,

I would like to go back to this discussion :)
From what I could understand after reading back the thread:

- The spurious detector is used to avoid hogging a cpu when lots of IRQs are
  hitting it, but few ( < 100 / 100k) are being handled. It works by
  disabling that interrupt.

- The bug I am dealing with (on serial8250) happens to fit exactly the
  above case: lots of requests, but few are handled.
  The reason: threaded handler, many requests, and they are dealt with in
  batches: multiple requests are handled at once, but a single IRQ_HANDLED is
  returned.

- My proposed solution: Find a way of accounting the requests handled.

  - Implementation: add an option for drivers to voluntarily report how
    many requests they handled. Current drivers need no change.

  - Limitation: If this issue is found on another driver, we need to 
    implement accounting there as well. This may only happen on drivers
    which handle over 1k requests at once.
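
  (To make the batch accounting concrete, the encoding from this patch
  works out like this for, say, 5 requests handled in one run:

      IRQ_RETVAL_MANY(5) == IRQ_HANDLED_MANY | (5 << IRQ_HANDLED_MANY_SHIFT)
                         == 0x04 | 0x28 == 0x2c
      IRQ_HANDLED_MANY_GET(0x2c) == 0x2c >> 3 == 5

  i.e. irq_add_handled() gets 5 instead of 1.)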


What was left for me TODO:
Think of a generic solution for this issue, to avoid dealing with that
on a per-driver basis.

That's what I was able to think about:

- Only the driver code knows how many requests it handled, so without  
  touching them we can't know how many requests were properly handled.

- I could try thinking of a different solution, which involves changing only
  the spurious detector.

  - For that I would need to find a particular characteristic we would want 
    to avoid spurious detection against, and make sure it won't miss an
    actual case we want to be protected against.

Generic solutions(?) proposed:
- Zero irqs_unhandled if threaded & handles a single request in 100k
  - Problem: A regular issue with the interruption would not be detected 
    in the driver. 

- Skip detection if threaded & the handling thread is running
  - Problem 1: the thread may run shortly and batch handle a lot of stuff, 
  not being detected by the spurious detector. 
  - Problem 2: the thread may get stuck, not handle the IRQs and also not
  being detected by the spurious handler. (IIUC)


In the end, I could not find a proper way of telling apart
a - "this is a real spurious IRQ behavior, which needs to be disabled", and 
b - "this is just a handler that batch-handles it's requests",
without touching the drivers' code.

Do you have any suggestion on how to do that?

Thanks!
Leo
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Andy Shevchenko 1 year, 2 months ago
On Thu, Nov 14, 2024 at 12:40:17AM -0300, Leonardo Bras wrote:
> On Fri, Feb 23, 2024 at 01:37:39AM -0300, Leonardo Bras wrote:
> > On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote:
> > > On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> > > > On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
> > > >> >> In scenarios where there is no need to keep track of IRQs handled, convert
> > > >> >> it back to IRQ_HANDLED.
> > > >> >
> > > >> > That's not really workable as you'd have to update tons of drivers just
> > > >> > to deal with that corner case. That's error prone and just extra
> > > >> > complexity all over the place.
> > > >
> > > > I agree, that's a downside of this implementation. 
> > > 
> > > A serious one which is not really workable. See below.
> > > 
> > > > I agree the above may be able to solve the issue, but it would make 2 extra 
> > > > atomic ops necessary in the thread handling the IRQ, as well as one extra 
> > > > atomic operation in note_interrupt(), which could increase latency on this 
> > > > IRQ deferring the handler to a thread.
> > > >
> > > > I mean, yes, the cpu running note_interrupt() would probably already have 
> > > > exclusiveness for this cacheline, but it further increases cacheline 
> > > > bouncing and also adds the mem barriers that incur on atomic operations, 
> > > > even if we use an extra bit from threads_handled instead of allocate a new 
> > > > field for threads_running.
> > > 
> > > I think that's a strawman. Atomic operations can of course be more
> > > expensive than non-atomic ones, but they only start to make a difference
> > > when the cache line is contended. That's not the case here for the
> > > normal operations.
> > > 
> > > Interrupts and their threads are strictly targeted to a single CPU and
> > > the cache line is already hot and had to be made exclusive because of
> > > other write operations to it.
> > > 
> > > There is usually no concurrency at all, except for administrative
> > > operations like enable/disable or affinity changes. Those administrative
> > > operations are not high frequency and the resulting cache line bouncing
> > > is unavoidable even without that change. But does it matter in the
> > > larger picture? I don't think so.
> > 
> > That's a fair point, but there are some use cases that use CPU Isolation on 
> > top of PREEMPT_RT in order to reduce interference on a CPU running an RT 
> > workload.
> > 
> > For those cases, IIRC the handler will run on a different (housekeeping) 
> > CPU when those IRQs originate on an Isolated CPU, meaning the above 
> > described cacheline bouncing is expected.
> > 
> > 
> > > 
> > > > On top of that, let's think on a scenario where the threaded handler will 
> > > > solve a lot of requests, but not necessarily spend a lot of time doing so.
> > > > This allows the thread to run for little time while solving a lot of 
> > > > requests.
> > > >
> > > > In this scenario, note_interrupt() could return without incrementing 
> > > > irqs_unhandled for those IRQ that happen while the brief thread is running, 
> > > > but every other IRQ would cause note_interrupt() to increase 
> > > > irqs_unhandled, which would cause the bug to still reproduce.
> > > 
> > > In theory yes. Does it happen in practice?
> > > 
> > > But that exposes a flaw in the actual detection code. The code is
> > > unconditionally accumulating if there is an unhandled interrupt within
> > > 100ms after the last unhandled one. IOW, if there is a periodic
> > > unhandled one every 50ms, the interrupt will be shut down after 100000 *
> > > 50ms = 5000s ~= 83.3m ~= 1.4h. And it neither cares about the number of
> > > actually handled interrupts.
> > > 
> > > The spurious detector is really about runaway interrupts which hog a CPU
> > > completely, but the above is not what we want to protect against.
> > 
> > Now it makes a lot more sense to me.
> > Thanks!
> 
> Hi Thomas,
> 
> I would like to go back to this discussion :)
> From what I could understand after reading back the thread:
> 
> - The spurious detector is used to avoid hogging a cpu when lots of IRQs are
>   hitting it, but few ( < 100 / 100k) are being handled. It works by
>   disabling that interrupt.
> 
> - The bug I am dealing with (on serial8250) happens to fit exactly the
>   above case: lots of requests, but few are handled.
>   The reason: threaded handler, many requests, and they are dealt with in
>   batches: multiple requests are handled at once, but a single IRQ_HANDLED is
>   returned.
> 
> - My proposed solution: Find a way of accounting the requests handled.
> 
>   - Implementation: add an option for drivers to voluntarily report how
>     many requests they handled. Current drivers need no change.

>   - Limitation: If this issue is found on another driver, we need to 
>     implement accounting there as well. This may only happen on drivers
>     which handle over 1k requests at once.

> What was left for me TODO:
> Think of a generic solution for this issue, to avoid dealing with that
> on a per-driver basis.
> 
> That's what I was able to think about:

> - Only the driver code knows how many requests it handled, so without  
>   touching them we can't know how many requests were properly handled.

Hmm... But do I understand correctly the following:

- the IRQ core knows the amount of generated IRQs for the device (so it's kinda
obvious that IRQ number maps to the driver);

- the IRQ core disables IRQ while handling an IRQ number in question;

- the driver is supposed to handle all IRQs that were reported at the beginning
of its handler;

- taking the above, the amount of handled IRQs is what came till the disabling
IRQ. IRQs that came after should be replayed when IRQ gets enabled.

?

> - I could try thinking of a different solution, which involves changing only
>   the spurious detector.
> 
>   - For that I would need to find a particular characteristic we would want 
>     to avoid spurious detection against, and make sure it won't miss an
>     actual case we want to be protected against.
> 
> Generic solutions(?) proposed:
> - Zero irqs_unhandled if threaded & handles a single request in 100k
>   - Problem: A regular issue with the interruption would not be detected 
>     in the driver. 
> 
> - Skip detection if threaded & the handling thread is running
>   - Problem 1: the thread may run shortly and batch handle a lot of stuff, 
>   not being detected by the spurious detector. 
>   - Problem 2: the thread may get stuck, not handle the IRQs and also not
>   being detected by the spurious handler. (IIUC)
> 
> 
> In the end, I could not find a proper way of telling apart
> a - "this is a real spurious IRQ behavior, which needs to be disabled", and 
> b - "this is just a handler that batch-handles it's requests",
> without touching the drivers' code.
> 
> Do you have any suggestion on how to do that?

-- 
With Best Regards,
Andy Shevchenko
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Leonardo Bras 1 year, 2 months ago
On Thu, Nov 14, 2024 at 09:50:36AM +0200, Andy Shevchenko wrote:
> On Thu, Nov 14, 2024 at 12:40:17AM -0300, Leonardo Bras wrote:
> > On Fri, Feb 23, 2024 at 01:37:39AM -0300, Leonardo Bras wrote:
> > > On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote:
> > > > On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> > > > > On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
> > > > >> >> In scenarios where there is no need to keep track of IRQs handled, convert
> > > > >> >> it back to IRQ_HANDLED.
> > > > >> >
> > > > >> > That's not really workable as you'd have to update tons of drivers just
> > > > >> > to deal with that corner case. That's error prone and just extra
> > > > >> > complexity all over the place.
> > > > >
> > > > > I agree, that's a downside of this implementation. 
> > > > 
> > > > A serious one which is not really workable. See below.
> > > > 
> > > > > I agree the above may be able to solve the issue, but it would make 2 extra 
> > > > > atomic ops necessary in the thread handling the IRQ, as well as one extra 
> > > > > atomic operation in note_interrupt(), which could increase latency on this 
> > > > > IRQ deferring the handler to a thread.
> > > > >
> > > > > I mean, yes, the cpu running note_interrupt() would probably already have 
> > > > > exclusiveness for this cacheline, but it further increases cacheline 
> > > > > bouncing and also adds the mem barriers that incur on atomic operations, 
> > > > > even if we use an extra bit from threads_handled instead of allocate a new 
> > > > > field for threads_running.
> > > > 
> > > > I think that's a strawman. Atomic operations can of course be more
> > > > expensive than non-atomic ones, but they only start to make a difference
> > > > when the cache line is contended. That's not the case here for the
> > > > normal operations.
> > > > 
> > > > Interrupts and their threads are strictly targeted to a single CPU and
> > > > the cache line is already hot and had to be made exclusive because of
> > > > other write operations to it.
> > > > 
> > > > There is usually no concurrency at all, except for administrative
> > > > operations like enable/disable or affinity changes. Those administrative
> > > > operations are not high frequency and the resulting cache line bouncing
> > > > is unavoidable even without that change. But does it matter in the
> > > > larger picture? I don't think so.
> > > 
> > > That's a fair point, but there are some use cases that use CPU Isolation on 
> > > top of PREEMPT_RT in order to reduce interference on a CPU running an RT 
> > > workload.
> > > 
> > > For those cases, IIRC the handler will run on a different (housekeeping) 
> > > CPU when those IRQs originate on an Isolated CPU, meaning the above 
> > > described cacheline bouncing is expected.
> > > 
> > > 
> > > > 
> > > > > On top of that, let's think on a scenario where the threaded handler will 
> > > > > solve a lot of requests, but not necessarily spend a lot of time doing so.
> > > > > This allows the thread to run for little time while solving a lot of 
> > > > > requests.
> > > > >
> > > > > In this scenario, note_interrupt() could return without incrementing 
> > > > > irqs_unhandled for those IRQ that happen while the brief thread is running, 
> > > > > but every other IRQ would cause note_interrupt() to increase 
> > > > > irqs_unhandled, which would cause the bug to still reproduce.
> > > > 
> > > > In theory yes. Does it happen in practice?
> > > > 
> > > > But that exposes a flaw in the actual detection code. The code is
> > > > unconditionally accumulating if there is an unhandled interrupt within
> > > > 100ms after the last unhandled one. IOW, if there is a periodic
> > > > unhandled one every 50ms, the interrupt will be shut down after 100000 *
> > > > 50ms = 5000s ~= 83.3m ~= 1.4h. And it neither cares about the number of
> > > > actually handled interrupts.
> > > > 
> > > > The spurious detector is really about runaway interrupts which hog a CPU
> > > > completely, but the above is not what we want to protect against.
> > > 
> > > Now it makes a lot more sense to me.
> > > Thanks!
> > 
> > Hi Thomas,
> > 
> > I would like to go back to this discussion :)
> > From what I could understand after reading back the thread:
> > 
> > - The spurious detector is used to avoid hogging a cpu when lots of IRQs are
> >   hitting it, but few ( < 100 / 100k) are being handled. It works by
> >   disabling that interrupt.
> > 
> > - The bug I am dealing with (on serial8250) happens to fit exactly the
> >   above case: lots of requests, but few are handled.
> >   The reason: threaded handler, many requests, and they are dealt with in
> >   batches: multiple requests are handled at once, but a single IRQ_HANDLED is
> >   returned.
> > 
> > - My proposed solution: Find a way of accounting the requests handled.
> > 
> >   - Implementation: add an option for drivers to voluntarily report how
> >     many requests they handled. Current drivers need no change.
> 
> >   - Limitation: If this issue is found on another driver, we need to 
> >     implement accounting there as well. This may only happen on drivers
> >     which handle over 1k requests at once.
> 
> > What was left for me TODO:
> > Think of a generic solution for this issue, to avoid dealing with that
> > on a per-driver basis.
> > 
> > That's what I was able to think about:
> 
> > - Only the driver code knows how many requests it handled, so without  
> >   touching them we can't know how many requests were properly handled.
> 

Hello Andy, thanks for reviewing!


> Hmm... But do I understand correctly the following:
> 
> - the IRQ core knows the amount of generated IRQs for the device (so it's kinda
> obvious that IRQ number maps to the driver);

Yes, I could understand that as well.

> 
> - the IRQ core disables IRQ while handling an IRQ number in question;

Not necessarily:
When irqs are force-threaded, only a quick handler is called, returning
IRQ_WAKE_THREAD, which is supposed to wake up the handler thread.

  * @IRQ_WAKE_THREAD:   handler requests to wake the handler thread

In this case (which is what I am dealing with), the actual handler will run 
in thread context (which I suppose doesn't disable IRQs, for sched-out
purposes).
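
(For reference, the primary handler the core installs for forced threading
is essentially just this, from kernel/irq/manage.c:)

        static irqreturn_t irq_default_primary_handler(int irq, void *dev_id)
        {
                return IRQ_WAKE_THREAD;
        }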

> 
> - the driver is supposed to handle all IRQs that were reported at the beginning
> of its handler;

That I am not aware of. I suppose it depends on the driver implementation.

But if this one is correct, and must be true for every handler, then my 
first approach should be the right fix. See [1] below.


Below I am assuming handled IRQs = 'handlers which returned IRQ_HANDLED':

> 
> - taking the above, the amount of handled IRQs is what came till the disabling
> IRQ.

Sure

> IRQs that came after should be replayed when IRQ gets enabled.
> 
> ?

Not sure about this one as well.
You mean the ones that got queued for thread-handling, but somehow got 
paused since the interrupt got disabled?

If not, I guess once you disable an IRQ no interruption on that line happens,
so I don't think any interruption gets saved for later (at least not at the
kernel level).

But I may be wrong here, it's a guess.

> 
> > - I could try thinking of a different solution, which involves changing only
> >   the spurious detector.
> > 
> >   - For that I would need to find a particular characteristic we would want 
> >     to avoid spurious detection against, and make sure it won't miss an
> >     actual case we want to be protected against.
> > 
> > Generic solutions(?) proposed:

[1] here:

> > - Zero irqs_unhandled if threaded & handles a single request in 100k
> >   - Problem: A regular issue with the interruption would not be detected 
> >     in the driver. 
> > 
> > - Skip detection if threaded & the handling thread is running
> >   - Problem 1: the thread may run shortly and batch handle a lot of stuff, 
> >   not being detected by the spurious detector. 
> >   - Problem 2: the thread may get stuck, not handle the IRQs and also not
> >   being detected by the spurious handler. (IIUC)
> > 
> > 
> > In the end, I could not find a proper way of telling apart
> > a - "this is a real spurious IRQ behavior, which needs to be disabled", and 
> > b - "this is just a handler that batch-handles it's requests",
> > without touching the drivers' code.
> > 
> > Do you have any suggestion on how to do that?
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 
> 

Thanks!
Leo
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Thomas Gleixner 1 year, 2 months ago
On Mon, Nov 18 2024 at 22:15, Leonardo Bras wrote:
> On Thu, Nov 14, 2024 at 09:50:36AM +0200, Andy Shevchenko wrote:
>> - the IRQ core disables IRQ while handling an IRQ number in question;
>
> Not necessarily:
> When irqs are force-threaded, only a quick handler is called, returning
> IRQ_WAKE_THREAD, which is supposed to wake up the handler thread.
>
>   * @IRQ_WAKE_THREAD:   handler requests to wake the handler thread
>
> In this case (which is what I am dealing with), the actual handler will run 
> in thread context (which I suppose doesn't disable IRQs, for sched-out
> purposes).

Let's talk about the disable irq concepts here. There are 3 distinct variants:

  1) Disable interrupts on the local CPU: local_irq_disable/save().

     This only prevents the CPU from handling a pending interrupt,
     but does not prevent new interrupts from being marked pending
     in the interrupt controller.

  2) Disable interrupts in the interrupt controller

     disable_irq*() variants handle that. They come with two
     flavors: LAZY and UNLAZY

     LAZY does not mask the interrupt line right away. It only is masked
     when an interrupt arrives while the line is marked "disabled". This
     then sets the PENDING bit, which in turn causes a replay of the
     interrupt once enable_irq() is bringing the disabled count down to
     zero. LAZY has two purposes:

       A) Prevent losing edge type interrupts

       B) Optimization to avoid the maybe costly hardware access

     UNLAZY masks the interrupt right away.

  3) Disable interrupts at the device level

     This prevents the device from raising interrupts.
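
(Side note, added for reference: a driver which needs the UNLAZY flavour
opts in once via the existing status flag API before requesting the
interrupt, e.g.:

        /* Make disable_irq() mask the line in the controller right away */
        irq_set_status_flags(irq, IRQ_DISABLE_UNLAZY);
)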

Now there is another twist with force threaded interrupts.

If the interrupt is level triggered the interrupt line is masked in the
hard interrupt handler and unmasked when the thread returns.
         
For edge type interrupts that's handled differently in order to not lose
an edge. The line is not masked, so that if the device raises an
interrupt before the threaded handler returns, the threaded handler is
marked runnable again. That again comes in two flavors:

  1) Non-RT kernel

     That cannot result in a scenario where the force threaded handler
     loops and interrupts keep arriving at the CPU because the CPU has
     interrupts disabled across the force threaded handler and at least
     on x86 and arm64 that's the CPU which is target of the hardware
     interrupt.

     So at max there can come a few interrupts between the first
     interrupt and the threaded handler being scheduled in, but I doubt
     it can be more than three.

  2) RT kernel

     The forced threaded handler runs with CPU interrupts enabled, so
     when the threaded handler deals with the device, new interrupts can
     come in at the hardware level and are accounted for, which is what
     you are trying to address, right?

     So the total amount of TX bytes, i.e. PASS_LIMIT * tx_loadsz,
     becomes large enough to trigger the irq storm detector, right?

That obviously begs two questions:

  1) Why do you want to deal with this interrupt flood at the storm
     detector and not at the root cause of the whole thing?

     It does not make any sense to take a gazillion of interrupts for
     nothing and then work around the resulting detector fallout.

     The obvious adhoc cure is to add this to serial8250_interrupt():

        if (IS_ENABLED(CONFIG_PREEMPT_RT))
        	disable_irq_nosync(irq);
                
        magic_loop();

        if (IS_ENABLED(CONFIG_PREEMPT_RT))
        	enable_irq(irq);

     Ideally we'd disable interrupt generation in the IER register
     around the loop, but that's another can of worms as this can't be
     done easily without holding port lock across the IER disabled
     section. Also see below.


  2) Is the 512 iterations loop (PASS_LIMIT) still appropriate?

     That loop in serial8250_interrupt() looks pretty horrible even on a
     non-RT kernel with an emulated UART.

     The issue is mostly irrelevant on real hardware as the FIFO writes
     are faster than the UART can empty the FIFO. So unless there are
     two FIFO UARTs sharing the same interrupt line _and_ writing
     large amount of data at the same time the loop will terminate after
     one write cycle. There won't be an interrupt per TX byte either.

     On emulated hardware this is very different because the write
     causes a VMEXIT with a full round trip to user space, which
     consumes the character and immediately raises the TX empty
     interrupt again because there is no "slow" hardware involved.

     With the default qemu FIFO depth of 16 bytes, this results in up to
     512 * 16 = 8192 bytes written out with interrupts disabled for one
     invocation of the interrupt handler (threaded or not)...

     I just traced interrupt entry/exit in a VM and for a file with 256k
     bytes written to /dev/ttyS0 this results in:

     Interrupts:          35
     Bytes/Interrupt:  ~7490
     Tmax:            104725 us
     Tavg:             96778 us

     So one interrupt handler invocation which writes up to 8k takes
     roughly 100ms with interrupts disabled...

     Now looking at the host side. For every write to the TX FIFO there
     are _four_ invocations of kvm_set_irq() originating from kvm_vm_ioctl_irq_line():

      CPU 36/KVM-2063    [097] ..... 1466862.728737: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0xd 
      CPU 36/KVM-2063    [097] ..... 1466862.728742: kvm_set_irq: gsi 4 level 0 source 0
      CPU 36/KVM-2063    [097] ..... 1466862.728749: kvm_set_irq: gsi 4 level 0 source 0
      CPU 36/KVM-2063    [097] ..... 1466862.728754: kvm_set_irq: gsi 4 level 1 source 0
      CPU 36/KVM-2063    [097] ..... 1466862.728762: kvm_set_irq: gsi 4 level 1 source 0
      CPU 36/KVM-2063    [097] ..... 1466862.728783: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0xa 
      CPU 36/KVM-2063    [097] ..... 1466862.728787: kvm_set_irq: gsi 4 level 0 source 0
      CPU 36/KVM-2063    [097] ..... 1466862.728792: kvm_set_irq: gsi 4 level 0 source 0
      CPU 36/KVM-2063    [097] ..... 1466862.728797: kvm_set_irq: gsi 4 level 1 source 0
      CPU 36/KVM-2063    [097] ..... 1466862.728809: kvm_set_irq: gsi 4 level 1 source 0

     No idea why this needs four ioctls and not only two, but this
     clearly demonstrates that each byte written creates an edge
     interrupt. No surprise that the storm detector triggers on RT. But
     this also makes it clear why it's a patently bad idea to work
     around this in the detector...

     Now being "smart" and disabling THRI in the IER before the write
     loop in serial8250_tx_chars() changes the picture to:
     
      CPU 18/KVM-2045    [092] ..... 1466230.357478: kvm_pio: pio_write at 0x3f9 size 1 count 1 val 0x5

      IER.THRI is cleared now so it's expected that nothing fiddles with
      the interrupt line anymore, right? But ....

      CPU 18/KVM-2045    [092] ..... 1466230.357479: kvm_set_irq: gsi 4 level 0 source 0
      CPU 18/KVM-2045    [092] ..... 1466230.357481: kvm_set_irq: gsi 4 level 0 source 0
      CPU 18/KVM-2045    [092] ..... 1466230.357484: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0x73 
      CPU 18/KVM-2045    [092] ..... 1466230.357485: kvm_set_irq: gsi 4 level 0 source 0
      CPU 18/KVM-2045    [092] ..... 1466230.357487: kvm_set_irq: gsi 4 level 0 source 0
      CPU 18/KVM-2045    [092] ..... 1466230.357489: kvm_set_irq: gsi 4 level 0 source 0
      CPU 18/KVM-2045    [092] ..... 1466230.357491: kvm_set_irq: gsi 4 level 0 source 0
      ....

     So every write to the TX FIFO sets the level to 0 four times in a
     row to not create an edge. That's truly impressive!

Thanks,

        tglx
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Andy Shevchenko 1 year, 2 months ago
On Mon, Nov 18, 2024 at 10:15:40PM -0300, Leonardo Bras wrote:
> On Thu, Nov 14, 2024 at 09:50:36AM +0200, Andy Shevchenko wrote:
> > On Thu, Nov 14, 2024 at 12:40:17AM -0300, Leonardo Bras wrote:
> > > On Fri, Feb 23, 2024 at 01:37:39AM -0300, Leonardo Bras wrote:
> > > > On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote:
> > > > > On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> > > > > > On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
> > > > > >> >> In scenarios where there is no need to keep track of IRQs handled, convert
> > > > > >> >> it back to IRQ_HANDLED.
> > > > > >> >
> > > > > >> > That's not really workable as you'd have to update tons of drivers just
> > > > > >> > to deal with that corner case. That's error prone and just extra
> > > > > >> > complexity all over the place.
> > > > > >
> > > > > > I agree, that's a downside of this implementation. 
> > > > > 
> > > > > A serious one which is not really workable. See below.
> > > > > 
> > > > > > I agree the above may be able to solve the issue, but it would make 2 extra 
> > > > > > atomic ops necessary in the thread handling the IRQ, as well as one extra 
> > > > > > atomic operation in note_interrupt(), which could increase latency for this 
> > > > > > IRQ when deferring the handler to a thread.
> > > > > >
> > > > > > I mean, yes, the cpu running note_interrupt() would probably already have 
> > > > > > exclusive access to this cacheline, but it further increases cacheline 
> > > > > > bouncing and also adds the memory barriers that atomic operations incur, 
> > > > > > even if we use an extra bit from threads_handled instead of allocating a new 
> > > > > > field for threads_running.
> > > > > 
> > > > > I think that's a strawman. Atomic operations can of course be more
> > > > > expensive than non-atomic ones, but they only start to make a difference
> > > > > when the cache line is contended. That's not the case here for the
> > > > > normal operations.
> > > > > 
> > > > > Interrupts and their threads are strictly targeted to a single CPU and
> > > > > the cache line is already hot and had to be made exclusive because of
> > > > > other write operations to it.
> > > > > 
> > > > > There is usually no concurrency at all, except for administrative
> > > > > operations like enable/disable or affinity changes. Those administrative
> > > > > operations are not high frequency and the resulting cache line bouncing
> > > > > is unavoidable even without that change. But does it matter in the
> > > > > larger picture? I don't think so.
> > > > 
> > > > That's a fair point, but there are some use cases that use CPU Isolation on 
> > > > top of PREEMPT_RT in order to reduce interference on a CPU running an RT 
> > > > workload.
> > > > 
> > > > For those cases, IIRC the handler will run on a different (housekeeping) 
> > > > CPU when those IRQs originate on an Isolated CPU, meaning the above 
> > > > described cacheline bouncing is expected.
> > > > 
> > > > 
> > > > > 
> > > > > > On top of that, let's think of a scenario where the threaded handler will 
> > > > > > handle a lot of requests, but not necessarily spend a lot of time doing so.
> > > > > > This allows the thread to run for little time while handling a lot of 
> > > > > > requests.
> > > > > >
> > > > > > In this scenario, note_interrupt() could return without incrementing 
> > > > > > irqs_unhandled for those IRQs that happen while the brief thread is running, 
> > > > > > but every other IRQ would cause note_interrupt() to increase 
> > > > > > irqs_unhandled, which would cause the bug to still reproduce.
> > > > > 
> > > > > In theory yes. Does it happen in practice?
> > > > > 
> > > > > But that exposes a flaw in the actual detection code. The code is
> > > > > unconditionally accumulating if there is an unhandled interrupt within
> > > > > 100ms after the last unhandled one. IOW, if there is a periodic
> > > > > unhandled one every 50ms, the interrupt will be shut down after 100000 *
> > > > > 50ms = 5000s ~= 83.3m ~= 1.4h. Nor does it care about the number of
> > > > > actually handled interrupts.
> > > > > 
> > > > > The spurious detector is really about runaway interrupts which hog a CPU
> > > > > completely, but the above is not what we want to protect against.
> > > > 
> > > > Now it makes a lot more sense to me.
> > > > Thanks!
> > > 
> > > I would like to go back to this discussion :)
> > > From what I could understand, and read back the thread:
> > > 
> > > - The spurious detector is used to avoid hogging a cpu when lots of IRQs 
> > >   are hitting it, but few (< 100 out of 100k) are being handled. It works
> > >   by disabling that interrupt line.
> > > 
> > > - The bug I am dealing with (on serial8250) happens to fit the above
> > >   case exactly: lots of requests, but few are handled.
> > >   The reason: threaded handler, many requests, and they are dealt with in 
> > >   batch: multiple requests are handled at once, but a single IRQ_HANDLED 
> > >   is returned.
> > > 
> > > - My proposed solution: Find a way of accounting the requests handled.
> > > 
> > >   - Implementation: add an option for drivers to voluntarily report how 
> > >     many requests they handled. Current drivers need no change.
> > 
> > >   - Limitation: If this issue is found in another driver, we need to 
> > >     implement accounting there as well. This may only happen in drivers
> > >     which handle over 1k requests at once.
> > 
> > > What was left for me TODO:
> > > Think of a generic solution for this issue, to avoid dealing with it 
> > > on a per-driver basis. 
> > > 
> > > That's what I was able to think about:
> > 
> > > - Only the driver code knows how many requests it handled, so without  
> > >   touching them we can't know how many requests were properly handled.
> 
> Hello Andy, thanks for reviewing!
> 
> > Hmm... But do I understand correctly the following:
> > 
> > - the IRQ core knows the number of IRQs generated for the device (so it's kinda
> > obvious that the IRQ number maps to the driver);
> 
> Yes, I could understand that as well.
> 
> > - the IRQ core disables IRQ while handling an IRQ number in question;
> 
> Not necessarily:
> When IRQs are force-threaded, only a quick handler is called, returning 
> IRQ_WAKE_THREAD, which is supposed to wake up the handler thread.
> 
>   * @IRQ_WAKE_THREAD:   handler requests to wake the handler thread
> 
> In this case (which is what I am dealing with), the actual handler will run 
> in thread context (which I suppose doesn't disable IRQs, for sched-out 
> purposes).

Each IRQ can be disabled on both sides: device side and host side. I think we are
talking about the host side here, as the IRQ code doesn't know about each and every
device. Am I right? Then the host side of the device IRQ is disabled by the core and
it doesn't prevent scheduling from happening (of course, if it's a UP system with
nohz fully on, this IRQ might be the only thing that schedules the system, but I
don't see any other way out here).

At least this is how I interpret the 81e2073c175b ("genirq: Disable interrupts
for force threaded handlers"). Maybe that I'm missing something obvious here...
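
For reference, that commit makes the force-threaded handler run with hard
interrupts disabled on non-RT. A simplified sketch of the relevant bit of
kernel/irq/manage.c after it:

        static irqreturn_t irq_forced_thread_fn(struct irq_desc *desc,
                                                struct irqaction *action)
        {
                irqreturn_t ret;

                local_bh_disable();
                if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                        local_irq_disable();
                ret = action->thread_fn(action->irq, action->dev_id);
                /* ... irq_finalize_oneshot(), local_irq_enable() ... */
                local_bh_enable();
                return ret;
        }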

> > - the driver is supposed to handle all IRQs that were reported at the beginning
> > of its handler;
> 
> That I am not aware about. I suppose it depends on driver implementation.

This is just an expectation, and some drivers have been proven to lose
performance if they don't follow it:

0b60557230ad ("usb: ehci: Prevent missed ehci interrupts with edge-triggered MSI")
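
The pattern there is essentially this (a generic sketch of the idea; the
device names and the status register are hypothetical):

        static irqreturn_t my_edge_handler(int irq, void *dev_id)
        {
                struct my_dev *dev = dev_id;
                int handled = 0;
                u32 status;

                /*
                 * Re-read status until the device is quiescent. With an
                 * edge-triggered line, an event arriving mid-handler
                 * would otherwise never be signalled again.
                 */
                while ((status = readl(dev->regs + MY_STATUS)) != 0) {
                        my_handle_events(dev, status);
                        handled = 1;
                }

                return IRQ_RETVAL(handled);
        }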

> But if this one is correct, and must be true for every handler, then my 
> first approach should be the right fix. See [1] below.

I believe it depends on whether the IRQ is level or edge triggered. For level
triggered we may be stuck in the handler for a while if there is a stream of
IRQs, hence the driver may deliberately give up to give other tasks a chance
to progress.

> Below I am assuming handled IRQs = 'handlers which returned IRQ_HANDLED':

Right.

> > - taking the above the amount of handled IRQs is what came till the disabling
> > IRQ.
> 
> Sure
> 
> > IRQs that came after should be replayed when IRQ gets enabled.
> > 
> > ?
> 
> Not sure about this one as well.
> You mean the ones that got queued for thread-handling, but somehow got 
> paused since the interrupt got disabled?

If it's edge triggered, we may lose it if it doesn't get replayed.

> If not, I guess once you disable an IRQ no interrupt on that line happens,
> so I don't think any interrupt gets saved for later (at least not at 
> kernel level).

There is a mechanism for IRQ replay in the kernel, see IRQS_REPLAY and code
around it.
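
Grossly simplified, the resend path does this (the real code in
kernel/irq/resend.c also falls back to a software resend when the chip has
no irq_retrigger callback):

        /* on enable_irq(): a pending edge gets replayed */
        if (desc->istate & IRQS_PENDING) {
                desc->istate &= ~IRQS_PENDING;
                desc->istate |= IRQS_REPLAY;
                desc->irq_data.chip->irq_retrigger(&desc->irq_data);
        }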

> But I may be wrong here, it's a guess.
> 
> > > - I could try thinking a different solution, which involves changing only
> > >   the spurious detector.
> > > 
> > >   - For that I would need to find a particular characteristic we would want 
> > >     to avoid spurious detection against, and make sure it won't miss an
> > >     actual case we want to be protected about.
> > > 
> > > Generic solutions(?) proposed:
> 
> [1] here:
> 
> > > - Zero irqs_unhandled if threaded & handles a single request in 100k
> > >   - Problem: A regular issue with the interrupt would not be detected 
> > >     in the driver. 
> > > 
> > > - Skip detection if threaded & the handling thread is running
> > >   - Problem 1: the thread may run briefly and batch-handle a lot of stuff, 
> > >   escaping the spurious detector. 
> > >   - Problem 2: the thread may get stuck, not handle the IRQs, and also not
> > >   be detected by the spurious detector. (IIUC)
> > > 
> > > In the end, I could not find a proper way of telling apart
> > > a - "this is real spurious IRQ behavior, which needs to be disabled", and 
> > > b - "this is just a handler that batch-handles its requests",
> > > without touching the drivers' code.
> > > 
> > > Do you have any suggestion on how to do that?

-- 
With Best Regards,
Andy Shevchenko
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Thomas Gleixner 1 year, 11 months ago
On Fri, Feb 23 2024 at 01:37, Leonardo Bras wrote:
> On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote:
>> There is usually no concurrency at all, except for administrative
>> operations like enable/disable or affinity changes. Those administrative
>> operations are not high frequency and the resulting cache line bouncing
>> is unavoidable even without that change. But does it matter in the
>> larger picture? I don't think so.
>
> That's a fair point, but there are some use cases that use CPU Isolation on 
> top of PREEMPT_RT in order to reduce interference on a CPU running an RT 
> workload.
>
> For those cases, IIRC the handler will run on a different (housekeeping) 
> CPU when those IRQs originate on an Isolated CPU, meaning the above 
> described cacheline bouncing is expected.

No. The affinity of the interrupt and the thread are strictly coupled
and always on the same CPU except for one instance during migration, but
that's irrelevant.

Thanks,

        tglx
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Thomas Gleixner 1 year, 11 months ago
On Wed, Feb 21 2024 at 16:41, Thomas Gleixner wrote:
> On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> But as I pointed out above the detection logic is flawed due to the
> unconditional accumulation. Can you give the uncompiled below a test
> ride with your scenario?

Bah. Ignore this. I misread the code completely. No idea where my brain
was.

This thing triggers only when there are 100K interrupts and 99.9k of
them unhandled. The 100k total resets the unhandled counts.

Though one thing which strikes me as odd is that this actually triggers at
all because it needs 99.9k unhandled out of 100k total. That means on
average every thread handler invocation handles 1000 hardware interrupts
in one go. Is that even realistic?

Thanks,

        tglx
Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
Posted by Leonardo Bras 1 year, 11 months ago
On Wed, Feb 21, 2024 at 06:04:21PM +0100, Thomas Gleixner wrote:
> On Wed, Feb 21 2024 at 16:41, Thomas Gleixner wrote:
> > On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> > But as I pointed out above the detection logic is flawed due to the
> > unconditional accumulation. Can you give the uncompiled below a test
> > ride with your scenario?
> 
> Bah. Ignore this. I misread the code completely. No idea where my brain
> was.
> 
> This thing triggers only when there are 100K interrupts and 99.9k of
> them unhandled. The 100k total resets the unhandled counts.
> 
> Though one thing which strikes me as odd is that this actually triggers at
> all because it needs 99.9k unhandled out of 100k total. That means on
> average every thread handler invocation handles 1000 hardware interrupts
> in one go. Is that even realistic?

Yeap, it triggers pretty easily if you bring up a VM with a serial console 
and try to use it to work with something very verbose.

It was detected by someone trying to unpack a kernel source tarball.

Maybe this is an issue that only becomes reproducible for this and maybe a 
couple of extra drivers, so the solution will only need to be implemented in 
those drivers when (if) this bug reproduces.

This being said, thank you for helping me improve my understanding of this 
piece of code. I will put some effort into trying to find a solution that 
works by changing generic code only, but I would like to understand whether 
the current proposal is valid if I am unable to find one.

Thanks!
Leo


> 
> Thanks,
> 
>         tglx
> 
>