[RFC PATCH 0/5] irq: sysfs interface improvements for SMP affinity control

Radu Rendec posted 5 patches 2 years, 8 months ago
include/linux/irq.h     |   9 +-
include/linux/irqdesc.h |   1 +
kernel/irq/debugfs.c    |   1 +
kernel/irq/internals.h  |  10 ++
kernel/irq/irqdesc.c    | 206 +++++++++++++++++++++++++++++++++++++---
kernel/irq/irqdomain.c  |  15 +++
kernel/irq/manage.c     |  20 +++-
kernel/irq/proc.c       |  72 +-------------
8 files changed, 244 insertions(+), 90 deletions(-)
[RFC PATCH 0/5] irq: sysfs interface improvements for SMP affinity control
Posted by Radu Rendec 2 years, 8 months ago
This patch set implements new sysfs interfaces that facilitate SMP
affinity control of chained interrupts. It follows the guidelines in
https://lore.kernel.org/all/87fslr7ygl.wl-maz@kernel.org/ with slight
deviations, which are explained below.

The assumption is that irqbalance must be aware of the chained interrupt
topology regardless of how it is exposed to userspace, for the following
reasons:
- Interrupt counters are not updated for the parent interrupt. Counters
  must be read separately for each of the chained interrupts and summed
  up to assess the CPU usage impact of the group as a whole.
- The affinity setting is shared by all multiplexed interrupts (and the
  parent interrupt) and cannot be changed individually.

Since irqbalance must be aware of the topology anyway, it is easier to
move parts of the problem there and reduce the complexity of the kernel
changes.
- Instead of creating a new affinity interface for chained interrupts
  that has different semantics from the existing procfs interface (and
  changes the affinity of the parent interrupt in the case of muxed
  interrupts), it is easier to let irqbalance set the affinity of the
  parent interrupt by itself (since it already knows who the parent is).
- Tracking groups of interrupts in the kernel creates additional
  synchronization challenges that are otherwise unnecessary. The kernel
  already has a (struct irq_desc).parent_irq field that can be (re)used
  for this purpose (see below).

Brief description of the patches in this set:
- Patch 1/5 makes the (struct irq_desc).parent_irq field available
  unconditionally. So far, it has been used for IRQ-resend and depended
  on CONFIG_HARDIRQS_SW_RESEND. But it can be (re)used to track chained
  interrupt parents for the general use case, without any changes to the
  existing IRQ chip drivers.
- Patch 2/5 is trivial and just exposes (struct irq_desc).parent_irq in
  debugfs.
- Patch 3/5 exposes the chained interrupt topology in sysfs in two ways:
  the muxed_irqs directory (as described in the original email thread)
  and the parent_irq symlink. From a userspace perspective, they are
  redundant. However, in the first case the synchronization is likely
  incomplete/broken and not so easy to fix.
- Patch 4/5 moves the SMP affinity write handlers from procfs code to
  generic code, with the intention to reuse them for a new sysfs
  interface.
- Patch 5/5 creates a sysfs interface for the affinity, with identical
  semantics to the existing procfs interface. The sole purpose is to
  allow userspace (irqbalance) to control the affinity of the parent
  interrupt, which is typically *not* visible in procfs.

The only required change to existing chained IRQ chip drivers in order
to support the new affinity control is to call irq_set_parent() in their
.map domain op. If they use the newer hierarchical API, they should call
irq_set_parent() in their .alloc domain op instead. This doesn't affect
the existing procfs based affinity interface in any way.

A few IRQ chip drivers already call irq_set_parent() in their .map
domain op to implement IRQ-resend. No change is required to those
drivers to support the new affinity control.

Last but not least, it turns out that hierarchical domains are entirely
out of the scope of these changes (unless chained interrupts are used
along the path). In the case of hierarchical domains, each interrupt in
the outermost domain has a *single* corresponding Linux virq (that is
mapped to each domain in the hierarchy). That makes it perfectly safe to
implement the .irq_set_affinity chip op as irq_chip_set_affinity_parent
and delegate affinity control to the parent chip/domain. This will *not*
suddenly change the affinity of a different interrupt behind anyone's
back simply because there cannot be another interrupt that shares the
same affinity setting.

Note: I still need to update the Documentation/ directory for the new
      sysfs interface, and I will address that in a future version.
      At this point, I just want to get feedback about the current
      approach.

Radu Rendec (5):
  irq: Always enable parent interrupt tracking
  irq: Show the parent chained interrupt in debugfs
  irq: Expose chained interrupt parents in sysfs
  irq: Move SMP affinity write handler out of proc.c
  irq: Add smp_affinity/list attributes to sysfs

 include/linux/irq.h     |   9 +-
 include/linux/irqdesc.h |   1 +
 kernel/irq/debugfs.c    |   1 +
 kernel/irq/internals.h  |  10 ++
 kernel/irq/irqdesc.c    | 206 +++++++++++++++++++++++++++++++++++++---
 kernel/irq/irqdomain.c  |  15 +++
 kernel/irq/manage.c     |  20 +++-
 kernel/irq/proc.c       |  72 +-------------
 8 files changed, 244 insertions(+), 90 deletions(-)

-- 
2.40.1
Re: [RFC PATCH 0/5] irq: sysfs interface improvements for SMP affinity control
Posted by Thomas Gleixner 2 years, 8 months ago
On Tue, May 30 2023 at 17:45, Radu Rendec wrote:
> Note: I still need to update the Documentation/ directory for the new
>       sysfs interface, and I will address that in a future version.
>       At this point, I just want to get feedback about the current
>       approach.

From a conceptual POV I understand why this is required, which makes me
hate this chained mechanism even more.

Aside of having no visibility (counters, affinity, etc) the worst thing
about these chained hidden interrupts is:

  There is no control of runaway interrupts as they circumvent the core,
  which has caused hard to diagnose system hangups in the past. See
   
    ba714a9c1dea ("pinctrl/amd: Use regular interrupt instead of chained")

  for demonstration.

The argument I heard for this chained interrupt muck is that it's so
much more performant than using regular interrupt handlers via
request_irq(). It's obviously less overhead, but whether it matters and
most importantly whether it justifies the downsides is a different
question.

There is also the argument about double accounting. Right now the
chained handler is not accounted and only the childs are.

Though that is inconsistent with other demultiplex handlers which _must_
use regular interrupt handlers (threaded) to demultiplex interrupt chips
which sit behind SPI/I2C...

The sum of child interrupts is also not necessarily the number of parent
interrupts, unless there is always exactly only one child handler to
invoke.

Quite some of those demux handlers are also not RT compatible.

AFAICT, there is no real technical reason at least not for the vast
majority of usage sites why the demultiplex handler cannot run inside a
regular interrupt handler context.

So I personally would prefer to get rid of this chained oddball and just
have consistent mechanisms for dealing with interrupts, which would just
avoid exposing the affinity files in two different places.

Providing information about child/parent relationship is an orthogonal
issue.

If there is some good reason (aside of the chained muck) to have sysfs
based affinity management, then I'm not objecting as long as the
functionality is the same, i.e. effective affinity needs be exposed too.

Thanks,

        tglx
Re: [RFC PATCH 0/5] irq: sysfs interface improvements for SMP affinity control
Posted by Radu Rendec 2 years, 7 months ago
On Wed, 2023-05-31 at 15:09 +0200, Thomas Gleixner wrote:
> On Tue, May 30 2023 at 17:45, Radu Rendec wrote:
> > Note: I still need to update the Documentation/ directory for the new
> >       sysfs interface, and I will address that in a future version.
> >       At this point, I just want to get feedback about the current
> >       approach.
> 
> From a conceptual POV I understand why this is required, which makes me
> hate this chained mechanism even more.
> 
> Aside of having no visibility (counters, affinity, etc) the worst thing
> about these chained hidden interrupts is:
> 
>   There is no control of runaway interrupts as they circumvent the core,
>   which has caused hard to diagnose system hangups in the past. See
>    
>     ba714a9c1dea ("pinctrl/amd: Use regular interrupt instead of chained")
> 
>   for demonstration.
> 
> The argument I heard for this chained interrupt muck is that it's so
> much more performant than using regular interrupt handlers via
> request_irq(). It's obviously less overhead, but whether it matters and
> most importantly whether it justifies the downsides is a different
> question.
> 
> There is also the argument about double accounting. Right now the
> chained handler is not accounted and only the childs are.
> 
> Though that is inconsistent with other demultiplex handlers which _must_
> use regular interrupt handlers (threaded) to demultiplex interrupt chips
> which sit behind SPI/I2C...
> 
> The sum of child interrupts is also not necessarily the number of parent
> interrupts, unless there is always exactly only one child handler to
> invoke.
> 
> Quite some of those demux handlers are also not RT compatible.
> 
> AFAICT, there is no real technical reason at least not for the vast
> majority of usage sites why the demultiplex handler cannot run inside a
> regular interrupt handler context.

I was about to send an RFC patch that converts a multiplexing IRQ chip
driver from chained to regular interrupts, when I realized I had come
full circle. Marc rejected a similar patch in the past [1] and the main
argument was that exposing the parent interrupt in procfs is backwards
incompatible. Quote:

 The problem of changing affinities for chained (or multiplexing)
 interrupts is, to make it short, that it breaks the existing userspace
 ABI that a change in affinity affects only the interrupt userspace
 acts upon, and no other. Which is why we don't expose any affinity
 setting for such an interrupt, as by definition changing its affinity
 affects all the interrupts that are muxed onto it.

So, there are two contradictory arguments here:
 * Chained interrupts are "muck", mainly because they circumvent the
   interrupt core and prevent the interrupt storm detector from kicking
   in when hardware misbehaves (and for other reasons as well).
 * Exposing the parent interrupt affinity control in procfs breaks the
   userspace ABI because all child interrupts inherently share the same
   affinity settings. This implies regular interrupts cannot be used.

Meanwhile, both interrupt storms and the lack of affinity control
support for multiplexing drivers are real problems, and my team and I
came across both. FWIW, the interrupt storm is the one we found more
recently, and it's the reason why I wanted to send that RFC patch I
mentioned before.

Is Marc's argument still relevant? Perhaps with both arguments in the
same email it becomes clearer that there needs to be some alignment at
the maintainer level. I would be more than happy to send patches and
help fix both issues. But I think that's not possible until you come up
with a strategy that you *both* agree to.

[1] https://lore.kernel.org/all/874k0bf7f7.wl-maz@kernel.org/

Thanks,
Radu

> So I personally would prefer to get rid of this chained oddball and just
> have consistent mechanisms for dealing with interrupts, which would just
> avoid exposing the affinity files in two different places.
> 
> Providing information about child/parent relationship is an orthogonal
> issue.
> 
> If there is some good reason (aside of the chained muck) to have sysfs
> based affinity management, then I'm not objecting as long as the
> functionality is the same, i.e. effective affinity needs be exposed too.
> 
> Thanks,
> 
>         tglx
Re: [RFC PATCH 0/5] irq: sysfs interface improvements for SMP affinity control
Posted by Radu Rendec 2 years, 8 months ago
On Wed, 2023-05-31 at 15:09 +0200, Thomas Gleixner wrote:
> From a conceptual POV I understand why this is required, which makes me
> hate this chained mechanism even more.
> 
> [cut]
> 
> AFAICT, there is no real technical reason at least not for the vast
> majority of usage sites why the demultiplex handler cannot run inside a
> regular interrupt handler context.

Thanks for taking the time to explain everything in such detail.
Much appreciated!

> So I personally would prefer to get rid of this chained oddball and just
> have consistent mechanisms for dealing with interrupts, which would just
> avoid exposing the affinity files in two different places.

Does this mean that if I came across a chained driver that lacked
affinity support, then changing it to use regular interrupts via
request_irq() would be a viable solution to enable affinity support
and you would consider accepting such a patch?

Taking a step back, in the case of hierarchical domains where *no*
multiplexing is involved, do you consider setting .irq_set_affinity()
to irq_chip_set_affinity_parent() a good way to enable affinity support
(assuming, of course, the driver lacks such support originally)?

At the end of the day, what I'm trying to do is find a way to enable
affinity support for a few specific drivers where it's lacking. My
initial impression, after reading Marc's message[1], was that the fix
lay at the system level, at least for multiplexing drivers. Hence my
naive attempt at a system-level fix. It is now becoming clear that the
fix will have to be evaluated/addressed at the driver level, on a case
by case basis.

As a side note, one thing I particularly like about your approach is
that it doesn't require any changes to irqbalance.

> Providing information about child/parent relationship is an orthogonal
> issue.

Agreed. Do you see any value in doing that? And, if yes, is reusing
(struct irq_desc).parent_irq and irq_set_parent() a good way of doing
it? FWIW, a multiplexing driver could do that regardless of how it
registers a handler for the parent interrupt (chained/regular).

> If there is some good reason (aside of the chained muck) to have sysfs
> based affinity management, then I'm not objecting as long as the
> functionality is the same, i.e. effective affinity needs be exposed too.

I can't think of any other reason. AFAICT, chained interrupts are the
only interrupts that are active but not visible in procfs. For any
other purpose, the procfs interface is good as it is.

Thanks,
Radu

[1] https://lore.kernel.org/all/87fslr7ygl.wl-maz@kernel.org/