[PATCH 0/6] genirq/test: Platform/architecture fixes

Brian Norris posted 6 patches 1 month, 2 weeks ago
There is a newer version of this series
Posted by Brian Norris 1 month, 2 weeks ago
The new kunit tests at kernel/irq/irq_test.c were primarily tested on
x86_64, with QEMU and with ARCH=um builds. Naturally, there are other
architectures that throw complications in the mix, with various CPU
hotplug and IRQ implementation choices.

Guenter has been dutifully noticing and reporting these errors, in
places like:
https://lore.kernel.org/all/b4cf04ea-d398-473f-bf11-d36643aa50dd@roeck-us.net/

I hope I've addressed all the failures, but it's hard to tell when I
don't have cross-compilers and QEMU setups for all of these
architectures.

I've tested what I could on arm, powerpc, x86_64, and um ARCH.

This series is based on David's patch for these tests:

[PATCH] genirq/test: Fix depth tests on architectures with NOREQUEST by default.
https://lore.kernel.org/all/20250816094528.3560222-2-davidgow@google.com/


Brian Norris (6):
  genirq/test: Select IRQ_DOMAIN
  genirq/test: Factor out fake-virq setup
  genirq/test: Fail early if we can't request an IRQ
  genirq/test: Skip managed-affinity tests with !SPARSE_IRQ
  genirq/test: Drop CONFIG_GENERIC_IRQ_MIGRATION assumptions
  genirq/test: Ensure CPU 1 is online for hotplug test

 kernel/irq/Kconfig    |  1 +
 kernel/irq/irq_test.c | 64 ++++++++++++++++++++-----------------------
 2 files changed, 31 insertions(+), 34 deletions(-)

-- 
2.51.0.rc1.167.g924127e9c0-goog
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by Guenter Roeck 1 month, 1 week ago
On Mon, Aug 18, 2025 at 12:27:37PM -0700, Brian Norris wrote:
> The new kunit tests at kernel/irq/irq_test.c were primarily tested on
> x86_64, with QEMU and with ARCH=um builds. Naturally, there are other
> architectures that throw complications in the mix, with various CPU
> hotplug and IRQ implementation choices.
> 
> Guenter has been dutifully noticing and reporting these errors, in
> places like:
> https://lore.kernel.org/all/b4cf04ea-d398-473f-bf11-d36643aa50dd@roeck-us.net/
> 
> I hope I've addressed all the failures, but it's hard to tell when I
> don't have cross-compilers and QEMU setups for all of these
> architectures.
> 
> I've tested what I could on arm, powerpc, x86_64, and um ARCH.
> 
> This series is based on David's patch for these tests:
> 
> [PATCH] genirq/test: Fix depth tests on architectures with NOREQUEST by default.
> https://lore.kernel.org/all/20250816094528.3560222-2-davidgow@google.com/
> 
Looks pretty good.

Build results:
	total: 162 pass: 162 fail: 0
Qemu test results:
	total: 637 pass: 637 fail: 0
Unit test results:
	pass: 640616 fail: 13
Failed unit tests:
	arm64:imx8mp-evk:irq_cpuhotplug_test
	arm64:imx8mp-evk:irq_test_cases
	m68k:q800:irq_test_cases
	m68k:virt:irq_test_cases

Individual failures:

[   32.613761]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:210
[   32.613761]     Expected remove_cpu(1) == 0, but
[   32.613761]         remove_cpu(1) == -16 (0xfffffffffffffff0)
[   32.621522]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:212
[   32.621522]     Expected add_cpu(1) == 0, but
[   32.621522]         add_cpu(1) == 1 (0x1)
[   32.630930]     # irq_cpuhotplug_test: pass:0 fail:1 skip:0 total:1

    # irq_disable_depth_test: ASSERTION FAILED at kernel/irq/irq_test.c:53
    Expected virq >= 0, but
        virq == -12 (0xfffffffffffffff4)
    # irq_disable_depth_test: pass:0 fail:1 skip:0 total:1
    not ok 1 irq_disable_depth_test
    # irq_free_disabled_test: ASSERTION FAILED at kernel/irq/irq_test.c:53
    Expected virq >= 0, but
        virq == -12 (0xfffffffffffffff4)
    # irq_free_disabled_test: pass:0 fail:1 skip:0 total:1
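
The raw values in these logs are negative errno codes: on Linux, -16 is -EBUSY and -12 is -ENOMEM. A minimal userspace sketch for decoding such a value follows; `describe_ret()` is a hypothetical helper written for illustration, not part of KUnit or the kernel, and the text returned by `strerror()` depends on the C library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a negative kernel-style return code the way
 * the KUnit log shows it (decimal, then 64-bit hex), followed by the C
 * library's description of the corresponding errno.
 */
static void describe_ret(long long ret, char *buf, size_t len)
{
	assert(ret < 0);	/* only negative errno-style values make sense here */

	/* strerror() expects the positive errno number. */
	snprintf(buf, len, "%lld (0x%llx): %s", ret,
		 (unsigned long long)ret, strerror((int)-ret));
}
```

Calling `describe_ret(-16, buf, sizeof(buf))` yields a string that starts with `-16 (0xfffffffffffffff0)`, matching the remove_cpu(1) line above.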

Guenter
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by Brian Norris 1 month, 1 week ago
On Thu, Aug 21, 2025 at 10:02:52AM -0700, Guenter Roeck wrote:
> Build results:
> 	total: 162 pass: 162 fail: 0
> Qemu test results:
> 	total: 637 pass: 637 fail: 0
> Unit test results:
> 	pass: 640616 fail: 13
> Failed unit tests:
> 	arm64:imx8mp-evk:irq_cpuhotplug_test
> 	arm64:imx8mp-evk:irq_test_cases
> 	m68k:q800:irq_test_cases
> 	m68k:virt:irq_test_cases
> 
> Individual failures:
> 
> [   32.613761]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:210
> [   32.613761]     Expected remove_cpu(1) == 0, but
> [   32.613761]         remove_cpu(1) == -16 (0xfffffffffffffff0)
> [   32.621522]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:212
> [   32.621522]     Expected add_cpu(1) == 0, but
> [   32.621522]         add_cpu(1) == 1 (0x1)
> [   32.630930]     # irq_cpuhotplug_test: pass:0 fail:1 skip:0 total:1

I managed to get an imx8mp-evk setup running (both little and big
endian) and couldn't reproduce. But I'm guessing based on the logs that
we're racing with pci_call_probe(), which disables CPU hotplug
(cpu_hotplug_disable()) for its duration.

I'm not sure how to handle that.

1. I could just SKIP the test on EBUSY. But that'd make for flaky test
   coverage.
2. Expose some method to block cpu_hotplug_disable() users temporarily.
3. Stop trying to do CPU hotplug in a unit test. (It's bordering on
   "integration test"; but it's still useful IMO...)
4. Add an EBUSY retry loop? Or some other similar polling (if we had,
   say, a cpu_hotplug_disabled() API).

>     # irq_disable_depth_test: ASSERTION FAILED at kernel/irq/irq_test.c:53
>     Expected virq >= 0, but
>         virq == -12 (0xfffffffffffffff4)
>     # irq_disable_depth_test: pass:0 fail:1 skip:0 total:1
>     not ok 1 irq_disable_depth_test
>     # irq_free_disabled_test: ASSERTION FAILED at kernel/irq/irq_test.c:53
>     Expected virq >= 0, but
>         virq == -12 (0xfffffffffffffff4)
>     # irq_free_disabled_test: pass:0 fail:1 skip:0 total:1

We've discussed this one, and I have a fix (depends on SPARSE_IRQ).

Brian
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by Guenter Roeck 1 month, 1 week ago
On 8/21/25 12:06, Brian Norris wrote:
> On Thu, Aug 21, 2025 at 10:02:52AM -0700, Guenter Roeck wrote:
>> Build results:
>> 	total: 162 pass: 162 fail: 0
>> Qemu test results:
>> 	total: 637 pass: 637 fail: 0
>> Unit test results:
>> 	pass: 640616 fail: 13
>> Failed unit tests:
>> 	arm64:imx8mp-evk:irq_cpuhotplug_test
>> 	arm64:imx8mp-evk:irq_test_cases
>> 	m68k:q800:irq_test_cases
>> 	m68k:virt:irq_test_cases
>>
>> Individual failures:
>>
>> [   32.613761]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:210
>> [   32.613761]     Expected remove_cpu(1) == 0, but
>> [   32.613761]         remove_cpu(1) == -16 (0xfffffffffffffff0)
>> [   32.621522]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:212
>> [   32.621522]     Expected add_cpu(1) == 0, but
>> [   32.621522]         add_cpu(1) == 1 (0x1)
>> [   32.630930]     # irq_cpuhotplug_test: pass:0 fail:1 skip:0 total:1
> 
> I managed to get an imx8mp-evk setup running (both little and big
> endian) and couldn't reproduce. But I'm guessing based on the logs that
> we're racing with pci_call_probe(), which disables CPU hotplug
> (cpu_hotplug_disable()) for its duration.
> 
> I'm not sure how to handle that.
> 
> 1. I could just SKIP the test on EBUSY. But that'd make for flaky test
>     coverage.
> 2. Expose some method to block cpu_hotplug_disable() users temporarily.
> 3. Stop trying to do CPU hotplug in a unit test. (It's bordering on
>     "integration test"; but it's still useful IMO...)
> 4. Add an EBUSY retry loop? Or some other similar polling (if we had,
>     say, a cpu_hotplug_disabled() API).
> 

Here is an additional data point: It only happens with big endian tests.
This always happens in my setup, and it only happens when booting from
virtio-pci but not when booting from other devices.

I just re-ran the test and it passed this time, so this is apparently
a flake. I'd suggest to ignore it for now. If I see it again and find
a clean way to reproduce it we can have another look. The emulated PCIe
controller for imx8mp-evk isn't exactly stable, so this may just be a side
effect of emulation problems.

Guenter
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by Brian Norris 1 month, 1 week ago
On Fri, Aug 22, 2025 at 11:34:04AM -0700, Guenter Roeck wrote:
> On 8/21/25 12:06, Brian Norris wrote:
> > On Thu, Aug 21, 2025 at 10:02:52AM -0700, Guenter Roeck wrote:
> > > Build results:
> > > 	total: 162 pass: 162 fail: 0
> > > Qemu test results:
> > > 	total: 637 pass: 637 fail: 0
> > > Unit test results:
> > > 	pass: 640616 fail: 13
> > > Failed unit tests:
> > > 	arm64:imx8mp-evk:irq_cpuhotplug_test
> > > 	arm64:imx8mp-evk:irq_test_cases
> > > 	m68k:q800:irq_test_cases
> > > 	m68k:virt:irq_test_cases
> > > 
> > > Individual failures:
> > > 
> > > [   32.613761]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:210
> > > [   32.613761]     Expected remove_cpu(1) == 0, but
> > > [   32.613761]         remove_cpu(1) == -16 (0xfffffffffffffff0)
> > > [   32.621522]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:212
> > > [   32.621522]     Expected add_cpu(1) == 0, but
> > > [   32.621522]         add_cpu(1) == 1 (0x1)
> > > [   32.630930]     # irq_cpuhotplug_test: pass:0 fail:1 skip:0 total:1
> > 
> > I managed to get an imx8mp-evk setup running (both little and big
> > endian) and couldn't reproduce. But I'm guessing based on the logs that
> > we're racing with pci_call_probe(), which disables CPU hotplug
> > (cpu_hotplug_disable()) for its duration.
> > 
> > I'm not sure how to handle that.
> > 
> > 1. I could just SKIP the test on EBUSY. But that'd make for flaky test
> >     coverage.
> > 2. Expose some method to block cpu_hotplug_disable() users temporarily.
> > 3. Stop trying to do CPU hotplug in a unit test. (It's bordering on
> >     "integration test"; but it's still useful IMO...)
> > 4. Add an EBUSY retry loop? Or some other similar polling (if we had,
> >     say, a cpu_hotplug_disabled() API).

Ah, I see that add_cpu() (cpu_subsys_online()) already has an -EBUSY
retry loop, but remove_cpu() doesn't. So #4 seems like a good solution.
It might even make sense to retry in cpu_subsys_offline(), rather than
just in the test.

I'll give this some thought for later though.
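
For illustration only, option #4 could look roughly like the userspace sketch below. It is not the kernel implementation: `offline_cpu_retry()`, `offline_fn`, and `fake_remove_cpu()` are hypothetical names, the callback stands in for remove_cpu(), and a real version would presumably sleep between attempts rather than retry immediately.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for remove_cpu(): returns 0 or a negative errno. */
typedef int (*offline_fn)(unsigned int cpu, void *ctx);

/*
 * Call fn() until it stops returning -EBUSY or max_tries attempts are used.
 * Returns the result of the last attempt.
 */
static int offline_cpu_retry(offline_fn fn, unsigned int cpu, void *ctx,
			     int max_tries)
{
	int ret = -EBUSY;

	assert(max_tries > 0);

	/* A kernel version would likely sleep between attempts. */
	for (int i = 0; i < max_tries && ret == -EBUSY; i++)
		ret = fn(cpu, ctx);

	return ret;
}

/* Test double: reports -EBUSY for the first *ctx calls, then succeeds. */
static int fake_remove_cpu(unsigned int cpu, void *ctx)
{
	int *busy_left = ctx;

	(void)cpu;
	if (*busy_left > 0) {
		(*busy_left)--;
		return -EBUSY;
	}
	return 0;
}
```

Bounding the number of attempts keeps the test from hanging if hotplug stays disabled for a long time; in that case the caller still sees -EBUSY and can decide whether to fail or skip.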

> Here is an additional data point: It only happens with big endian tests.
> This always happens in my setup, and it only happens when booting from
> virtio-pci but not when booting from other devices.
> 
> I just re-ran the test and it passed this time, so this is apparently
> a flake. I'd suggest to ignore it for now. If I see it again and find
> a clean way to reproduce it we can have another look. The emulated PCIe
> controller for imx8mp-evk isn't exactly stable, so this may just be a side
> effect of emulation problems.

This furthers my suspicion that it's a race with PCIe probing. On the
failure case, the test is running right after some PCI scan logs.

But I'm fine deferring for now, since it's not very reproducible.

Brian
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by David Gow 1 month, 2 weeks ago
On Tue, 19 Aug 2025 at 03:28, Brian Norris <briannorris@chromium.org> wrote:
>
> The new kunit tests at kernel/irq/irq_test.c were primarily tested on
> x86_64, with QEMU and with ARCH=um builds. Naturally, there are other
> architectures that throw complications in the mix, with various CPU
> hotplug and IRQ implementation choices.
>
> Guenter has been dutifully noticing and reporting these errors, in
> places like:
> https://lore.kernel.org/all/b4cf04ea-d398-473f-bf11-d36643aa50dd@roeck-us.net/
>
> I hope I've addressed all the failures, but it's hard to tell when I
> don't have cross-compilers and QEMU setups for all of these
> architectures.
>
> I've tested what I could on arm, powerpc, x86_64, and um ARCH.
>
> This series is based on David's patch for these tests:
>
> [PATCH] genirq/test: Fix depth tests on architectures with NOREQUEST by default.
> https://lore.kernel.org/all/20250816094528.3560222-2-davidgow@google.com/
>
>

Thanks very much. These patches all look good to me, so the series is:

Reviewed-by: David Gow <davidgow@google.com>

I am, however, still getting test failures on m68k (with CONFIG_VIRT=y):
./tools/testing/kunit/kunit.py run --arch m68k --cross_compile m68k-linux-gnu- irq*
[14:54:23] =============== irq_test_cases (4 subtests) ================
[14:54:23]     # irq_disable_depth_test: ASSERTION FAILED at kernel/irq/irq_test.c:53
[14:54:23]     Expected virq >= 0, but
[14:54:23]         virq == -12 (0xfffffffffffffff4)
[14:54:23] [FAILED] irq_disable_depth_test
[14:54:23]     # irq_free_disabled_test: ASSERTION FAILED at kernel/irq/irq_test.c:53
[14:54:23]     Expected virq >= 0, but
[14:54:23]         virq == -12 (0xfffffffffffffff4)
[14:54:23] [FAILED] irq_free_disabled_test
[14:54:23] [SKIPPED] irq_shutdown_depth_test
[14:54:23] [SKIPPED] irq_cpuhotplug_test
[14:54:23]     # module: irq_test
[14:54:23] # irq_test_cases: pass:0 fail:2 skip:2 total:4
[14:54:23] # Totals: pass:0 fail:2 skip:2 total:4
[14:54:23] ================= [FAILED] irq_test_cases ==================
[14:54:23] ============================================================
[14:54:23] Testing complete. Ran 4 tests: failed: 2, skipped: 2

Looks like __irq_alloc_descs() is returning -ENOMEM (as
irq_find_free_area() is returning 200 w/ nr_irqs == 200, and
CONFIG_SPARSE_IRQ=n).

But all of the other architectures I found worked okay, so this is at
least an improvement.

Thanks,
-- David

> Brian Norris (6):
>   genirq/test: Select IRQ_DOMAIN
>   genirq/test: Factor out fake-virq setup
>   genirq/test: Fail early if we can't request an IRQ
>   genirq/test: Skip managed-affinity tests with !SPARSE_IRQ
>   genirq/test: Drop CONFIG_GENERIC_IRQ_MIGRATION assumptions
>   genirq/test: Ensure CPU 1 is online for hotplug test
>
>  kernel/irq/Kconfig    |  1 +
>  kernel/irq/irq_test.c | 64 ++++++++++++++++++++-----------------------
>  2 files changed, 31 insertions(+), 34 deletions(-)
>
> --
> 2.51.0.rc1.167.g924127e9c0-goog
>
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by Brian Norris 1 month, 2 weeks ago
On Wed, Aug 20, 2025 at 03:00:34PM +0800, David Gow wrote:
> Looks like __irq_alloc_descs() is returning -ENOMEM (as
> irq_find_free_area() is returning 200 w/ nr_irqs == 200, and
> CONFIG_SPARSE_IRQ=n).

Thanks for the insight. I bothered compiling my own qemu just so I can
run m68k this time, and I can reproduce.

I wonder if I should make everything (CONFIG_IRQ_KUNIT_TEST) depend on
CONFIG_SPARSE_IRQ, since it seems like arches like m68k can't enable
SPARSE_IRQ, and they can't allocate new (fake) IRQs without it. That'd
be a tweak to patch 4.

Or maybe just 'depends on !M68K', since architectures with higher
NR_IRQS headroom may still work even without SPARSE_IRQ.

> But all of the other architectures I found worked okay, so this is at
> least an improvement.

Thanks for the testing.

Brian
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by David Gow 1 month, 2 weeks ago
On Thu, 21 Aug 2025 at 01:22, Brian Norris <briannorris@chromium.org> wrote:
>
> On Wed, Aug 20, 2025 at 03:00:34PM +0800, David Gow wrote:
> > Looks like __irq_alloc_descs() is returning -ENOMEM (as
> > irq_find_free_area() is returning 200 w/ nr_irqs == 200, and
> > CONFIG_SPARSE_IRQ=n).
>
> Thanks for the insight. I bothered compiling my own qemu just so I can
> run m68k this time, and I can reproduce.
>
> I wonder if I should make everything (CONFIG_IRQ_KUNIT_TEST) depend on
> CONFIG_SPARSE_IRQ, since it seems like arches like m68k can't enable
> SPARSE_IRQ, and they can't allocate new (fake) IRQs without it. That'd
> be a tweak to patch 4.
>
> Or maybe just 'depends on !M68K', since architectures with higher
> NR_IRQS headroom may still work even without SPARSE_IRQ.
>

I'm not an m68k expert (so I've CCed Geert), but I think different
m68k configs do have different NR_IRQS, so it's possible there are
working m68k setups, too. (It also seems slightly suspicious to me
that exactly 200 IRQs are allocated here, though, so a lack of extra
headroom may be deliberate and/or triggered by something trying to
allocate all IRQs.)

Personally, I don't have any m68k machines lying around, so disabling
the test so my qemu scripts don't report errors is fine by me. Ideally
the dependency would be as narrow as possible, but that may well be
!M68K.

The other option would be to try to skip the test if there aren't free
IRQs, but maybe that'd hide real issues?

Regardless, I'll defer to the IRQ and m68k experts here: as long as
I'm not seeing errors, I'm happy. :-)

Cheers,
-- David
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by Geert Uytterhoeven 1 month, 1 week ago
Hi David,

On Thu, 21 Aug 2025 at 05:45, David Gow <davidgow@google.com> wrote:
> On Thu, 21 Aug 2025 at 01:22, Brian Norris <briannorris@chromium.org> wrote:
> > On Wed, Aug 20, 2025 at 03:00:34PM +0800, David Gow wrote:
> > > Looks like __irq_alloc_descs() is returning -ENOMEM (as
> > > irq_find_free_area() is returning 200 w/ nr_irqs == 200, and
> > > CONFIG_SPARSE_IRQ=n).
> >
> > Thanks for the insight. I bothered compiling my own qemu just so I can
> > run m68k this time, and I can reproduce.
> >
> > I wonder if I should make everything (CONFIG_IRQ_KUNIT_TEST) depend on
> > CONFIG_SPARSE_IRQ, since it seems like arches like m68k can't enable
> > SPARSE_IRQ, and they can't allocate new (fake) IRQs without it. That'd
> > be a tweak to patch 4.
> >
> > Or maybe just 'depends on !M68K', since architectures with higher
> > NR_IRQS headroom may still work even without SPARSE_IRQ.
>
> I'm not an m68k expert (so I've CCed Geert), but I think different
> m68k configs do have different NR_IRQS, so it's possible there are
> working m68k setups, too. (It also seems slightly suspicious to me
> that exactly 200 IRQs are allocated here, though, so a lack of extra
> headroom may be deliberate and/or triggered by something trying to
> allocate all IRQs.)
>
> Personally, I don't have any m68k machines lying around, so disabling
> the test so my qemu scripts don't report errors is fine by me. Ideally
> the dependency would be as narrow as possible, but that may well be
> !M68K.

M68k indeed has different values of NR_IRQS, based on the system(s)
support is enabled for.  These values are based on the IRQ hierarchy
of the system(s), which is rather fixed.  Hence this does not take
into account any additional irqchips that are being registered by
e.g. tests...

"git grep -w NR_IRQS -- arch/*/include/" shows m68k is not the only
architecture having that limitation...

> The other option would be to try to skip the test if there aren't free
> IRQs, but maybe that'd hide real issues?
>
> Regardless, I'll defer to the IRQ and m68k experts here: as long as
> I'm not seeing errors, I'm happy. :-)

kernel/irq/irqdesc.c:

    static bool irq_expand_nr_irqs(unsigned int nr)
    {
            if (nr > MAX_SPARSE_IRQS)
                    return false;
            nr_irqs = nr;
            return true;
    }

kernel/irq/internals.h:

    #ifdef CONFIG_SPARSE_IRQ
    # define MAX_SPARSE_IRQS        INT_MAX
    #else
    # define MAX_SPARSE_IRQS        NR_IRQS
    #endif

So probably the test should depend on SPARSE_IRQ?  Increasing NR_IRQS
everywhere when IRQ_KUNIT_TEST is enabled sounds rather invasive to me.
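
To make the failure mode concrete, here is a simplified userspace model of the two snippets above. It is a sketch under stated assumptions, not the kernel's code: `struct irq_model`, `model_expand_nr_irqs()`, and `model_alloc_descs()` are hypothetical names, allocated IRQs are assumed to be contiguous from 0, and `max_irqs` plays the role of MAX_SPARSE_IRQS.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Simplified model of the IRQ number space; not the kernel's structures. */
struct irq_model {
	unsigned int nr_irqs;	/* current size of the IRQ number space */
	unsigned int max_irqs;	/* NR_IRQS without SPARSE_IRQ, INT_MAX with it */
	unsigned int used;	/* IRQs [0, used) are taken (assumed contiguous) */
};

/* Mirrors the shape of irq_expand_nr_irqs() quoted above. */
static bool model_expand_nr_irqs(struct irq_model *m, unsigned int nr)
{
	if (nr > m->max_irqs)
		return false;
	m->nr_irqs = nr;
	return true;
}

/* Returns the first allocated IRQ number, or -ENOMEM if there is no room. */
static int model_alloc_descs(struct irq_model *m, unsigned int cnt)
{
	/* Stand-in for irq_find_free_area(): first free number. */
	unsigned int start = m->used;

	assert(cnt > 0);

	/* Growing past nr_irqs only works if the limit allows it. */
	if (start + cnt > m->nr_irqs && !model_expand_nr_irqs(m, start + cnt))
		return -ENOMEM;

	m->used = start + cnt;
	return (int)start;
}
```

With `nr_irqs == max_irqs == used == 200`, as in the m68k report, allocating one more IRQ returns -ENOMEM (-12); with `max_irqs == INT_MAX`, the same request succeeds and returns 200.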

BTW, given the test calls irq_domain_alloc_descs(), I think it should
also depend on IRQ_DOMAIN.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by Brian Norris 1 month, 1 week ago
Hi Geert,

On Thu, Aug 21, 2025 at 09:05:03AM +0200, Geert Uytterhoeven wrote:
> So probably the test should depend on SPARSE_IRQ?  Increasing NR_IRQS
> everywhere when IRQ_KUNIT_TEST is enabled sounds rather invasive to me.

Yeah, I was leaning toward 'depends on SPARSE_IRQ'.

> BTW, given the test calls irq_domain_alloc_descs(), I think it should
> also depend on IRQ_DOMAIN.

Right, that's in patch 1.

I'll resend the series with a 'depends on SPARSE_IRQ'.

Thanks,
Brian
Re: [PATCH 0/6] genirq/test: Platform/architecture fixes
Posted by Guenter Roeck 1 month, 2 weeks ago
On 8/20/25 10:22, Brian Norris wrote:
> On Wed, Aug 20, 2025 at 03:00:34PM +0800, David Gow wrote:
>> Looks like __irq_alloc_descs() is returning -ENOMEM (as
>> irq_find_free_area() is returning 200 w/ nr_irqs == 200, and
>> CONFIG_SPARSE_IRQ=n).
> 
> Thanks for the insight. I bothered compiling my own qemu just so I can
> run m68k this time, and I can reproduce.
> 
> I wonder if I should make everything (CONFIG_IRQ_KUNIT_TEST) depend on
> CONFIG_SPARSE_IRQ, since it seems like arches like m68k can't enable
> SPARSE_IRQ, and they can't allocate new (fake) IRQs without it. That'd
> be a tweak to patch 4.
> 
> Or maybe just 'depends on !M68K', since architectures with higher
> NR_IRQS headroom may still work even without SPARSE_IRQ.
> 
>> But all of the other architectures I found worked okay, so this is at
>> least an improvement.
> 
> Thanks for the testing.
> 
I applied the series to my testing branch. I'll run a full test tonight and
report results tomorrow.

Guenter