After the upper-layer driver removes the device and before the next
probe, events such as firmware updates may increase the number of
interrupts supported by the device. However, the irq_domain still
retains the old hwsize, which causes subsequent interrupt allocation
failures. Update hwsize during MSI-X device domain setup to fix this
issue.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
---
drivers/pci/msi/irqdomain.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
index 6e65f0f44112..485bfba059cc 100644
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -274,8 +274,11 @@ bool pci_setup_msix_device_domain(struct pci_dev *pdev, unsigned int hwsize)
 	if (WARN_ON_ONCE(pdev->msi_enabled))
 		return false;
-	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
+	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX)) {
+		if (msi_domain_update_hwsize(&pdev->dev, MSI_DEFAULT_DOMAIN, hwsize))
+			pr_warn("too big MSI-X hwsize:%u\n", hwsize);
 		return true;
+	}
 	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
 		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
--
2.32.0.3.g01195cf9f
Hello,
kernel test robot noticed "RIP:msi_domain_update_hwsize" on:
commit: 8812ea3b58e360d64259b4b79523cec52ff6a014 ("[PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize")
url: https://github.com/intel-lab-lkp/linux/commits/Guixin-Liu/genirq-msi-Introduce-update-hwsize-helper/20260324-104239
base: https://git.kernel.org/cgit/linux/kernel/git/pci/pci.git next
patch link: https://lore.kernel.org/all/20260324014754.4973-3-kanie@linux.alibaba.com/
patch subject: [PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize
in testcase: perf-sanity-tests
version:
with following parameters:
perf_compiler: gcc
group: group-01
config: x86_64-rhel-9.4-bpf
compiler: gcc-14
test machine: 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202603312126.697741fb-lkp@intel.com
kern :warn : [ 227.562210] [ T1839] ------------[ cut here ]------------
kern :warn : [ 227.568011] [ T1929] idxd 0000:74:02.0: No in-kernel DMA with PASID. -1
kern :warn : [ 227.573941] [ T1839] WARNING: kernel/irq/msi.c:605 at msi_domain_update_hwsize+0xd2/0x130, CPU#0: 2/1839
kern :warn : [ 227.592907] [ T1839] Modules linked in: intel_cstate pmt_discovery drm_client_lib cxl_acpi intel_sdsi pmt_class intel_qat mei_me idxd(+) drm_shmem_helper soundcore libie nvme(+) cxl_port i2c_i801 intel_uncore pcspkr acpi_power_meter libie_adminq isst_if_common nvme_core drm_kms_helper mei idxd_bus intel_vsec crc8 i2c_smbus i2c_ismt wmi ipmi_si acpi_ipmi cxl_core ipmi_devintf einj ipmi_msghandler pinctrl_emmitsburg acpi_pad joydev pfr_update pfr_telemetry binfmt_misc drm nfnetlink ip_tables x_tables sch_fq_codel
kern :info : [ 227.597293] [ T1929] idxd 0000:74:02.0: Intel(R) Accelerator Device (v100)
kern :warn : [ 227.643806] [ T1839] CPU: 0 UID: 0 PID: 1839 Comm: kworker/0:2 Tainted: G S 7.0.0-rc1-00098-g8812ea3b58e3 #1 PREEMPT(full)
kern :info : [ 227.652130] [ T1868] idxd 0000:e7:01.0: enabling device (0144 -> 0146)
kern :warn : [ 227.665534] [ T1839] Tainted: [S]=CPU_OUT_OF_SPEC
kern :warn : [ 227.665537] [ T1839] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.9532.D02.2309201845 09/20/2023
kern :warn : [ 227.665540] [ T1839] Workqueue: sync_wq local_pci_probe_callback
kern :warn : [ 227.673200] [ T1868] idxd 0000:e7:01.0: No in-kernel DMA with PASID. -1
kern :warn : [ 227.678012] [ T1839] RIP: 0010:msi_domain_update_hwsize (kernel/irq/msi.c:605 (discriminator 7) kernel/irq/msi.c:1147 (discriminator 7))
kern :warn : [ 227.678017] [ T1839] Code: cc 48 8d bd 80 03 00 00 e8 8b 81 4c 00 48 8b 85 80 03 00 00 be ff ff ff ff 48 8d 78 70 e8 76 9d 8f 01 85 c0 0f 85 6b ff ff ff <0f> 0b e9 64 ff ff ff 0f 0b 5b b8 ed ff ff ff 5d 41 5c c3 cc cc cc
All code
========
0: cc int3
1: 48 8d bd 80 03 00 00 lea 0x380(%rbp),%rdi
8: e8 8b 81 4c 00 call 0x4c8198
d: 48 8b 85 80 03 00 00 mov 0x380(%rbp),%rax
14: be ff ff ff ff mov $0xffffffff,%esi
19: 48 8d 78 70 lea 0x70(%rax),%rdi
1d: e8 76 9d 8f 01 call 0x18f9d98
22: 85 c0 test %eax,%eax
24: 0f 85 6b ff ff ff jne 0xffffffffffffff95
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 64 ff ff ff jmp 0xffffffffffffff95
31: 0f 0b ud2
33: 5b pop %rbx
34: b8 ed ff ff ff mov $0xffffffed,%eax
39: 5d pop %rbp
3a: 41 5c pop %r12
3c: c3 ret
3d: cc int3
3e: cc int3
3f: cc int3
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 64 ff ff ff jmp 0xffffffffffffff6b
7: 0f 0b ud2
9: 5b pop %rbx
a: b8 ed ff ff ff mov $0xffffffed,%eax
f: 5d pop %rbp
10: 41 5c pop %r12
12: c3 ret
13: cc int3
14: cc int3
15: cc int3
kern :warn : [ 227.678022] [ T1839] RSP: 0018:ff11000136797920 EFLAGS: 00010246
kern :info : [ 227.692041] [ T407] nvme nvme1: pci function 0000:a8:00.0
kern :warn : [ 227.697311] [ T1839] RAX: 0000000000000000 RBX: 0000000000000081 RCX: 0000000000000001
kern :info : [ 227.720115] [ T1868] idxd 0000:e7:01.0: Intel(R) Accelerator Device (v100)
kern :warn : [ 227.733477] [ T1839] RDX: 0000000000000001 RSI: ff1100013bac6898 RDI: ff11000136789030
kern :warn : [ 227.733482] [ T1839] RBP: ff110030888620d0 R08: 0000000000000001 R09: ffe21c0026cf2f0c
kern :info : [ 227.734061] [ T2446] ipmi_ssif: IPMI SSIF Interface driver
kern :info : [ 227.741722] [ T1845] idxd 0000:f1:02.0: enabling device (0140 -> 0142)
kern :warn : [ 227.742240] [ T407] ------------[ cut here ]------------
kern :warn : [ 227.742244] [ T407] WARNING: kernel/irq/msi.c:605 at msi_domain_update_hwsize+0xd2/0x130, CPU#64: 0/407
kern :warn : [ 227.742254] [ T407] Modules linked in: qat_4xxx(+) intel_cstate ipmi_ssif pmt_discovery drm_client_lib cxl_acpi intel_sdsi pmt_class intel_qat mei_me idxd(+) drm_shmem_helper soundcore libie nvme(+) cxl_port i2c_i801 intel_uncore pcspkr acpi_power_meter libie_adminq isst_if_common nvme_core drm_kms_helper mei idxd_bus intel_vsec crc8 i2c_smbus i2c_ismt wmi ipmi_si acpi_ipmi cxl_core ipmi_devintf einj ipmi_msghandler pinctrl_emmitsburg acpi_pad joydev pfr_update pfr_telemetry binfmt_misc drm nfnetlink ip_tables x_tables sch_fq_codel
kern :warn : [ 227.742324] [ T407] CPU: 64 UID: 0 PID: 407 Comm: kworker/64:0 Tainted: G S 7.0.0-rc1-00098-g8812ea3b58e3 #1 PREEMPT(full)
kern :warn : [ 227.742329] [ T407] Tainted: [S]=CPU_OUT_OF_SPEC
kern :warn : [ 227.742331] [ T407] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.9532.D02.2309201845 09/20/2023
kern :warn : [ 227.742333] [ T407] Workqueue: sync_wq local_pci_probe_callback
kern :warn : [ 227.742339] [ T407] RIP: 0010:msi_domain_update_hwsize (kernel/irq/msi.c:605 (discriminator 7) kernel/irq/msi.c:1147 (discriminator 7))
kern :warn : [ 227.742343] [ T407] Code: cc 48 8d bd 80 03 00 00 e8 8b 81 4c 00 48 8b 85 80 03 00 00 be ff ff ff ff 48 8d 78 70 e8 76 9d 8f 01 85 c0 0f 85 6b ff ff ff <0f> 0b e9 64 ff ff ff 0f 0b 5b b8 ed ff ff ff 5d 41 5c c3 cc cc cc
All code
========
0: cc int3
1: 48 8d bd 80 03 00 00 lea 0x380(%rbp),%rdi
8: e8 8b 81 4c 00 call 0x4c8198
d: 48 8b 85 80 03 00 00 mov 0x380(%rbp),%rax
14: be ff ff ff ff mov $0xffffffff,%esi
19: 48 8d 78 70 lea 0x70(%rax),%rdi
1d: e8 76 9d 8f 01 call 0x18f9d98
22: 85 c0 test %eax,%eax
24: 0f 85 6b ff ff ff jne 0xffffffffffffff95
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 64 ff ff ff jmp 0xffffffffffffff95
31: 0f 0b ud2
33: 5b pop %rbx
34: b8 ed ff ff ff mov $0xffffffed,%eax
39: 5d pop %rbp
3a: 41 5c pop %r12
3c: c3 ret
3d: cc int3
3e: cc int3
3f: cc int3
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 64 ff ff ff jmp 0xffffffffffffff6b
7: 0f 0b ud2
9: 5b pop %rbx
a: b8 ed ff ff ff mov $0xffffffed,%eax
f: 5d pop %rbp
10: 41 5c pop %r12
12: c3 ret
13: cc int3
14: cc int3
15: cc int3
kern :warn : [ 227.742347] [ T407] RSP: 0018:ff11002084c7f920 EFLAGS: 00010246
kern :warn : [ 227.742351] [ T407] RAX: 0000000000000000 RBX: 0000000000000081 RCX: 0000000000000001
kern :warn : [ 227.742353] [ T407] RDX: 0000000000000001 RSI: ff110020dbdb8898 RDI: ff11002084c73d30
kern :warn : [ 227.742356] [ T407] RBP: ff1100209020a0d0 R08: 0000000000000001 R09: ffe21c041098ff0c
kern :warn : [ 227.742358] [ T407] R10: ff11002084c7f867 R11: 0000000000000000 R12: 0000000000000000
kern :warn : [ 227.742360] [ T407] R13: 0000000000000081 R14: ff1100209020a0d0 R15: 0000000000000000
kern :warn : [ 227.742363] [ T407] FS: 0000000000000000(0000) GS:ff11002eb3b3d000(0000) knlGS:0000000000000000
kern :warn : [ 227.742365] [ T407] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern :warn : [ 227.742368] [ T407] CR2: 00005577e39810b8 CR3: 0000004077672004 CR4: 0000000000771ef0
kern :warn : [ 227.742371] [ T407] PKRU: 55555554
kern :warn : [ 227.742373] [ T407] Call Trace:
kern :warn : [ 227.742375] [ T407] <TASK>
kern :warn : [ 227.742391] [ T407] pci_setup_msix_device_domain (drivers/pci/msi/irqdomain.c:278 (discriminator 1))
kern :warn : [ 227.742399] [ T407] __pci_enable_msix_range (drivers/pci/msi/msi.c:850 (discriminator 1))
kern :warn : [ 227.742406] [ T407] ? pci_free_irq_vectors (drivers/pci/msi/api.c:382)
kern :warn : [ 227.742412] [ T407] ? lock_release (kernel/locking/lockdep.c:470 (discriminator 4) kernel/locking/lockdep.c:5891 (discriminator 4) kernel/locking/lockdep.c:5875 (discriminator 4))
kern :warn : [ 227.742422] [ T407] pci_alloc_irq_vectors_affinity (drivers/pci/msi/api.c:270)
kern :warn : [ 227.742428] [ T407] ? __pfx_pci_alloc_irq_vectors_affinity (drivers/pci/msi/api.c:255)
kern :warn : [ 227.742432] [ T407] ? pci_free_msi_irqs (drivers/pci/msi/msi.c:929)
kern :warn : [ 227.742443] [ T407] nvme_setup_io_queues (drivers/nvme/host/pci.c:2746 drivers/nvme/host/pci.c:2832) nvme
kern :warn : [ 227.742466] [ T407] ? __pfx_nvme_setup_io_queues (drivers/nvme/host/pci.c:2763) nvme
kern :warn : [ 227.742482] [ T407] ? __pfx_nvme_calc_irq_sets (drivers/nvme/host/pci.c:2678) nvme
kern :warn : [ 227.742495] [ T407] ? preempt_count_sub (kernel/sched/core.c:5782 kernel/sched/core.c:5778 kernel/sched/core.c:5800)
kern :warn : [ 227.742501] [ T407] ? nvme_setup_host_mem (drivers/nvme/host/pci.c:2519) nvme
kern :warn : [ 227.742514] [ T407] ? _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:179 (discriminator 3) kernel/locking/spinlock.c:194 (discriminator 3))
kern :warn : [ 227.742521] [ T407] nvme_probe.cold (drivers/nvme/host/pci.c:3589) nvme
kern :warn : [ 227.742537] [ T407] ? __pfx_nvme_probe (drivers/nvme/host/pci.c:3530) nvme
kern :warn : [ 227.742552] [ T407] local_pci_probe (drivers/pci/pci-driver.c:324)
kern :warn : [ 227.742557] [ T407] local_pci_probe_callback (drivers/pci/pci-driver.c:352 (discriminator 1))
kern :warn : [ 227.742561] [ T407] process_one_work (arch/x86/include/asm/jump_label.h:37 include/trace/events/workqueue.h:110 kernel/workqueue.c:3280)
kern :warn : [ 227.742571] [ T407] ? __pfx_process_one_work (kernel/workqueue.c:3177)
kern :warn : [ 227.742575] [ T407] ? lock_acquire (include/trace/events/lock.h:24 (discriminator 15) include/trace/events/lock.h:24 (discriminator 15) kernel/locking/lockdep.c:5831 (discriminator 15))
kern :warn : [ 227.742579] [ T407] ? __list_add_valid_or_report (lib/list_debug.c:32 (discriminator 1))
kern :warn : [ 227.742586] [ T407] ? __pfx_local_pci_probe_callback (drivers/pci/pci-driver.c:349)
kern :warn : [ 227.742591] [ T407] worker_thread (kernel/workqueue.c:3352 (discriminator 2) kernel/workqueue.c:3439 (discriminator 2))
kern :warn : [ 227.742597] [ T407] ? __kthread_parkme (kernel/kthread.c:303 (discriminator 1))
kern :warn : [ 227.742602] [ T407] ? __pfx_worker_thread (kernel/workqueue.c:3385)
kern :warn : [ 227.742606] [ T407] kthread (kernel/kthread.c:467)
kern :warn : [ 227.742609] [ T407] ? kthread (kernel/kthread.c:443 (discriminator 1))
kern :warn : [ 227.742612] [ T407] ? __pfx_kthread (kernel/kthread.c:412)
kern :warn : [ 227.742617] [ T407] ret_from_fork (arch/x86/kernel/process.c:164)
kern :warn : [ 227.742623] [ T407] ? __pfx_ret_from_fork (arch/x86/kernel/process.c:153)
kern :warn : [ 227.742629] [ T407] ? __switch_to (include/linux/thread_info.h:142 (discriminator 2) arch/x86/kernel/process.h:17 (discriminator 2) arch/x86/kernel/process_64.c:676 (discriminator 2))
kern :warn : [ 227.742634] [ T407] ? __pfx_kthread (kernel/kthread.c:412)
kern :warn : [ 227.742639] [ T407] ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
kern :warn : [ 227.742651] [ T407] </TASK>
kern :warn : [ 227.742653] [ T407] irq event stamp: 2041
kern :warn : [ 227.742654] [ T407] hardirqs last enabled at (2047): vprintk_emit (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/printk/printk.c:2021 kernel/printk/printk.c:2478)
kern :warn : [ 227.742659] [ T407] hardirqs last disabled at (2052): vprintk_emit (kernel/printk/printk.c:2000 (discriminator 3) kernel/printk/printk.c:2478 (discriminator 3))
kern :warn : [ 227.742662] [ T407] softirqs last enabled at (1836): handle_softirqs (kernel/softirq.c:469 (discriminator 2) kernel/softirq.c:650 (discriminator 2))
kern :warn : [ 227.742668] [ T407] softirqs last disabled at (1831): __irq_exit_rcu (kernel/softirq.c:657 kernel/softirq.c:496 kernel/softirq.c:723)
kern :warn : [ 227.742671] [ T407] ---[ end trace 0000000000000000 ]---
kern :warn : [ 227.746239] [ T1839] R10: ff11000136797867 R11: 0000000000000000 R12: 0000000000000000
kern :warn : [ 227.746243] [ T1839] R13: 0000000000000081 R14: ff110030888620d0 R15: 0000000000000000
kern :warn : [ 227.746246] [ T1839] FS: 0000000000000000(0000) GS:ff11000eb5d3d000(0000) knlGS:0000000000000000
kern :warn : [ 227.748790] [ T1845] idxd 0000:f1:02.0: No in-kernel DMA with PASID. -1
kern :warn : [ 227.757536] [ T1839] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern :warn : [ 227.757540] [ T1839] CR2: 00007fd8fe021c70 CR3: 0000004077672004 CR4: 0000000000771ef0
kern :warn : [ 227.757543] [ T1839] PKRU: 55555554
kern :warn : [ 227.757545] [ T1839] Call Trace:
kern :info : [ 227.780919] [ T1845] idxd 0000:f1:02.0: Intel(R) Accelerator Device (v100)
kern :warn : [ 227.782837] [ T1839] <TASK>
kern :info : [ 227.796542] [ T407] nvme nvme1: 128/0/0 default/read/poll queues
kern :warn : [ 227.802183] [ T1839] pci_setup_msix_device_domain (drivers/pci/msi/irqdomain.c:278 (discriminator 1))
kern :info : [ 227.835748] [ T1670] nvme nvme1: Ignoring bogus Namespace Identifiers
kern :warn : [ 227.866727] [ T1839] __pci_enable_msix_range (drivers/pci/msi/msi.c:850 (discriminator 1))
kern :info : [ 227.962311] [ T1670] nvme1n1: p1 p2 p3 p4
kern :warn : [ 227.967364] [ T1839] ? pci_free_irq_vectors (drivers/pci/msi/api.c:382)
kern :warn : [ 227.967372] [ T1839] ? lock_release (kernel/locking/lockdep.c:470 (discriminator 4) kernel/locking/lockdep.c:5891 (discriminator 4) kernel/locking/lockdep.c:5875 (discriminator 4))
kern :warn : [ 228.376774] [ T1839] pci_alloc_irq_vectors_affinity (drivers/pci/msi/api.c:270)
kern :warn : [ 228.383465] [ T1839] ? __pfx_pci_alloc_irq_vectors_affinity (drivers/pci/msi/api.c:255)
kern :warn : [ 228.390737] [ T1839] ? pci_free_msi_irqs (drivers/pci/msi/msi.c:929)
kern :warn : [ 228.396166] [ T1839] nvme_setup_io_queues (drivers/nvme/host/pci.c:2746 drivers/nvme/host/pci.c:2832) nvme
kern :warn : [ 228.402590] [ T1839] ? __pfx_nvme_setup_io_queues (drivers/nvme/host/pci.c:2763) nvme
kern :warn : [ 228.409578] [ T1839] ? __pfx_nvme_calc_irq_sets (drivers/nvme/host/pci.c:2678) nvme
kern :warn : [ 228.416373] [ T1839] ? preempt_count_sub (kernel/sched/core.c:5782 kernel/sched/core.c:5778 kernel/sched/core.c:5800)
kern :warn : [ 228.421804] [ T1839] ? nvme_setup_host_mem (drivers/nvme/host/pci.c:2519) nvme
kern :warn : [ 228.428205] [ T1839] ? _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:179 (discriminator 3) kernel/locking/spinlock.c:194 (discriminator 3))
kern :warn : [ 228.434597] [ T1839] nvme_probe.cold (drivers/nvme/host/pci.c:3589) nvme
kern :warn : [ 228.440519] [ T1839] ? __pfx_nvme_probe (drivers/nvme/host/pci.c:3530) nvme
kern :warn : [ 228.446532] [ T1839] local_pci_probe (drivers/pci/pci-driver.c:324)
kern :warn : [ 228.451561] [ T1839] local_pci_probe_callback (drivers/pci/pci-driver.c:352 (discriminator 1))
kern :warn : [ 228.457464] [ T1839] process_one_work (arch/x86/include/asm/jump_label.h:37 include/trace/events/workqueue.h:110 kernel/workqueue.c:3280)
kern :warn : [ 228.462789] [ T1839] ? __pfx_process_one_work (kernel/workqueue.c:3177)
kern :warn : [ 228.468694] [ T1839] ? __list_add_valid_or_report (lib/list_debug.c:32 (discriminator 1))
kern :warn : [ 228.474999] [ T1839] ? __pfx_local_pci_probe_callback (drivers/pci/pci-driver.c:349)
kern :warn : [ 228.481681] [ T1839] worker_thread (kernel/workqueue.c:3352 (discriminator 2) kernel/workqueue.c:3439 (discriminator 2))
kern :warn : [ 228.486713] [ T1839] ? __kthread_parkme (kernel/kthread.c:303 (discriminator 1))
kern :warn : [ 228.492136] [ T1839] ? __pfx_worker_thread (kernel/workqueue.c:3385)
kern :warn : [ 228.497749] [ T1839] kthread (kernel/kthread.c:467)
kern :warn : [ 228.502188] [ T1839] ? kthread (kernel/kthread.c:443 (discriminator 1))
kern :warn : [ 228.506714] [ T1839] ? __pfx_kthread (kernel/kthread.c:412)
kern :warn : [ 228.511743] [ T1839] ret_from_fork (arch/x86/kernel/process.c:164)
kern :warn : [ 228.516771] [ T1839] ? __pfx_ret_from_fork (arch/x86/kernel/process.c:153)
kern :warn : [ 228.522398] [ T1839] ? __switch_to (include/linux/thread_info.h:142 (discriminator 2) arch/x86/kernel/process.h:17 (discriminator 2) arch/x86/kernel/process_64.c:676 (discriminator 2))
kern :warn : [ 228.527456] [ T1839] ? __pfx_kthread (kernel/kthread.c:412)
kern :warn : [ 228.532497] [ T1839] ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
kern :warn : [ 228.537734] [ T1839] </TASK>
kern :warn : [ 228.541003] [ T1839] irq event stamp: 215957
kern :warn : [ 228.545733] [ T1839] hardirqs last enabled at (215971): __up_console_sem (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/printk/printk.c:347)
kern :warn : [ 228.556322] [ T1839] hardirqs last disabled at (215984): __up_console_sem (kernel/printk/printk.c:345 (discriminator 3))
kern :warn : [ 228.566915] [ T1839] softirqs last enabled at (215908): handle_softirqs (kernel/softirq.c:469 (discriminator 2) kernel/softirq.c:650 (discriminator 2))
kern :warn : [ 228.577608] [ T1839] softirqs last disabled at (215903): __irq_exit_rcu (kernel/softirq.c:657 kernel/softirq.c:496 kernel/softirq.c:723)
kern :warn : [ 228.588198] [ T1839] ---[ end trace 0000000000000000 ]---
kern :info : [ 228.608877] [ T2296] i40e: Intel(R) Ethernet Connection XL710 Network Driver
kern :info : [ 228.616739] [ T2296] i40e: Copyright (c) 2013 - 2019 Intel Corporation.
kern :info : [ 228.706584] [ T1839] nvme nvme0: 128/0/0 default/read/poll queues
kern :info : [ 228.737888] [ T1644] nvme nvme0: Ignoring bogus Namespace Identifiers
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260331/202603312126.697741fb-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Tue, Mar 24 2026 at 09:47, Guixin Liu wrote:
> After the upper-layer driver removes the device and before the next
> probe, events such as firmware updates may increase the number of
> interrupts supported by the device. However, the irq_domain still
> retains the old hwsize, which causes subsequent interrupt allocation
> failures. Update hwsize during MSI-X device domain setup to fix this
> issue.
When a device is removed then the corresponding struct device is torn
down, which implies that the device domain is freed as well. So how can
this end up with the old state on the next probe?
Thanks,
tglx
On 2026/3/24 21:59, Thomas Gleixner wrote:
> On Tue, Mar 24 2026 at 09:47, Guixin Liu wrote:
>> After the upper-layer driver removes the device and before the next
>> probe, events such as firmware updates may increase the number of
>> interrupts supported by the device. However, the irq_domain still
>> retains the old hwsize, which causes subsequent interrupt allocation
>> failures. Update hwsize during MSI-X device domain setup to fix this
>> issue.
> When a device is removed then the corresponding struct device is torn
> down, which implies that the device domain is freed as well. So how can
> this end up with the old state on the next probe?
>
> Thanks,
>
> tglx
Hi,
My description was a bit ambiguous. The msi_device_data_release()
path to remove the irq_domain is only triggered when the device is
removed at the PCI layer. If only the upper-layer driver unbinds,
this path will not be reached.
Best Regards,
Guixin Liu
On Wed, Mar 25 2026 at 09:34, Guixin Liu wrote:
> On 2026/3/24 21:59, Thomas Gleixner wrote:
>> On Tue, Mar 24 2026 at 09:47, Guixin Liu wrote:
>>> After the upper-layer driver removes the device and before the next
>>> probe, events such as firmware updates may increase the number of
>>> interrupts supported by the device. However, the irq_domain still
>>> retains the old hwsize, which causes subsequent interrupt allocation
>>> failures. Update hwsize during MSI-X device domain setup to fix this
>>> issue.
>> When a device is removed then the corresponding struct device is torn
>> down, which implies that the device domain is freed as well. So how can
>> this end up with the old state on the next probe?
>
> Hi, My description was a bit ambiguous. The msi_device_data_release()
> path to remove the irq_domain is only triggered when the device is
> removed at the PCI layer. If only the upper-layer driver unbinds,
> this path will not be reached.
What's an upper-layer driver? Please be precise.
I assume you are talking about the device driver itself. Right, the
unbind of the driver won't remove the domain. But there is no real good
reason for keeping it around at that point.
So the straight forward solution is to free the MSI domain when the
driver shuts down and tears the MSI interrupts down.
Thanks,
tglx
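The lifetime difference discussed above (a driver unbind keeps the MSI domain, a PCI-level removal frees it) can be modelled in plain C. This is a userspace sketch only, not kernel code: struct model_domain and every model_* function are hypothetical names, and the vector counts used with them are made up for illustration. The setup function copies only the "reuse an existing MSI-X domain" shape of the unpatched pci_setup_msix_device_domain() shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-device MSI-X irq domain. */
struct model_domain {
	bool present;		/* a domain currently exists for the device */
	unsigned int hwsize;	/* vector count captured when it was created */
};

/*
 * Same shape as the unpatched setup path: an existing MSI-X domain is
 * reused as-is, so a new hwsize passed by the caller is ignored.
 */
static void model_setup_msix_domain(struct model_domain *d, unsigned int hwsize)
{
	assert(d != NULL);
	if (d->present)
		return;
	d->present = true;
	d->hwsize = hwsize;
}

/* Driver unbind: the interrupts are freed, but the domain stays around. */
static void model_driver_unbind(struct model_domain *d)
{
	assert(d != NULL);
}

/* PCI-level removal: the device data is released and the domain with it. */
static void model_pci_remove(struct model_domain *d)
{
	assert(d != NULL);
	d->present = false;
	d->hwsize = 0;
}

/* An allocation only succeeds if it fits into the recorded hwsize. */
static bool model_alloc_vectors(const struct model_domain *d, unsigned int nvec)
{
	assert(d != NULL);
	return d->present && nvec <= d->hwsize;
}
```

In this model, a setup with 64 vectors followed by model_driver_unbind() and a second setup with a larger count leaves hwsize at 64, so the larger allocation fails. Going through model_pci_remove() first makes the second setup record the new size.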
On 2026/3/25 15:42, Thomas Gleixner wrote:
> On Wed, Mar 25 2026 at 09:34, Guixin Liu wrote:
>> On 2026/3/24 21:59, Thomas Gleixner wrote:
>>> On Tue, Mar 24 2026 at 09:47, Guixin Liu wrote:
>>>> After the upper-layer driver removes the device and before the next
>>>> probe, events such as firmware updates may increase the number of
>>>> interrupts supported by the device. However, the irq_domain still
>>>> retains the old hwsize, which causes subsequent interrupt allocation
>>>> failures. Update hwsize during MSI-X device domain setup to fix this
>>>> issue.
>>> When a device is removed then the corresponding struct device is torn
>>> down, which implies that the device domain is freed as well. So how can
>>> this end up with the old state on the next probe?
>> Hi, My description was a bit ambiguous. The msi_device_data_release()
>> path to remove the irq_domain is only triggered when the device is
>> removed at the PCI layer. If only the upper-layer driver unbinds,
>> this path will not be reached.
> What's an upper-layer driver? Please be precise.
>
> I assume you are talking about the device driver itself.
Sorry, my description was not very accurate. Yes, I mean the device
driver.
> Right, the
> unbind of the driver won't remove the domain. But there is no real good
> reason for keeping it around at that point.
>
> So the straight forward solution is to free the MSI domain when the
> driver shuts down and tears the MSI interrupts down.
Yes, I had also considered this before. I will change the approach to
this and send another patch, thanks.
Best Regards,
Guixin Liu
>
> Thanks,
>
> tglx
On Wed, Mar 25 2026 at 16:40, Guixin Liu wrote:
> On 2026/3/25 15:42, Thomas Gleixner wrote:
>> So the straight forward solution is to free the MSI domain when the
>> driver shuts down and tears the MSI interrupts down.
> Yes, I had also considered this aspect before, I will change the scheme
> to this, and send another patch, thanks.
Actually none of that is required. When you update the firmware then
just let the PCI core rescan the device. That makes way more sense as
the new firmware might change other config entries as well not only the
MSI ones.
Thanks,
tglx
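For reference, removing a device and letting the PCI core rescan it can be triggered from userspace through the sysfs files devices/<BDF>/remove and rescan under /sys/bus/pci. The following is a minimal sketch under stated assumptions: the helper names write_one() and pci_remove_and_rescan() are hypothetical, the sysfs root is a parameter only so the helper can be pointed at a scratch directory, and a real run needs root privileges and a device address chosen by the operator.

```c
#include <assert.h>
#include <stdio.h>

/* Write "1" to a sysfs-style control file. Returns 0 on success, -1 on error. */
static int write_one(const char *path)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs("1", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Remove one PCI function, then ask the PCI core to rescan all buses.
 * sysfs_root is normally "/sys"; bdf is a domain:bus:device.function
 * string. Returns 0 on success, -1 on any error.
 */
static int pci_remove_and_rescan(const char *sysfs_root, const char *bdf)
{
	char path[512];
	int n;

	if (!sysfs_root || !bdf || !*bdf)
		return -1;

	n = snprintf(path, sizeof(path), "%s/bus/pci/devices/%s/remove",
		     sysfs_root, bdf);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	if (write_one(path) != 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/bus/pci/rescan", sysfs_root);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return write_one(path);
}
```

A caller would invoke pci_remove_and_rescan("/sys", bdf) after the firmware update has completed, so that the re-enumerated device gets fresh state for its whole config space rather than only for MSI-X.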
On 2026/3/25 21:58, Thomas Gleixner wrote:
> On Wed, Mar 25 2026 at 16:40, Guixin Liu wrote:
>> On 2026/3/25 15:42, Thomas Gleixner wrote:
>>> So the straight forward solution is to free the MSI domain when the
>>> driver shuts down and tears the MSI interrupts down.
>> Yes, I had also considered this aspect before, I will change the scheme
>> to this, and send another patch, thanks.
> Actually none of that is required. When you update the firmware then
> just let the PCI core rescan the device. That makes way more sense as
> the new firmware might change other config entries as well not only the
> MSI ones.
>
> Thanks,
>
> tglx
You are right, besides MSI, other attributes may also change. Removing
and then rescanning the PCI device is the best approach.
Best Regards,
Guixin Liu