[v1] Update MSI-X irq domain hwsize

[PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize

Posted by Guixin Liu 1 week, 3 days ago

After the upper-layer driver removes the device and before the next
probe, events such as firmware updates may increase the number of
interrupts supported by the device. However, the irq_domain still
retains the old hwsize, which causes subsequent interrupt allocation
failures. Update hwsize during MSI-X device domain setup to fix this
issue.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
---
 drivers/pci/msi/irqdomain.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
index 6e65f0f44112..485bfba059cc 100644
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -274,8 +274,11 @@ bool pci_setup_msix_device_domain(struct pci_dev *pdev, unsigned int hwsize)
 	if (WARN_ON_ONCE(pdev->msi_enabled))
 		return false;
 
-	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
+	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX)) {
+		if (msi_domain_update_hwsize(&pdev->dev, MSI_DEFAULT_DOMAIN, hwsize))
+			pr_warn("too big MSI-X hwsize:%u\n", hwsize);
 		return true;
+	}
 	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
 		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
 
-- 
2.32.0.3.g01195cf9f
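To make the scenario in the commit message concrete, here is a user-space model of the reuse-and-refresh logic in the diff. It is a sketch under stated assumptions, not kernel code. All names (model_domain, model_dev, MODEL_MAX_HWSIZE, model_*) are hypothetical. The semantics of msi_domain_update_hwsize() come from patch 1/2, which is not shown here, so two things are assumptions: the size limit, and the rule that indices at or above hwsize cannot be allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-device MSI-X irq domain. */
struct model_domain {
	unsigned int hwsize;		/* MSI-X vectors the domain can index */
};

/* Hypothetical stand-in for struct pci_dev; the domain outlives driver unbind. */
struct model_dev {
	struct model_domain *msix_domain;
};

/* Invented upper bound; the real limit used by patch 1/2 is not shown here. */
#define MODEL_MAX_HWSIZE 2048u

/* Stand-in for msi_domain_update_hwsize(): nonzero return means "rejected". */
static int model_update_hwsize(struct model_domain *d, unsigned int hwsize)
{
	assert(d != NULL);
	if (hwsize > MODEL_MAX_HWSIZE)
		return -1;
	d->hwsize = hwsize;
	return 0;
}

/* Same shape as the patched pci_setup_msix_device_domain(). */
static bool model_setup_msix_domain(struct model_dev *dev, unsigned int hwsize)
{
	if (dev->msix_domain) {
		/* Domain left over from an earlier bind: refresh its size. */
		if (model_update_hwsize(dev->msix_domain, hwsize))
			fprintf(stderr, "too big MSI-X hwsize:%u\n", hwsize);
		return true;
	}
	dev->msix_domain = malloc(sizeof(*dev->msix_domain));
	if (!dev->msix_domain)
		return false;
	dev->msix_domain->hwsize = hwsize;
	return true;
}

/* Assumed rule: allocating MSI-X index idx fails once idx >= hwsize. */
static bool model_alloc_irq(const struct model_dev *dev, unsigned int idx)
{
	return dev->msix_domain && idx < dev->msix_domain->hwsize;
}
```

In the model, the first setup with 32 vectors creates the domain and index 40 cannot be allocated. A later setup with 64 vectors reuses the domain. Without the refresh branch, the domain would keep hwsize 32, and index 40 would still fail. That is the allocation failure the commit message describes.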

Re: [PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize

Posted by kernel test robot 3 days, 3 hours ago


Hello,

kernel test robot noticed "RIP:msi_domain_update_hwsize" on:

commit: 8812ea3b58e360d64259b4b79523cec52ff6a014 ("[PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize")
url: https://github.com/intel-lab-lkp/linux/commits/Guixin-Liu/genirq-msi-Introduce-update-hwsize-helper/20260324-104239
base: https://git.kernel.org/cgit/linux/kernel/git/pci/pci.git next
patch link: https://lore.kernel.org/all/20260324014754.4973-3-kanie@linux.alibaba.com/
patch subject: [PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize

in testcase: perf-sanity-tests
version: 
with following parameters:

	perf_compiler: gcc
	group: group-01



config: x86_64-rhel-9.4-bpf
compiler: gcc-14
test machine: 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202603312126.697741fb-lkp@intel.com



kern  :warn  : [  227.562210] [   T1839] ------------[ cut here ]------------
kern  :warn  : [  227.568011] [   T1929] idxd 0000:74:02.0: No in-kernel DMA with PASID. -1
kern  :warn  : [  227.573941] [   T1839] WARNING: kernel/irq/msi.c:605 at msi_domain_update_hwsize+0xd2/0x130, CPU#0: 2/1839
kern  :warn  : [  227.592907] [   T1839] Modules linked in: intel_cstate pmt_discovery drm_client_lib cxl_acpi intel_sdsi pmt_class intel_qat mei_me idxd(+) drm_shmem_helper soundcore libie nvme(+) cxl_port i2c_i801 intel_uncore pcspkr acpi_power_meter libie_adminq isst_if_common nvme_core drm_kms_helper mei idxd_bus intel_vsec crc8 i2c_smbus i2c_ismt wmi ipmi_si acpi_ipmi cxl_core ipmi_devintf einj ipmi_msghandler pinctrl_emmitsburg acpi_pad joydev pfr_update pfr_telemetry binfmt_misc drm nfnetlink ip_tables x_tables sch_fq_codel
kern  :info  : [  227.597293] [   T1929] idxd 0000:74:02.0: Intel(R) Accelerator Device (v100)
kern  :warn  : [  227.643806] [   T1839] CPU: 0 UID: 0 PID: 1839 Comm: kworker/0:2 Tainted: G S                  7.0.0-rc1-00098-g8812ea3b58e3 #1 PREEMPT(full)
kern  :info  : [  227.652130] [   T1868] idxd 0000:e7:01.0: enabling device (0144 -> 0146)
kern  :warn  : [  227.665534] [   T1839] Tainted: [S]=CPU_OUT_OF_SPEC
kern  :warn  : [  227.665537] [   T1839] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.9532.D02.2309201845 09/20/2023
kern  :warn  : [  227.665540] [   T1839] Workqueue: sync_wq local_pci_probe_callback
kern  :warn  : [  227.673200] [   T1868] idxd 0000:e7:01.0: No in-kernel DMA with PASID. -1
kern  :warn  : [  227.678012] [   T1839] RIP: 0010:msi_domain_update_hwsize (kernel/irq/msi.c:605 (discriminator 7) kernel/irq/msi.c:1147 (discriminator 7))
kern  :warn  : [  227.678017] [   T1839] Code: cc 48 8d bd 80 03 00 00 e8 8b 81 4c 00 48 8b 85 80 03 00 00 be ff ff ff ff 48 8d 78 70 e8 76 9d 8f 01 85 c0 0f 85 6b ff ff ff <0f> 0b e9 64 ff ff ff 0f 0b 5b b8 ed ff ff ff 5d 41 5c c3 cc cc cc
All code
========
   0:	cc                   	int3
   1:	48 8d bd 80 03 00 00 	lea    0x380(%rbp),%rdi
   8:	e8 8b 81 4c 00       	call   0x4c8198
   d:	48 8b 85 80 03 00 00 	mov    0x380(%rbp),%rax
  14:	be ff ff ff ff       	mov    $0xffffffff,%esi
  19:	48 8d 78 70          	lea    0x70(%rax),%rdi
  1d:	e8 76 9d 8f 01       	call   0x18f9d98
  22:	85 c0                	test   %eax,%eax
  24:	0f 85 6b ff ff ff    	jne    0xffffffffffffff95
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	e9 64 ff ff ff       	jmp    0xffffffffffffff95
  31:	0f 0b                	ud2
  33:	5b                   	pop    %rbx
  34:	b8 ed ff ff ff       	mov    $0xffffffed,%eax
  39:	5d                   	pop    %rbp
  3a:	41 5c                	pop    %r12
  3c:	c3                   	ret
  3d:	cc                   	int3
  3e:	cc                   	int3
  3f:	cc                   	int3

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	e9 64 ff ff ff       	jmp    0xffffffffffffff6b
   7:	0f 0b                	ud2
   9:	5b                   	pop    %rbx
   a:	b8 ed ff ff ff       	mov    $0xffffffed,%eax
   f:	5d                   	pop    %rbp
  10:	41 5c                	pop    %r12
  12:	c3                   	ret
  13:	cc                   	int3
  14:	cc                   	int3
  15:	cc                   	int3
kern  :warn  : [  227.678022] [   T1839] RSP: 0018:ff11000136797920 EFLAGS: 00010246
kern  :info  : [  227.692041] [    T407] nvme nvme1: pci function 0000:a8:00.0

kern  :warn  : [  227.697311] [   T1839] RAX: 0000000000000000 RBX: 0000000000000081 RCX: 0000000000000001
kern  :info  : [  227.720115] [   T1868] idxd 0000:e7:01.0: Intel(R) Accelerator Device (v100)
kern  :warn  : [  227.733477] [   T1839] RDX: 0000000000000001 RSI: ff1100013bac6898 RDI: ff11000136789030
kern  :warn  : [  227.733482] [   T1839] RBP: ff110030888620d0 R08: 0000000000000001 R09: ffe21c0026cf2f0c
kern  :info  : [  227.734061] [   T2446] ipmi_ssif: IPMI SSIF Interface driver
kern  :info  : [  227.741722] [   T1845] idxd 0000:f1:02.0: enabling device (0140 -> 0142)
kern  :warn  : [  227.742240] [    T407] ------------[ cut here ]------------
kern  :warn  : [  227.742244] [    T407] WARNING: kernel/irq/msi.c:605 at msi_domain_update_hwsize+0xd2/0x130, CPU#64: 0/407
kern  :warn  : [  227.742254] [    T407] Modules linked in: qat_4xxx(+) intel_cstate ipmi_ssif pmt_discovery drm_client_lib cxl_acpi intel_sdsi pmt_class intel_qat mei_me idxd(+) drm_shmem_helper soundcore libie nvme(+) cxl_port i2c_i801 intel_uncore pcspkr acpi_power_meter libie_adminq isst_if_common nvme_core drm_kms_helper mei idxd_bus intel_vsec crc8 i2c_smbus i2c_ismt wmi ipmi_si acpi_ipmi cxl_core ipmi_devintf einj ipmi_msghandler pinctrl_emmitsburg acpi_pad joydev pfr_update pfr_telemetry binfmt_misc drm nfnetlink ip_tables x_tables sch_fq_codel
kern  :warn  : [  227.742324] [    T407] CPU: 64 UID: 0 PID: 407 Comm: kworker/64:0 Tainted: G S                  7.0.0-rc1-00098-g8812ea3b58e3 #1 PREEMPT(full)
kern  :warn  : [  227.742329] [    T407] Tainted: [S]=CPU_OUT_OF_SPEC
kern  :warn  : [  227.742331] [    T407] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.9532.D02.2309201845 09/20/2023
kern  :warn  : [  227.742333] [    T407] Workqueue: sync_wq local_pci_probe_callback
kern  :warn  : [  227.742339] [    T407] RIP: 0010:msi_domain_update_hwsize (kernel/irq/msi.c:605 (discriminator 7) kernel/irq/msi.c:1147 (discriminator 7))
kern  :warn  : [  227.742343] [    T407] Code: cc 48 8d bd 80 03 00 00 e8 8b 81 4c 00 48 8b 85 80 03 00 00 be ff ff ff ff 48 8d 78 70 e8 76 9d 8f 01 85 c0 0f 85 6b ff ff ff <0f> 0b e9 64 ff ff ff 0f 0b 5b b8 ed ff ff ff 5d 41 5c c3 cc cc cc
kern  :warn  : [  227.742347] [    T407] RSP: 0018:ff11002084c7f920 EFLAGS: 00010246
kern  :warn  : [  227.742351] [    T407] RAX: 0000000000000000 RBX: 0000000000000081 RCX: 0000000000000001
kern  :warn  : [  227.742353] [    T407] RDX: 0000000000000001 RSI: ff110020dbdb8898 RDI: ff11002084c73d30
kern  :warn  : [  227.742356] [    T407] RBP: ff1100209020a0d0 R08: 0000000000000001 R09: ffe21c041098ff0c
kern  :warn  : [  227.742358] [    T407] R10: ff11002084c7f867 R11: 0000000000000000 R12: 0000000000000000
kern  :warn  : [  227.742360] [    T407] R13: 0000000000000081 R14: ff1100209020a0d0 R15: 0000000000000000
kern  :warn  : [  227.742363] [    T407] FS:  0000000000000000(0000) GS:ff11002eb3b3d000(0000) knlGS:0000000000000000
kern  :warn  : [  227.742365] [    T407] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern  :warn  : [  227.742368] [    T407] CR2: 00005577e39810b8 CR3: 0000004077672004 CR4: 0000000000771ef0
kern  :warn  : [  227.742371] [    T407] PKRU: 55555554
kern  :warn  : [  227.742373] [    T407] Call Trace:
kern  :warn  : [  227.742375] [    T407]  <TASK>
kern  :warn  : [  227.742391] [    T407]  pci_setup_msix_device_domain (drivers/pci/msi/irqdomain.c:278 (discriminator 1))
kern  :warn  : [  227.742399] [    T407]  __pci_enable_msix_range (drivers/pci/msi/msi.c:850 (discriminator 1))
kern  :warn  : [  227.742406] [    T407]  ? pci_free_irq_vectors (drivers/pci/msi/api.c:382)
kern  :warn  : [  227.742412] [    T407]  ? lock_release (kernel/locking/lockdep.c:470 (discriminator 4) kernel/locking/lockdep.c:5891 (discriminator 4) kernel/locking/lockdep.c:5875 (discriminator 4))
kern  :warn  : [  227.742422] [    T407]  pci_alloc_irq_vectors_affinity (drivers/pci/msi/api.c:270)
kern  :warn  : [  227.742428] [    T407]  ? __pfx_pci_alloc_irq_vectors_affinity (drivers/pci/msi/api.c:255)
kern  :warn  : [  227.742432] [    T407]  ? pci_free_msi_irqs (drivers/pci/msi/msi.c:929)
kern  :warn  : [  227.742443] [    T407] nvme_setup_io_queues (drivers/nvme/host/pci.c:2746 drivers/nvme/host/pci.c:2832) nvme
kern  :warn  : [  227.742466] [    T407]  ? __pfx_nvme_setup_io_queues (drivers/nvme/host/pci.c:2763) nvme
kern  :warn  : [  227.742482] [    T407]  ? __pfx_nvme_calc_irq_sets (drivers/nvme/host/pci.c:2678) nvme
kern  :warn  : [  227.742495] [    T407]  ? preempt_count_sub (kernel/sched/core.c:5782 kernel/sched/core.c:5778 kernel/sched/core.c:5800)
kern  :warn  : [  227.742501] [    T407]  ? nvme_setup_host_mem (drivers/nvme/host/pci.c:2519) nvme
kern  :warn  : [  227.742514] [    T407]  ? _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:179 (discriminator 3) kernel/locking/spinlock.c:194 (discriminator 3))
kern  :warn  : [  227.742521] [    T407] nvme_probe.cold (drivers/nvme/host/pci.c:3589) nvme
kern  :warn  : [  227.742537] [    T407]  ? __pfx_nvme_probe (drivers/nvme/host/pci.c:3530) nvme
kern  :warn  : [  227.742552] [    T407]  local_pci_probe (drivers/pci/pci-driver.c:324)
kern  :warn  : [  227.742557] [    T407]  local_pci_probe_callback (drivers/pci/pci-driver.c:352 (discriminator 1))
kern  :warn  : [  227.742561] [    T407]  process_one_work (arch/x86/include/asm/jump_label.h:37 include/trace/events/workqueue.h:110 kernel/workqueue.c:3280)
kern  :warn  : [  227.742571] [    T407]  ? __pfx_process_one_work (kernel/workqueue.c:3177)
kern  :warn  : [  227.742575] [    T407]  ? lock_acquire (include/trace/events/lock.h:24 (discriminator 15) include/trace/events/lock.h:24 (discriminator 15) kernel/locking/lockdep.c:5831 (discriminator 15))
kern  :warn  : [  227.742579] [    T407]  ? __list_add_valid_or_report (lib/list_debug.c:32 (discriminator 1))
kern  :warn  : [  227.742586] [    T407]  ? __pfx_local_pci_probe_callback (drivers/pci/pci-driver.c:349)
kern  :warn  : [  227.742591] [    T407]  worker_thread (kernel/workqueue.c:3352 (discriminator 2) kernel/workqueue.c:3439 (discriminator 2))
kern  :warn  : [  227.742597] [    T407]  ? __kthread_parkme (kernel/kthread.c:303 (discriminator 1))
kern  :warn  : [  227.742602] [    T407]  ? __pfx_worker_thread (kernel/workqueue.c:3385)
kern  :warn  : [  227.742606] [    T407]  kthread (kernel/kthread.c:467)
kern  :warn  : [  227.742609] [    T407]  ? kthread (kernel/kthread.c:443 (discriminator 1))
kern  :warn  : [  227.742612] [    T407]  ? __pfx_kthread (kernel/kthread.c:412)
kern  :warn  : [  227.742617] [    T407]  ret_from_fork (arch/x86/kernel/process.c:164)
kern  :warn  : [  227.742623] [    T407]  ? __pfx_ret_from_fork (arch/x86/kernel/process.c:153)
kern  :warn  : [  227.742629] [    T407]  ? __switch_to (include/linux/thread_info.h:142 (discriminator 2) arch/x86/kernel/process.h:17 (discriminator 2) arch/x86/kernel/process_64.c:676 (discriminator 2))
kern  :warn  : [  227.742634] [    T407]  ? __pfx_kthread (kernel/kthread.c:412)
kern  :warn  : [  227.742639] [    T407]  ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
kern  :warn  : [  227.742651] [    T407]  </TASK>
kern  :warn  : [  227.742653] [    T407] irq event stamp: 2041
kern  :warn  : [  227.742654] [    T407] hardirqs last  enabled at (2047): vprintk_emit (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/printk/printk.c:2021 kernel/printk/printk.c:2478)
kern  :warn  : [  227.742659] [    T407] hardirqs last disabled at (2052): vprintk_emit (kernel/printk/printk.c:2000 (discriminator 3) kernel/printk/printk.c:2478 (discriminator 3))
kern  :warn  : [  227.742662] [    T407] softirqs last  enabled at (1836): handle_softirqs (kernel/softirq.c:469 (discriminator 2) kernel/softirq.c:650 (discriminator 2))
kern  :warn  : [  227.742668] [    T407] softirqs last disabled at (1831): __irq_exit_rcu (kernel/softirq.c:657 kernel/softirq.c:496 kernel/softirq.c:723)
kern  :warn  : [  227.742671] [    T407] ---[ end trace 0000000000000000 ]---
kern  :warn  : [  227.746239] [   T1839] R10: ff11000136797867 R11: 0000000000000000 R12: 0000000000000000
kern  :warn  : [  227.746243] [   T1839] R13: 0000000000000081 R14: ff110030888620d0 R15: 0000000000000000
kern  :warn  : [  227.746246] [   T1839] FS:  0000000000000000(0000) GS:ff11000eb5d3d000(0000) knlGS:0000000000000000
kern  :warn  : [  227.748790] [   T1845] idxd 0000:f1:02.0: No in-kernel DMA with PASID. -1
kern  :warn  : [  227.757536] [   T1839] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern  :warn  : [  227.757540] [   T1839] CR2: 00007fd8fe021c70 CR3: 0000004077672004 CR4: 0000000000771ef0
kern  :warn  : [  227.757543] [   T1839] PKRU: 55555554
kern  :warn  : [  227.757545] [   T1839] Call Trace:
kern  :info  : [  227.780919] [   T1845] idxd 0000:f1:02.0: Intel(R) Accelerator Device (v100)
kern  :warn  : [  227.782837] [   T1839]  <TASK>
kern  :info  : [  227.796542] [    T407] nvme nvme1: 128/0/0 default/read/poll queues
kern  :warn  : [  227.802183] [   T1839]  pci_setup_msix_device_domain (drivers/pci/msi/irqdomain.c:278 (discriminator 1))
kern  :info  : [  227.835748] [   T1670] nvme nvme1: Ignoring bogus Namespace Identifiers
kern  :warn  : [  227.866727] [   T1839]  __pci_enable_msix_range (drivers/pci/msi/msi.c:850 (discriminator 1))
kern  :info  : [  227.962311] [   T1670]  nvme1n1: p1 p2 p3 p4
kern  :warn  : [  227.967364] [   T1839]  ? pci_free_irq_vectors (drivers/pci/msi/api.c:382)
kern  :warn  : [  227.967372] [   T1839]  ? lock_release (kernel/locking/lockdep.c:470 (discriminator 4) kernel/locking/lockdep.c:5891 (discriminator 4) kernel/locking/lockdep.c:5875 (discriminator 4))
kern  :warn  : [  228.376774] [   T1839]  pci_alloc_irq_vectors_affinity (drivers/pci/msi/api.c:270)
kern  :warn  : [  228.383465] [   T1839]  ? __pfx_pci_alloc_irq_vectors_affinity (drivers/pci/msi/api.c:255)
kern  :warn  : [  228.390737] [   T1839]  ? pci_free_msi_irqs (drivers/pci/msi/msi.c:929)
kern  :warn  : [  228.396166] [   T1839] nvme_setup_io_queues (drivers/nvme/host/pci.c:2746 drivers/nvme/host/pci.c:2832) nvme
kern  :warn  : [  228.402590] [   T1839]  ? __pfx_nvme_setup_io_queues (drivers/nvme/host/pci.c:2763) nvme
kern  :warn  : [  228.409578] [   T1839]  ? __pfx_nvme_calc_irq_sets (drivers/nvme/host/pci.c:2678) nvme
kern  :warn  : [  228.416373] [   T1839]  ? preempt_count_sub (kernel/sched/core.c:5782 kernel/sched/core.c:5778 kernel/sched/core.c:5800)
kern  :warn  : [  228.421804] [   T1839]  ? nvme_setup_host_mem (drivers/nvme/host/pci.c:2519) nvme
kern  :warn  : [  228.428205] [   T1839]  ? _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:179 (discriminator 3) kernel/locking/spinlock.c:194 (discriminator 3))
kern  :warn  : [  228.434597] [   T1839] nvme_probe.cold (drivers/nvme/host/pci.c:3589) nvme
kern  :warn  : [  228.440519] [   T1839]  ? __pfx_nvme_probe (drivers/nvme/host/pci.c:3530) nvme
kern  :warn  : [  228.446532] [   T1839]  local_pci_probe (drivers/pci/pci-driver.c:324)
kern  :warn  : [  228.451561] [   T1839]  local_pci_probe_callback (drivers/pci/pci-driver.c:352 (discriminator 1))
kern  :warn  : [  228.457464] [   T1839]  process_one_work (arch/x86/include/asm/jump_label.h:37 include/trace/events/workqueue.h:110 kernel/workqueue.c:3280)
kern  :warn  : [  228.462789] [   T1839]  ? __pfx_process_one_work (kernel/workqueue.c:3177)
kern  :warn  : [  228.468694] [   T1839]  ? __list_add_valid_or_report (lib/list_debug.c:32 (discriminator 1))
kern  :warn  : [  228.474999] [   T1839]  ? __pfx_local_pci_probe_callback (drivers/pci/pci-driver.c:349)
kern  :warn  : [  228.481681] [   T1839]  worker_thread (kernel/workqueue.c:3352 (discriminator 2) kernel/workqueue.c:3439 (discriminator 2))
kern  :warn  : [  228.486713] [   T1839]  ? __kthread_parkme (kernel/kthread.c:303 (discriminator 1))
kern  :warn  : [  228.492136] [   T1839]  ? __pfx_worker_thread (kernel/workqueue.c:3385)
kern  :warn  : [  228.497749] [   T1839]  kthread (kernel/kthread.c:467)
kern  :warn  : [  228.502188] [   T1839]  ? kthread (kernel/kthread.c:443 (discriminator 1))
kern  :warn  : [  228.506714] [   T1839]  ? __pfx_kthread (kernel/kthread.c:412)
kern  :warn  : [  228.511743] [   T1839]  ret_from_fork (arch/x86/kernel/process.c:164)
kern  :warn  : [  228.516771] [   T1839]  ? __pfx_ret_from_fork (arch/x86/kernel/process.c:153)
kern  :warn  : [  228.522398] [   T1839]  ? __switch_to (include/linux/thread_info.h:142 (discriminator 2) arch/x86/kernel/process.h:17 (discriminator 2) arch/x86/kernel/process_64.c:676 (discriminator 2))
kern  :warn  : [  228.527456] [   T1839]  ? __pfx_kthread (kernel/kthread.c:412)
kern  :warn  : [  228.532497] [   T1839]  ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
kern  :warn  : [  228.537734] [   T1839]  </TASK>
kern  :warn  : [  228.541003] [   T1839] irq event stamp: 215957
kern  :warn  : [  228.545733] [   T1839] hardirqs last  enabled at (215971): __up_console_sem (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/printk/printk.c:347)
kern  :warn  : [  228.556322] [   T1839] hardirqs last disabled at (215984): __up_console_sem (kernel/printk/printk.c:345 (discriminator 3))
kern  :warn  : [  228.566915] [   T1839] softirqs last  enabled at (215908): handle_softirqs (kernel/softirq.c:469 (discriminator 2) kernel/softirq.c:650 (discriminator 2))
kern  :warn  : [  228.577608] [   T1839] softirqs last disabled at (215903): __irq_exit_rcu (kernel/softirq.c:657 kernel/softirq.c:496 kernel/softirq.c:723)
kern  :warn  : [  228.588198] [   T1839] ---[ end trace 0000000000000000 ]---
kern  :info  : [  228.608877] [   T2296] i40e: Intel(R) Ethernet Connection XL710 Network Driver
kern  :info  : [  228.616739] [   T2296] i40e: Copyright (c) 2013 - 2019 Intel Corporation.
kern  :info  : [  228.706584] [   T1839] nvme nvme0: 128/0/0 default/read/poll queues
kern  :info  : [  228.737888] [   T1644] nvme nvme0: Ignoring bogus Namespace Identifiers


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260331/202603312126.697741fb-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Re: [PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize

Posted by Thomas Gleixner 1 week, 3 days ago

On Tue, Mar 24 2026 at 09:47, Guixin Liu wrote:
> After the upper-layer driver removes the device and before the next
> probe, events such as firmware updates may increase the number of
> interrupts supported by the device. However, the irq_domain still
> retains the old hwsize, which causes subsequent interrupt allocation
> failures. Update hwsize during MSI-X device domain setup to fix this
> issue.

When a device is removed then the corresponding struct device is torn
down, which implies that the device domain is freed as well. So how can
this end up with the old state on the next probe?

Thanks,

        tglx

Re: [PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize

Posted by Guixin Liu 1 week, 2 days ago


On 2026/3/24 21:59, Thomas Gleixner wrote:
> On Tue, Mar 24 2026 at 09:47, Guixin Liu wrote:
>> After the upper-layer driver removes the device and before the next
>> probe, events such as firmware updates may increase the number of
>> interrupts supported by the device. However, the irq_domain still
>> retains the old hwsize, which causes subsequent interrupt allocation
>> failures. Update hwsize during MSI-X device domain setup to fix this
>> issue.
> When a device is removed then the corresponding struct device is torn
> down, which implies that the device domain is freed as well. So how can
> this end up with the old state on the next probe?
>
> Thanks,
>
>          tglx
Hi, my description was a bit ambiguous. The msi_device_data_release()
path to remove the irq_domain is only triggered when the device is
removed at the PCI layer. If only the upper-layer driver unbinds,
this path will not be reached.

Best Regards,
Guixin Liu

Re: [PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize

Posted by Thomas Gleixner 1 week, 2 days ago

On Wed, Mar 25 2026 at 09:34, Guixin Liu wrote:
> On 2026/3/24 21:59, Thomas Gleixner wrote:
>> On Tue, Mar 24 2026 at 09:47, Guixin Liu wrote:
>>> After the upper-layer driver removes the device and before the next
>>> probe, events such as firmware updates may increase the number of
>>> interrupts supported by the device. However, the irq_domain still
>>> retains the old hwsize, which causes subsequent interrupt allocation
>>> failures. Update hwsize during MSI-X device domain setup to fix this
>>> issue.
>> When a device is removed then the corresponding struct device is torn
>> down, which implies that the device domain is freed as well. So how can
>> this end up with the old state on the next probe?
>
> Hi, my description was a bit ambiguous. The msi_device_data_release()
> path to remove the irq_domain is only triggered when the device is
> removed at the PCI layer. If only the upper-layer driver unbinds,
> this path will not be reached.

What's an upper-layer driver? Please be precise.

I assume you are talking about the device driver itself. Right, the
unbind of the driver won't remove the domain. But there is no real good
reason for keeping it around at that point.

So the straightforward solution is to free the MSI domain when the
driver shuts down and tears the MSI interrupts down.
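Here is a user-space model of the teardown ordering suggested above, contrasted with keeping the domain until the device itself is released. The structures and functions are hypothetical. They mimic only the lifetimes discussed in this thread: a domain is created on the first MSI-X enable and reused afterwards. They are not the kernel's data structures or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not kernel structures. */
struct model_domain {
	unsigned int hwsize;		/* fixed when the domain is created */
};

struct model_dev {
	struct model_domain *msix_domain;
	bool msix_enabled;
};

/* Enable MSI-X: create the domain only if none exists yet. */
static bool model_enable_msix(struct model_dev *dev, unsigned int table_size)
{
	if (!dev->msix_domain) {
		dev->msix_domain = malloc(sizeof(*dev->msix_domain));
		if (!dev->msix_domain)
			return false;
		dev->msix_domain->hwsize = table_size;
	}
	/* An existing domain is reused as-is, keeping its old hwsize. */
	dev->msix_enabled = true;
	return true;
}

/*
 * Tear down MSI-X. free_domain == false mimics keeping the domain until the
 * device itself is released; free_domain == true mimics dropping it together
 * with the interrupts, as suggested above.
 */
static void model_disable_msix(struct model_dev *dev, bool free_domain)
{
	assert(dev->msix_enabled);
	dev->msix_enabled = false;
	if (free_domain) {
		free(dev->msix_domain);
		dev->msix_domain = NULL;
	}
}
```

When `free_domain` is false, re-enabling after a simulated firmware update (table size 32 to 64) keeps the stale size of 32. When it is true, the next enable recreates the domain from the new table size. No separate update helper is needed in that case.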

Thanks,

        tglx

Re: [PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize

Posted by Guixin Liu 1 week, 2 days ago


On 2026/3/25 15:42, Thomas Gleixner wrote:
> On Wed, Mar 25 2026 at 09:34, Guixin Liu wrote:
>> On 2026/3/24 21:59, Thomas Gleixner wrote:
>>> On Tue, Mar 24 2026 at 09:47, Guixin Liu wrote:
>>>> After the upper-layer driver removes the device and before the next
>>>> probe, events such as firmware updates may increase the number of
>>>> interrupts supported by the device. However, the irq_domain still
>>>> retains the old hwsize, which causes subsequent interrupt allocation
>>>> failures. Update hwsize during MSI-X device domain setup to fix this
>>>> issue.
>>> When a device is removed then the corresponding struct device is torn
>>> down, which implies that the device domain is freed as well. So how can
>>> this end up with the old state on the next probe?
>> Hi, my description was a bit ambiguous. The msi_device_data_release()
>> path to remove the irq_domain is only triggered when the device is
>> removed at the PCI layer. If only the upper-layer driver unbinds,
>> this path will not be reached.
> What's an upper-layer driver? Please be precise.
>
> I assume you are talking about the device driver itself.
Sorry, my description was not very accurate. Yes, it's the device driver.
> Right, the
> unbind of the driver won't remove the domain. But there is no real good
> reason for keeping it around at that point.
>
> So the straightforward solution is to free the MSI domain when the
> driver shuts down and tears the MSI interrupts down.
Yes, I had also considered this approach before. I will change the scheme
to this and send another patch, thanks.

Best Regards,
Guixin Liu
>
> Thanks,
>
>          tglx
>

Re: [PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize

Posted by Thomas Gleixner 1 week, 2 days ago

On Wed, Mar 25 2026 at 16:40, Guixin Liu wrote:
> On 2026/3/25 15:42, Thomas Gleixner wrote:
>> So the straightforward solution is to free the MSI domain when the
>> driver shuts down and tears the MSI interrupts down.
> Yes, I had also considered this approach before. I will change the scheme
> to this and send another patch, thanks.

Actually none of that is required. When you update the firmware then
just let the PCI core rescan the device. That makes way more sense as
the new firmware might change other config entries as well, not only the
MSI ones.
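The remove-and-rescan cycle can be driven from user space through the PCI sysfs interface. Writing 1 to `/sys/bus/pci/devices/<BDF>/remove` detaches the function and unbinds its driver. Writing 1 to `/sys/bus/pci/rescan` re-enumerates the bus, so the device's config space (including the MSI-X table size) is read again and the driver probes from scratch. The C sketch below is a minimal illustration and assumes root privileges. The helper names and the `root` parameter are hypothetical; the parameter exists only so the path handling can be exercised without touching real hardware. The BDF `0000:a8:00.0` is simply the NVMe function from the log above, used as an example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<root>/bus/pci/devices/<bdf>/remove"; returns 0 on success. */
static int remove_attr_path(char *buf, size_t len, const char *root, const char *bdf)
{
	/* Reject a BDF containing a path separator. */
	if (strchr(bdf, '/') != NULL)
		return -1;
	int n = snprintf(buf, len, "%s/bus/pci/devices/%s/remove", root, bdf);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Build "<root>/bus/pci/rescan"; returns 0 on success. */
static int rescan_attr_path(char *buf, size_t len, const char *root)
{
	int n = snprintf(buf, len, "%s/bus/pci/rescan", root);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" to a sysfs attribute; write errors may only surface on fclose(). */
static int write_one(const char *path)
{
	FILE *f = fopen(path, "w");
	if (f == NULL)
		return -1;
	int err = fputs("1", f) < 0;
	if (fclose(f) != 0)
		err = 1;
	return err ? -1 : 0;
}

/* Remove the function, then let the PCI core rediscover it (root is normally "/sys"). */
static int pci_remove_and_rescan(const char *root, const char *bdf)
{
	char path[512];

	assert(root != NULL && bdf != NULL);
	if (remove_attr_path(path, sizeof(path), root, bdf) || write_one(path))
		return -1;
	if (rescan_attr_path(path, sizeof(path), root) || write_one(path))
		return -1;
	return 0;
}
```

On a real system this would be called as `pci_remove_and_rescan("/sys", "0000:a8:00.0")` after the firmware update. The rescanned device then starts from freshly read config space rather than from any state left by the previous binding.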

Thanks,

        tglx

Re: [PATCH 2/2] PCI/MSI: Update MSI-X irq domain hwsize

Posted by Guixin Liu 1 week, 1 day ago


On 2026/3/25 21:58, Thomas Gleixner wrote:
> On Wed, Mar 25 2026 at 16:40, Guixin Liu wrote:
>> On 2026/3/25 15:42, Thomas Gleixner wrote:
>>> So the straightforward solution is to free the MSI domain when the
>>> driver shuts down and tears the MSI interrupts down.
>> Yes, I had also considered this approach before. I will change the scheme
>> to this and send another patch, thanks.
> Actually none of that is required. When you update the firmware then
> just let the PCI core rescan the device. That makes way more sense as
> the new firmware might change other config entries as well, not only the
> MSI ones.
>
> Thanks,
>
>          tglx
You are right: besides MSI, other attributes may also change.
Removing and then rescanning the PCI device is the best approach.

Best Regards,
Guixin Liu