RE: [EXTERNAL] Re: [PATCH] x86/ioapic: Don't return 0 as valid virq

Thomas Gleixner posted 1 patch 2 years, 10 months ago
RE: [EXTERNAL] Re: [PATCH] x86/ioapic: Don't return 0 as valid virq
Posted by Thomas Gleixner 2 years, 10 months ago
On Tue, Mar 14 2023 at 10:23, Saurabh Singh Sengar wrote:
>> This should be added to your commit message: what guest VM is that and
>> why should the kernel support it.
>
> Guest VM is a linux VM running as child partition on Hyper-V. Hyper-v Linux
> documentation is in Documentation/virt/hyperv/.
>
> In commit I wanted to mention that any system which is not registering
> IO-APIC will have this issue. But I am fine to mention specifically
> about the issue I am facing.  As part of your next comment, I have
> explained the issue in detail if that is good, I can put that as
> commit message.
>> 
>> Why doesn't it need an IO-APIC and why does the current code need to be
>> changed just for your guest VM?
>
> For Hyper-V Virtual Machines, few platforms don't have any devices to be
> hooked to IO-APIC. Although it has Hyper-V based MSI over VMBus which
> assigns interrupts to PCIe devices. In such platforms IO-APIC is not
> registered which causes gsi_top value to remain at 0 and not get properly
> assigned. Moreover, due to the inability to disable CONFIG_X86_IO_APIC
> flag, the io-apic code still gets compiled. Thus, arch_dynirq_lower_bound
> function in io_apic.c decides the lower bound of irq numbers based on gsi_top.
>
> Later when PCIe-MSI attempts to allocate interrupts, it gets 0 as the first
> virq number because gsi_top is still 0. 0 being invalid virq is ignored by
> MSI irq domain and results allocation of the same PCIe MSI twice.
>
> 		CPU0		CPU1
> 0:		2			0		Hyper-V PCIe MSI 1073741824-edge
> 1:		69			0		Hyper-V PCIe MSI 1073741824-edge      nvme0q0
>
> To avoid this issue, if IO-APIC and gsi_top are not initialized, return the
> hint value passed as 'from' value to arch_dynirq_lower_bound instead of 0.
> This will also be identical to the behaviour of weak arch_dynirq_lower_bound
> function defined in kernel/softirq.c.

I find this mightly confusing. Something like this perhaps:

  Subject: x86/ioapic: Don't return 0 from arch_dynirq_lower_bound()

  arch_dynirq_lower_bound() is invoked by the core interrupt code to
  retrieve the lowest possible Linux interrupt number for dynamically
  allocated interrupts like MSI.

  The x86 implementation uses this to exclude the IO/APIC GSI space.
  This works correctly as long as there is an IO/APIC registered, but
  returns 0 if not. This has been observed in VMs where the BIOS does
  not advertise an IO/APIC.  

  0 is an invalid interrupt number except for the legacy timer interrupt
  on x86. The return value is unchecked in the core code, so it ends up
  to allocate interrupt number 0 which is subsequently considered to be
  invalid by the caller, e.g. the MSI allocation code.

  The function has already a check for 0 in the case that an IO/APIC is
  registered, but ioapic_dynirq_base is 0 in case of device tree setups.

  Consolidate this and zero check for both ioapic_dynirq_base and gsi_top,
  which is used in the case that no IO/APIC is registered.

And then make the code to look like the below, which makes it very
clear what this is about.

Thanks,

        tglx
---
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2477,17 +2477,22 @@ static int io_apic_get_redir_entries(int
 
 unsigned int arch_dynirq_lower_bound(unsigned int from)
 {
+	unsigned int ret;
+
 	/*
 	 * dmar_alloc_hwirq() may be called before setup_IO_APIC(), so use
 	 * gsi_top if ioapic_dynirq_base hasn't been initialized yet.
 	 */
-	if (!ioapic_initialized)
-		return gsi_top;
+	ret = ioapic_dynirq_base ? : gsi_top;
+
 	/*
-	 * For DT enabled machines ioapic_dynirq_base is irrelevant and not
-	 * updated. So simply return @from if ioapic_dynirq_base == 0.
+	 * For DT enabled machines ioapic_dynirq_base is irrelevant and
+	 * always 0. gsi_top can be 0 if there is no IO/APIC registered.
+	 *
+	 * 0 is an invalid interrupt number for dynamic allocations. Return
+	 * @from instead.
 	 */
-	return ioapic_dynirq_base ? : from;
+	return ret ? : from;
 }
 
 #ifdef CONFIG_X86_32
Re: [EXTERNAL] Re: [PATCH] x86/ioapic: Don't return 0 as valid virq
Posted by Saurabh Singh Sengar 2 years, 10 months ago
On Fri, Mar 24, 2023 at 04:39:07PM +0100, Thomas Gleixner wrote:
> On Tue, Mar 14 2023 at 10:23, Saurabh Singh Sengar wrote:
> >> This should be added to your commit message: what guest VM is that and
> >> why should the kernel support it.
> >
> > Guest VM is a linux VM running as child partition on Hyper-V. Hyper-v Linux
> > documentation is in Documentation/virt/hyperv/.
> >
> > In commit I wanted to mention that any system which is not registering
> > IO-APIC will have this issue. But I am fine to mention specifically
> > about the issue I am facing.  As part of your next comment, I have
> > explained the issue in detail if that is good, I can put that as
> > commit message.
> >> 
> >> Why doesn't it need an IO-APIC and why does the current code need to be
> >> changed just for your guest VM?
> >
> > For Hyper-V Virtual Machines, few platforms don't have any devices to be
> > hooked to IO-APIC. Although it has Hyper-V based MSI over VMBus which
> > assigns interrupts to PCIe devices. In such platforms IO-APIC is not
> > registered which causes gsi_top value to remain at 0 and not get properly
> > assigned. Moreover, due to the inability to disable CONFIG_X86_IO_APIC
> > flag, the io-apic code still gets compiled. Thus, arch_dynirq_lower_bound
> > function in io_apic.c decides the lower bound of irq numbers based on gsi_top.
> >
> > Later when PCIe-MSI attempts to allocate interrupts, it gets 0 as the first
> > virq number because gsi_top is still 0. 0 being invalid virq is ignored by
> > MSI irq domain and results allocation of the same PCIe MSI twice.
> >
> > 		CPU0		CPU1
> > 0:		2			0		Hyper-V PCIe MSI 1073741824-edge
> > 1:		69			0		Hyper-V PCIe MSI 1073741824-edge      nvme0q0
> >
> > To avoid this issue, if IO-APIC and gsi_top are not initialized, return the
> > hint value passed as 'from' value to arch_dynirq_lower_bound instead of 0.
> > This will also be identical to the behaviour of weak arch_dynirq_lower_bound
> > function defined in kernel/softirq.c.
> 
> I find this mightly confusing. Something like this perhaps:
> 
>   Subject: x86/ioapic: Don't return 0 from arch_dynirq_lower_bound()
> 
>   arch_dynirq_lower_bound() is invoked by the core interrupt code to
>   retrieve the lowest possible Linux interrupt number for dynamically
>   allocated interrupts like MSI.
> 
>   The x86 implementation uses this to exclude the IO/APIC GSI space.
>   This works correctly as long as there is an IO/APIC registered, but
>   returns 0 if not. This has been observed in VMs where the BIOS does
>   not advertise an IO/APIC.  
> 
>   0 is an invalid interrupt number except for the legacy timer interrupt
>   on x86. The return value is unchecked in the core code, so it ends up
>   to allocate interrupt number 0 which is subsequently considered to be
>   invalid by the caller, e.g. the MSI allocation code.
> 
>   The function has already a check for 0 in the case that an IO/APIC is
>   registered, but ioapic_dynirq_base is 0 in case of device tree setups.
> 
>   Consolidate this and zero check for both ioapic_dynirq_base and gsi_top,
>   which is used in the case that no IO/APIC is registered.
> 
> And then make the code to look like the below, which makes it very
> clear what this is about.
> 
> Thanks,
> 
>         tglx
> ---
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2477,17 +2477,22 @@ static int io_apic_get_redir_entries(int
>  
>  unsigned int arch_dynirq_lower_bound(unsigned int from)
>  {
> +	unsigned int ret;
> +
>  	/*
>  	 * dmar_alloc_hwirq() may be called before setup_IO_APIC(), so use
>  	 * gsi_top if ioapic_dynirq_base hasn't been initialized yet.
>  	 */
> -	if (!ioapic_initialized)
> -		return gsi_top;
> +	ret = ioapic_dynirq_base ? : gsi_top;
> +
>  	/*
> -	 * For DT enabled machines ioapic_dynirq_base is irrelevant and not
> -	 * updated. So simply return @from if ioapic_dynirq_base == 0.
> +	 * For DT enabled machines ioapic_dynirq_base is irrelevant and
> +	 * always 0. gsi_top can be 0 if there is no IO/APIC registered.
> +	 *
> +	 * 0 is an invalid interrupt number for dynamic allocations. Return
> +	 * @from instead.
>  	 */
> -	return ioapic_dynirq_base ? : from;
> +	return ret ? : from;
>  }
>  
>  #ifdef CONFIG_X86_32
>

Thanks you for your valuable suggestions. Commit message and code looks
much better now. I will send the V2 with your "Co-Developed-by" tag, I
hope its fine with you.

Regards,
Saurabh