Move IRQs from dying CPU to the online ones.
Guest-bound IRQs are already handled by scheduler in the process of
moving vCPUs to active pCPUs, so we only need to handle IRQs used by Xen
itself.
If IRQ is to be migrated, it's affinity is set to a mask of all online
CPUs. With current GIC implementation, this means they are routed to a
random online CPU. This may cause extra moves if multiple cores are
disabled in sequence, but should prevent all interrupts from piling up
on CPU0 in case of repeated up-down cycles on different cores.
IRQs from CPU 0 are never migrated, as dying CPU 0 means we are either
shutting down compeletely or entering system suspend.
Considering that all Xen-used IRQs are currently allocated during init
on CPU 0, and setup_irq uses smp_processor_id for the initial affinity.
This change is not strictly required for correct operation for now, but
it should future-proof cpu hotplug and system suspend support in case
some kind if IRQ balancing is implemented later.
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
v3->v4:
* patch introduced
---
xen/arch/arm/include/asm/irq.h | 2 ++
xen/arch/arm/irq.c | 39 ++++++++++++++++++++++++++++++++++
xen/arch/arm/smpboot.c | 2 ++
3 files changed, 43 insertions(+)
diff --git a/xen/arch/arm/include/asm/irq.h b/xen/arch/arm/include/asm/irq.h
index 09788dbfeb..6e6e27bb80 100644
--- a/xen/arch/arm/include/asm/irq.h
+++ b/xen/arch/arm/include/asm/irq.h
@@ -126,6 +126,8 @@ bool irq_type_set_by_domain(const struct domain *d);
void irq_end_none(struct irq_desc *irq);
#define irq_end_none irq_end_none
+void evacuate_irqs(unsigned int from);
+
#endif /* _ASM_HW_IRQ_H */
/*
* Local variables:
diff --git a/xen/arch/arm/irq.c b/xen/arch/arm/irq.c
index 28b40331f7..b383d71930 100644
--- a/xen/arch/arm/irq.c
+++ b/xen/arch/arm/irq.c
@@ -158,6 +158,45 @@ static int init_local_irq_data(unsigned int cpu)
return 0;
}
+static void evacuate_irq(int irq, unsigned int from)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ unsigned long flags;
+
+ /* Don't move irqs from CPU 0 as it is always last to be disabled */
+ if ( from == 0 )
+ return;
+
+ ASSERT(!cpumask_empty(&cpu_online_map));
+ ASSERT(!cpumask_test_cpu(from, &cpu_online_map));
+
+ spin_lock_irqsave(&desc->lock, flags);
+ if ( likely(!desc->action) )
+ goto out;
+
+ if ( likely(test_bit(_IRQ_GUEST, &desc->status) ||
+ test_bit(_IRQ_MOVE_PENDING, &desc->status)) )
+ goto out;
+
+ if ( cpumask_test_cpu(from, desc->affinity) )
+ irq_set_affinity(desc, &cpu_online_map);
+
+out:
+ spin_unlock_irqrestore(&desc->lock, flags);
+ return;
+}
+
+void evacuate_irqs(unsigned int from)
+{
+ int irq;
+
+ for ( irq = NR_LOCAL_IRQS; irq < NR_IRQS; irq++ )
+ evacuate_irq(irq, from);
+
+ for ( irq = ESPI_BASE_INTID; irq < ESPI_MAX_INTID; irq++ )
+ evacuate_irq(irq, from);
+}
+
static int cpu_callback(struct notifier_block *nfb, unsigned long action,
void *hcpu)
{
diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
index 7f3cfa812e..46b24783dd 100644
--- a/xen/arch/arm/smpboot.c
+++ b/xen/arch/arm/smpboot.c
@@ -425,6 +425,8 @@ void __cpu_disable(void)
smp_mb();
+ evacuate_irqs(cpu);
+
/* Return to caller; eventually the IPI mechanism will unwind and the
* scheduler will drop to the idle loop, which will call stop_cpu(). */
}
--
2.51.2
Hi,
On 12/11/2025 10:51, Mykyta Poturai wrote:
> Move IRQs from dying CPU to the online ones.
> Guest-bound IRQs are already handled by scheduler in the process of
> moving vCPUs to active pCPUs, so we only need to handle IRQs used by Xen
> itself.
>
> If IRQ is to be migrated, it's affinity is set to a mask of all online
> CPUs. With current GIC implementation, this means they are routed to a
> random online CPU. This may cause extra moves if multiple cores are
> disabled in sequence, but should prevent all interrupts from piling up
> on CPU0 in case of repeated up-down cycles on different cores.
Wouldn't they eventually all move to CPU0 in the case of suspend/resume
or if all the CPUs but CPU0 are turned off and then off? If so,
shouldn't we try to rebalance the interrupts?
>
> IRQs from CPU 0 are never migrated, as dying CPU 0 means we are either
> shutting down compeletely or entering system suspend.
I can't find a place where __cpu_disable() is called on CPU0. Do you
have any pointer? In any case, I am not sure I want to bake that
assumption in more places of the code.
>
> Considering that all Xen-used IRQs are currently allocated during init
> on CPU 0, and setup_irq uses smp_processor_id for the initial affinity.
Looking at the SMMU driver, we seems to request IRQs at the time the
device is attached. So are you sure about this?
> This change is not strictly required for correct operation for now, but
> it should future-proof cpu hotplug and system suspend support in case
> some kind if IRQ balancing is implemented later.
>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
>
> v3->v4:
> * patch introduced
> ---
> xen/arch/arm/include/asm/irq.h | 2 ++
> xen/arch/arm/irq.c | 39 ++++++++++++++++++++++++++++++++++
> xen/arch/arm/smpboot.c | 2 ++
> 3 files changed, 43 insertions(+)
>
> diff --git a/xen/arch/arm/include/asm/irq.h b/xen/arch/arm/include/asm/irq.h
> index 09788dbfeb..6e6e27bb80 100644
> --- a/xen/arch/arm/include/asm/irq.h
> +++ b/xen/arch/arm/include/asm/irq.h
> @@ -126,6 +126,8 @@ bool irq_type_set_by_domain(const struct domain *d);
> void irq_end_none(struct irq_desc *irq);
> #define irq_end_none irq_end_none
>
> +void evacuate_irqs(unsigned int from);
> +
> #endif /* _ASM_HW_IRQ_H */
> /*
> * Local variables:
> diff --git a/xen/arch/arm/irq.c b/xen/arch/arm/irq.c
> index 28b40331f7..b383d71930 100644
> --- a/xen/arch/arm/irq.c
> +++ b/xen/arch/arm/irq.c
> @@ -158,6 +158,45 @@ static int init_local_irq_data(unsigned int cpu)
> return 0;
> }
>
> +static void evacuate_irq(int irq, unsigned int from)
Any reason why the 'irq' is signed?
> +{
> + struct irq_desc *desc = irq_to_desc(irq);
> + unsigned long flags;
> +
> + /* Don't move irqs from CPU 0 as it is always last to be disabled */
Per above, I am not convinced that we should special case CPU 0. But if
we do, then shouldn't this be part of evacuate_irqs() so we don't
pointlessly go through all the IRQs?
> + if ( from == 0 )
> + return;
> +
> + ASSERT(!cpumask_empty(&cpu_online_map));
> + ASSERT(!cpumask_test_cpu(from, &cpu_online_map));
> +
> + spin_lock_irqsave(&desc->lock, flags);
> + if ( likely(!desc->action) )
> + goto out;
> +
> + if ( likely(test_bit(_IRQ_GUEST, &desc->status) ||
> + test_bit(_IRQ_MOVE_PENDING, &desc->status)) )
> + goto out;
> +
> + if ( cpumask_test_cpu(from, desc->affinity) )
> + irq_set_affinity(desc, &cpu_online_map);
I think it would be worth explaining why we are routing to any CPU
online rather than checking whether the affinity has other online CPUs.
Just to note, I don't have strong opinion either way. It mainly needs to
be documented.
> +
> +out:
> + spin_unlock_irqrestore(&desc->lock, flags);
> + return;
> +}
> +
> +void evacuate_irqs(unsigned int from)
> +{
> + int irq;
> +> + for ( irq = NR_LOCAL_IRQS; irq < NR_IRQS; irq++ )
> + evacuate_irq(irq, from);
> +
> + for ( irq = ESPI_BASE_INTID; irq < ESPI_MAX_INTID; irq++ )
AFAICT, irq_to_desc() would not be able to cope with ESPI interrupts
when CONFIG_GICV3_ESPI is not set. Has this been tested?
> + evacuate_irq(irq, from);
> +}
> +
> static int cpu_callback(struct notifier_block *nfb, unsigned long action,
> void *hcpu)
> {
> diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
> index 7f3cfa812e..46b24783dd 100644
> --- a/xen/arch/arm/smpboot.c
> +++ b/xen/arch/arm/smpboot.c
> @@ -425,6 +425,8 @@ void __cpu_disable(void)
>
> smp_mb();
>
> + evacuate_irqs(cpu);
I think it would be worth explaining why evacuate_irqs() is called this
late in the process.
> +> /* Return to caller; eventually the IPI mechanism will
unwind and the
> * scheduler will drop to the idle loop, which will call stop_cpu(). */
> }
Cheers,
--
Julien Grall
On 16.11.25 13:24, Julien Grall wrote:
> Hi,
>
> On 12/11/2025 10:51, Mykyta Poturai wrote:
>> Move IRQs from dying CPU to the online ones.
>> Guest-bound IRQs are already handled by scheduler in the process of
>> moving vCPUs to active pCPUs, so we only need to handle IRQs used by Xen
>> itself.
>>
>> If IRQ is to be migrated, it's affinity is set to a mask of all online
>> CPUs. With current GIC implementation, this means they are routed to a
>> random online CPU. This may cause extra moves if multiple cores are
>> disabled in sequence, but should prevent all interrupts from piling up
>> on CPU0 in case of repeated up-down cycles on different cores.
>
> Wouldn't they eventually all move to CPU0 in the case of suspend/resume
> or if all the CPUs but CPU0 are turned off and then off? If so,
> shouldn't we try to rebalance the interrupts?
>
In case of disabling/enabling all cores in a sequence, yes. This was
designed with the idea of achieving some balancing when
enabling/disabling some cores for power saving reasons. I agree that
proper balancing should be implemented, but it is a complex task on its
own and requires a substantial amount of testing on different hardware
to prove it is close to optimal. So I think implementing it is out of
scope for what I’m trying to do here.
If this would be okay, I can implement a relatively simple solution of
just adding onlined CPUs back to the affinity mask for now. I think it
should improve the situation for the “switching all cores” case.
>>
>> IRQs from CPU 0 are never migrated, as dying CPU 0 means we are either
>> shutting down compeletely or entering system suspend.
>
> I can't find a place where __cpu_disable() is called on CPU0. Do you
> have any pointer? In any case, I am not sure I want to bake that
> assumption in more places of the code.
>
I assume it would be called when suspend is implemented. In any case, I
will rework this to replace the hard check for CPU 0 with the “is it the
last CPU online” one.
>>
>> Considering that all Xen-used IRQs are currently allocated during init
>> on CPU 0, and setup_irq uses smp_processor_id for the initial affinity.
>
> Looking at the SMMU driver, we seems to request IRQs at the time the
> device is attached. So are you sure about this?
>
Indeed, I have missed that one. I will remove those statements then.
>> This change is not strictly required for correct operation for now, but
>> it should future-proof cpu hotplug and system suspend support in case
>> some kind if IRQ balancing is implemented later.
>>
>> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
>>
>> v3->v4:
>> * patch introduced
>> ---
>> xen/arch/arm/include/asm/irq.h | 2 ++
>> xen/arch/arm/irq.c | 39 ++++++++++++++++++++++++++++++++++
>> xen/arch/arm/smpboot.c | 2 ++
>> 3 files changed, 43 insertions(+)
>>
>> diff --git a/xen/arch/arm/include/asm/irq.h b/xen/arch/arm/include/
>> asm/irq.h
>> index 09788dbfeb..6e6e27bb80 100644
>> --- a/xen/arch/arm/include/asm/irq.h
>> +++ b/xen/arch/arm/include/asm/irq.h
>> @@ -126,6 +126,8 @@ bool irq_type_set_by_domain(const struct domain *d);
>> void irq_end_none(struct irq_desc *irq);
>> #define irq_end_none irq_end_none
>> +void evacuate_irqs(unsigned int from);
>> +
>> #endif /* _ASM_HW_IRQ_H */
>> /*
>> * Local variables:
>> diff --git a/xen/arch/arm/irq.c b/xen/arch/arm/irq.c
>> index 28b40331f7..b383d71930 100644
>> --- a/xen/arch/arm/irq.c
>> +++ b/xen/arch/arm/irq.c
>> @@ -158,6 +158,45 @@ static int init_local_irq_data(unsigned int cpu)
>> return 0;
>> }
>> +static void evacuate_irq(int irq, unsigned int from)
>
> Any reason why the 'irq' is signed?
>
>> +{
>> + struct irq_desc *desc = irq_to_desc(irq);
>> + unsigned long flags;
>> +
>> + /* Don't move irqs from CPU 0 as it is always last to be disabled */
>
> Per above, I am not convinced that we should special case CPU 0. But if
> we do, then shouldn't this be part of evacuate_irqs() so we don't
> pointlessly go through all the IRQs?
>
>> + if ( from == 0 )
>> + return;
>> +
>> + ASSERT(!cpumask_empty(&cpu_online_map));
>> + ASSERT(!cpumask_test_cpu(from, &cpu_online_map));
>> +
>> + spin_lock_irqsave(&desc->lock, flags);
>> + if ( likely(!desc->action) )
>> + goto out;
>> +
>> + if ( likely(test_bit(_IRQ_GUEST, &desc->status) ||
>> + test_bit(_IRQ_MOVE_PENDING, &desc->status)) )
>> + goto out;
>> +
>> + if ( cpumask_test_cpu(from, desc->affinity) )
>> + irq_set_affinity(desc, &cpu_online_map);
>
> I think it would be worth explaining why we are routing to any CPU
> online rather than checking whether the affinity has other online CPUs.
>
> Just to note, I don't have strong opinion either way. It mainly needs to
> be documented.
>
>> +
>> +out:
>> + spin_unlock_irqrestore(&desc->lock, flags);
>> + return;
>> +}
>> +
>> +void evacuate_irqs(unsigned int from)
>> +{
>> + int irq;
> > +> + for ( irq = NR_LOCAL_IRQS; irq < NR_IRQS; irq++ )
>> + evacuate_irq(irq, from);
>> +
>> + for ( irq = ESPI_BASE_INTID; irq < ESPI_MAX_INTID; irq++ )
>
> AFAICT, irq_to_desc() would not be able to cope with ESPI interrupts
> when CONFIG_GICV3_ESPI is not set. Has this been tested?
>
>> + evacuate_irq(irq, from);
>> +}
>> +
>> static int cpu_callback(struct notifier_block *nfb, unsigned long
>> action,
>> void *hcpu)
>> {
>> diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
>> index 7f3cfa812e..46b24783dd 100644
>> --- a/xen/arch/arm/smpboot.c
>> +++ b/xen/arch/arm/smpboot.c
>> @@ -425,6 +425,8 @@ void __cpu_disable(void)
>> smp_mb();
>> + evacuate_irqs(cpu);
>
> I think it would be worth explaining why evacuate_irqs() is called this
> late in the process.
>
> > +> /* Return to caller; eventually the IPI mechanism will
> unwind and the
>> * scheduler will drop to the idle loop, which will call
>> stop_cpu(). */
>> }
>
> Cheers,
>
--
Mykyta
Hi, On 12/12/2025 10:01, Mykyta Poturai wrote: > On 16.11.25 13:24, Julien Grall wrote: >> Hi, >> >> On 12/11/2025 10:51, Mykyta Poturai wrote: >>> Move IRQs from dying CPU to the online ones. >>> Guest-bound IRQs are already handled by scheduler in the process of >>> moving vCPUs to active pCPUs, so we only need to handle IRQs used by Xen >>> itself. >>> >>> If IRQ is to be migrated, it's affinity is set to a mask of all online >>> CPUs. With current GIC implementation, this means they are routed to a >>> random online CPU. This may cause extra moves if multiple cores are >>> disabled in sequence, but should prevent all interrupts from piling up >>> on CPU0 in case of repeated up-down cycles on different cores. >> >> Wouldn't they eventually all move to CPU0 in the case of suspend/resume >> or if all the CPUs but CPU0 are turned off and then off? If so, >> shouldn't we try to rebalance the interrupts? >> > > In case of disabling/enabling all cores in a sequence, yes. This was > designed with the idea of achieving some balancing when > enabling/disabling some cores for power saving reasons. I understand how this may balance when disabling some cores. But I don't understand how it helps for enabling cores. Can you provide more details? > I agree that > proper balancing should be implemented, but it is a complex task on its > own and requires a substantial amount of testing on different hardware > to prove it is close to optimal. So I think implementing it is out of > scope for what I’m trying to do here. Can you provide some details about what you are trying and why it would be ok to avoid the balancing? > > If this would be okay, I can implement a relatively simple solution of > just adding onlined CPUs back to the affinity mask for now. I think it > should improve the situation for the “switching all cores” case. Do you mean calling gic_irq_set_affinity() with the CPU? If so, Xen is still going to get one CPU from the affinity mask. > >>> >>> IRQs from CPU 0 are never migrated, as dying CPU 0 means we are either >>> shutting down compeletely or entering system suspend. >> >> I can't find a place where __cpu_disable() is called on CPU0. Do you >> have any pointer? In any case, I am not sure I want to bake that >> assumption in more places of the code. >> > > I assume it would be called when suspend is implemented. In any case, I > will rework this to replace the hard check for CPU 0 with the “is it the > last CPU online” one. AFAIK __cpu_disable() is not going to be called during suspend/resume for CPU0. So the only case is shutting down the platform. > >>> >>> Considering that all Xen-used IRQs are currently allocated during init >>> on CPU 0, and setup_irq uses smp_processor_id for the initial affinity. >> >> Looking at the SMMU driver, we seems to request IRQs at the time the >> device is attached. So are you sure about this? >> > > Indeed, I have missed that one. I will remove those statements then. I think you need to have something in the commit message explaining why you are ignoring the balancing for the SMMU. Cheers, -- Julien Grall
© 2016 - 2026 Red Hat, Inc.