[PATCH v3 0/5] Implement CPU hotplug on Arm

Mykyta Poturai posted 5 patches 4 months ago
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/cover.1760083684.git.mykyta._5Fpoturai@epam.com
[PATCH v3 0/5] Implement CPU hotplug on Arm
Posted by Mykyta Poturai 4 months ago
This series implements support for CPU hotplug/unplug on Arm. To achieve this,
several things need to be done:

1. XEN_SYSCTL_CPU_HOTPLUG_* calls implemented.
2. timer and GIC maintenance interrupts switched to static irqactions to remove
the need for freeing them during release_irq.
3. Enabled the build of xen-hptool on Arm.
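The idea behind item 2 can be pictured with a simplified model. It assumes, hypothetically, that each action records whether it was heap-allocated. A statically allocated action for the timer IRQ can then be released and re-registered on every CPU down/up cycle without any allocation or free, while a dynamically registered action still has to be freed. All names below are illustrative and are not Xen's actual structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified irqaction; not Xen's struct irqaction. */
struct model_irqaction {
    const char *name;
    bool free_on_release;   /* assumption: true only for heap-allocated actions */
};

/* Static action for the timer IRQ: lives for the whole uptime. */
static struct model_irqaction timer_action = { "timer", false };

/* Number of actions actually freed, so the behaviour can be observed. */
static unsigned int nr_freed;

/* Dynamic registration: the action is allocated and must be freed later. */
struct model_irqaction *model_request_irq(const char *name)
{
    struct model_irqaction *a = malloc(sizeof(*a));

    if ( a )
    {
        a->name = name;
        a->free_on_release = true;
    }
    return a;
}

/* Static registration: hand out the same object on every CPU bring-up. */
struct model_irqaction *model_setup_timer_irq(void)
{
    return &timer_action;
}

/* Teardown on CPU down: only heap-allocated actions are freed. */
void model_release_irq(struct model_irqaction *a)
{
    assert(a != NULL);
    if ( a->free_on_release )
    {
        free(a);
        nr_freed++;
    }
}
```

In this model, the teardown path never frees the static timer action, so repeated down/up cycles neither leak nor double-free it.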

Tested on QEMU.

v2->v3:
* add docs

v1->v2:
* see individual patches

Mykyta Poturai (5):
  arm/time: Use static irqaction
  arm/gic: Use static irqaction
  arm/sysctl: Implement cpu hotplug ops
  tools: Allow building xen-hptool without CONFIG_MIGRATE
  docs: Document CPU hotplug

 config/Tools.mk.in               |  1 +
 docs/misc/cpu-hotplug.txt        | 51 ++++++++++++++++++++++++++++++++
 tools/configure                  | 30 +++++++++++++++++++
 tools/configure.ac               |  1 +
 tools/libs/guest/Makefile.common |  4 +++
 tools/misc/Makefile              |  2 +-
 xen/arch/arm/gic.c               | 11 +++++--
 xen/arch/arm/sysctl.c            | 45 ++++++++++++++++++++++++++++
 xen/arch/arm/time.c              | 21 ++++++++++---
 9 files changed, 159 insertions(+), 7 deletions(-)
 create mode 100644 docs/misc/cpu-hotplug.txt
 mode change 100755 => 100644 tools/configure

-- 
2.34.1
Re: [PATCH v3 0/5] Implement CPU hotplug on Arm
Posted by Mykola Kvach 3 months, 3 weeks ago
Hi Mykyta,

Thanks for the series.

It seems there might be issues here -- please take a look and let me
know if my concerns are valid:

1. FF-A notification IRQ: after a CPU down->up cycle the IRQ
configuration may be lost.

2. GICv3 LPIs: a CPU may fail to come back up unless its LPI pending
table exists (is allocated) on bring-up. See
gicv3_lpi_allocate_pendtable() and its call chain.

3. IRQ migration on CPU down: if an IRQ targets a CPU being offlined,
its affinity should be moved to an online CPU before completing the
offlining.

4. Race between the new hypercalls and disable/enable_nonboot_cpus():
disable_nonboot_cpus is called, enable_nonboot_cpus() reads
frozen_cpus, and before it calls cpu_up() a hypercall onlines the CPU.
cpu_up() then fails as "already online", but the CPU_RESUME_FAILED
path may still run for an already-online CPU, risking use-after-free
of per-CPU state (e.g. via free_percpu_area()) and other issues
related to CPU_RESUME_FAILED notification.
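The interleaving in item 4 can be made concrete with a minimal single-threaded model. In this model, a CPU is onlined after disable_nonboot_cpus() has run but before the resume loop calls cpu_up() for it. Everything here is hypothetical: it assumes that cpu_up() rejects an already-online CPU with an error (the real error code may differ), and the counter only stands in for the CPU_RESUME_FAILED notification. A loop that treats every failure as a resume failure runs the failure path for a CPU that is actually online. A hypothetical online check avoids that.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_MODEL_CPUS 4

/* Hypothetical model state; not Xen's cpu_online_map / frozen_cpus. */
static bool online[NR_MODEL_CPUS] = { true, false, false, false };
static bool frozen[NR_MODEL_CPUS];
static unsigned int resume_failed_notifications;

/* Model of cpu_up(): refuses a CPU that is already online. */
int model_cpu_up(unsigned int cpu)
{
    assert(cpu < NR_MODEL_CPUS);
    if ( online[cpu] )
        return -EEXIST; /* assumption: the real error code may differ */
    online[cpu] = true;
    return 0;
}

/* Model of disable_nonboot_cpus(): offline and remember every CPU but 0. */
void model_disable_nonboot_cpus(void)
{
    for ( unsigned int cpu = 1; cpu < NR_MODEL_CPUS; cpu++ )
        if ( online[cpu] )
        {
            online[cpu] = false;
            frozen[cpu] = true;
        }
}

/*
 * Model of enable_nonboot_cpus(). With check_online == false, any cpu_up()
 * failure counts as a resume failure, which is the hazard from item 4.
 */
void model_enable_nonboot_cpus(bool check_online)
{
    for ( unsigned int cpu = 1; cpu < NR_MODEL_CPUS; cpu++ )
    {
        if ( !frozen[cpu] )
            continue;
        frozen[cpu] = false;
        if ( model_cpu_up(cpu) && !(check_online && online[cpu]) )
            resume_failed_notifications++; /* stands in for CPU_RESUME_FAILED */
    }
}
```

Later in the thread, the race is argued not to occur in practice because domains are frozen before disable_nonboot_cpus() is called.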

On Fri, Oct 10, 2025 at 12:36 PM Mykyta Poturai <Mykyta_Poturai@epam.com> wrote:
>
> This series implements support for CPU hotplug/unplug on Arm. To achieve this,
> several things need to be done:
>
> 1. XEN_SYSCTL_CPU_HOTPLUG_* calls implemented.
> 2. timer and GIC maintenance interrupts switched to static irqactions to remove
> the need for freeing them during release_irq.
> 3. Enabled the build of xen-hptool on Arm.
>
> Tested on QEMU.
>
> v2->v3:
> * add docs
>
> v1->v2:
> * see individual patches
>
> Mykyta Poturai (5):
>   arm/time: Use static irqaction
>   arm/gic: Use static irqaction
>   arm/sysctl: Implement cpu hotplug ops
>   tools: Allow building xen-hptool without CONFIG_MIGRATE
>   docs: Document CPU hotplug
>
>  config/Tools.mk.in               |  1 +
>  docs/misc/cpu-hotplug.txt        | 51 ++++++++++++++++++++++++++++++++
>  tools/configure                  | 30 +++++++++++++++++++
>  tools/configure.ac               |  1 +
>  tools/libs/guest/Makefile.common |  4 +++
>  tools/misc/Makefile              |  2 +-
>  xen/arch/arm/gic.c               | 11 +++++--
>  xen/arch/arm/sysctl.c            | 45 ++++++++++++++++++++++++++++
>  xen/arch/arm/time.c              | 21 ++++++++++---
>  9 files changed, 159 insertions(+), 7 deletions(-)
>  create mode 100644 docs/misc/cpu-hotplug.txt
>  mode change 100755 => 100644 tools/configure
>
> --
> 2.34.1
>

Best regards,
Mykola
Re: [PATCH v3 0/5] Implement CPU hotplug on Arm
Posted by Mykyta Poturai 3 months, 3 weeks ago
On 15.10.25 20:30, Mykola Kvach wrote:
> Hi Mykyta,
> 
> Thanks for the series.
> 
> It seems there might be issues here -- please take a look and let me
> know if my concerns are valid:
> 
> 1. FF-A notification IRQ: after a CPU down->up cycle the IRQ
> configuration may be lost.

OP-TEE and FF-A are marked as unsupported.

> 2. GICv3 LPIs: a CPU may fail to come back up unless its LPI pending
> table exists (is allocated) on bring-up. See
> gicv3_lpi_allocate_pendtable() and its call chain.

ITS is marked as unsupported. I have a plan to deal with this, but it is 
out of scope of this series.

> 3. IRQ migration on CPU down: if an IRQ targets a CPU being offlined,
> its affinity should be moved to an online CPU before completing the
> offlining.

All guest-tied IRQ migration is handled by the scheduler. Regarding the
IRQs used by Xen, I didn't find any with affinity to a CPU other than CPU
0, which can't be disabled. In theory they could have a different
affinity, but that seems unlikely, considering that the x86 hotplug code
also doesn't seem to do any Xen IRQ migration AFAIU.

> 4. Race between the new hypercalls and disable/enable_nonboot_cpus():
> disable_nonboot_cpus is called, enable_nonboot_cpus() reads
> frozen_cpus, and before it calls cpu_up() a hypercall onlines the CPU.
> cpu_up() then fails as "already online", but the CPU_RESUME_FAILED
> path may still run for an already-online CPU, risking use-after-free
> of per-CPU state (e.g. via free_percpu_area()) and other issues
> related to CPU_RESUME_FAILED notification.
> 

There don't seem to be any calls to disable/enable_nonboot_cpus() on 
Arm. If we take x86 as an example, then they are called with all domains 
already paused, and I don't see how paused domains can issue hypercalls.

> 
> Best regards,
> Mykola

-- 
Mykyta
Re: [PATCH v3 0/5] Implement CPU hotplug on Arm
Posted by Mykola Kvach 3 months, 3 weeks ago
Hi Mykyta,

Thank you for your answers.

On Mon, Oct 20, 2025 at 5:15 PM Mykyta Poturai <Mykyta_Poturai@epam.com> wrote:
>
> On 15.10.25 20:30, Mykola Kvach wrote:
> > Hi Mykyta,
> >
> > Thanks for the series.
> >
> > It seems there might be issues here -- please take a look and let me
> > know if my concerns are valid:
> >
> > 1. FF-A notification IRQ: after a CPU down->up cycle the IRQ
> > configuration may be lost.
>
> OP-TEE and FF-A are marked as unsupported.

Understood, thanks. Would it be worth documenting this?

>
> > 2. GICv3 LPIs: a CPU may fail to come back up unless its LPI pending
> > table exists (is allocated) on bring-up. See
> > gicv3_lpi_allocate_pendtable() and its call chain.
>
> ITS is marked as unsupported. I have a plan to deal with this, but it is
> out of scope of this series.

Thanks for the clarification. Should we document this somewhere?

>
> > 3. IRQ migration on CPU down: if an IRQ targets a CPU being offlined,
> > its affinity should be moved to an online CPU before completing the
> > offlining.
>
> All guest-tied IRQ migration is handled by the scheduler. Regarding the
> IRQs used by Xen, I didn't find any with affinity to a CPU other than CPU
> 0, which can't be disabled. In theory they could have a different
> affinity, but that seems unlikely, considering that the x86 hotplug code
> also doesn't seem to do any Xen IRQ migration AFAIU.

What about arm_smmu_init_domain_context and its related call chains?
As far as I can see, some of these paths are reachable via XEN_DOMCTL_* hypercalls,
and my understanding is they can be issued on any CPU. Should we add a
check that no enabled (e)SPIs owned by Xen are pinned to the offlining
CPU?
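The check suggested here could look roughly like the sketch below. It uses a hypothetical flat array describing SPIs, whereas Xen's real irq_desc and GIC routing state are organised differently. An offline path could refuse the request when a match is found, for example with -EBUSY, which is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of an SPI (not Xen's irq_desc). */
struct model_spi {
    unsigned int irq;
    bool enabled;
    bool owned_by_xen;      /* e.g. an SMMU IRQ, not routed to a guest */
    unsigned int target_cpu;
};

/*
 * Return the first enabled Xen-owned SPI pinned to @cpu, or NULL if the CPU
 * can be offlined without stranding such an interrupt.
 */
const struct model_spi *find_pinned_xen_spi(const struct model_spi *spis,
                                            size_t nr, unsigned int cpu)
{
    assert(spis != NULL || nr == 0);
    for ( size_t i = 0; i < nr; i++ )
        if ( spis[i].enabled && spis[i].owned_by_xen &&
             spis[i].target_cpu == cpu )
            return &spis[i];
    return NULL;
}
```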

>
> > 4. Race between the new hypercalls and disable/enable_nonboot_cpus():
> > disable_nonboot_cpus is called, enable_nonboot_cpus() reads
> > frozen_cpus, and before it calls cpu_up() a hypercall onlines the CPU.
> > cpu_up() then fails as "already online", but the CPU_RESUME_FAILED
> > path may still run for an already-online CPU, risking use-after-free
> > of per-CPU state (e.g. via free_percpu_area()) and other issues
> > related to CPU_RESUME_FAILED notification.
> >
>
> There don't seem to be any calls to disable/enable_nonboot_cpus() on
> Arm. If we take x86 as an example, then they are called with all domains
> already paused, and I don't see how paused domains can issue hypercalls.

Agreed; this looks even less likely given that disable_* runs on CPU0 and
your new hypercalls execute on CPU0. The only plausible issue would be a
contrived case where code disables non-boot CPUs from CPU0 but enables them
from another CPU woken by a hypercall. That seems unrealistic.

>
> >
> > Best regards,
> > Mykola
>
> --
> Mykyta

Best regards,
Mykola
Re: [PATCH v3 0/5] Implement CPU hotplug on Arm
Posted by Julien Grall 3 months ago
Hi,

On 20/10/2025 19:00, Mykola Kvach wrote:
> Thank you for your answers.
> 
> On Mon, Oct 20, 2025 at 5:15 PM Mykyta Poturai <Mykyta_Poturai@epam.com> wrote:
>>
>> On 15.10.25 20:30, Mykola Kvach wrote:
>>> Hi Mykyta,
>>>
>>> Thanks for the series.
>>>
>>> It seems there might be issues here -- please take a look and let me
>>> know if my concerns are valid:
>>>
>>> 1. FF-A notification IRQ: after a CPU down->up cycle the IRQ
>>> configuration may be lost.
>>
>> OP-TEE and FF-A are marked as unsupported.
> 
> Understood, thanks. Would it be worth documenting this?

This must be documented. OP-TEE, FF-A and ITS will eventually be
supported, so we need to know the gap.

I think it would also be worth having a Kconfig option, alongside the
documentation, indicating whether CPU hotplug (and soon suspend/resume)
can be used. That way CPU hotplug will fail gracefully rather than put
the system in an undefined state.

>>
>>> 2. GICv3 LPIs: a CPU may fail to come back up unless its LPI pending
>>> table exists (is allocated) on bring-up. See
>>> gicv3_lpi_allocate_pendtable() and its call chain.
>>
>> ITS is marked as unsupported. I have a plan to deal with this, but it is
>> out of scope of this series.
>
> Thanks for the clarification. Should we document this somewhere?
> 
>>
>>> 3. IRQ migration on CPU down: if an IRQ targets a CPU being offlined,
>>> its affinity should be moved to an online CPU before completing the
>>> offlining.
>>
>> All guest-tied IRQ migration is handled by the scheduler. Regarding the
>> IRQs used by Xen, I didn't find any with affinity to a CPU other than CPU
>> 0, which can't be disabled. In theory they could have a different
>> affinity, but that seems unlikely, considering that the x86 hotplug code
>> also doesn't seem to do any Xen IRQ migration AFAIU.
> 
> What about arm_smmu_init_domain_context and its related call chains?
> As far as I can see, some of these paths are reachable via XEN_DOMCTL_* hypercalls,
> and my understanding is they can be issued on any CPU.

You are right. The SMMU can be configured from any pCPU. When 
request_irq() is called, it will route the IRQ to the current pCPU.

Those IRQs are not guest interrupts, so from my understanding, they 
would not be migrated.

> Should we add a
> check that no enabled (e)SPIs owned by Xen are pinned to the offlining
> CPU?

This would be good. But I also think we should aim to migrate those 
interrupts.
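Migrating those interrupts could be sketched as follows. The sketch assumes a hypothetical SPI table and a 64-bit online mask in place of Xen's cpumask_t. Guest-owned interrupts are skipped, since the thread notes that the scheduler handles them. Xen-owned interrupts targeting the dying CPU are moved to the lowest-numbered remaining online CPU. Reprogramming the GIC routing for each moved IRQ is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified SPI record (not Xen's irq_desc). */
struct model_spi {
    unsigned int irq;
    bool owned_by_xen;
    unsigned int target_cpu;
};

/* Pick the lowest-numbered online CPU other than @dying; -1 if none. */
static int pick_online_cpu(uint64_t online_mask, unsigned int dying)
{
    for ( unsigned int cpu = 0; cpu < 64; cpu++ )
        if ( cpu != dying && (online_mask & (UINT64_C(1) << cpu)) )
            return (int)cpu;
    return -1;
}

/*
 * Retarget every Xen-owned SPI that points at @dying (enabled or not, so a
 * later enable does not route to an offline CPU). Returns how many moved.
 */
size_t migrate_xen_spis(struct model_spi *spis, size_t nr,
                        uint64_t online_mask, unsigned int dying)
{
    int target = pick_online_cpu(online_mask, dying);
    size_t moved = 0;

    assert(target >= 0); /* CPU 0 stays online, so a target should exist */
    for ( size_t i = 0; i < nr; i++ )
    {
        if ( !spis[i].owned_by_xen || spis[i].target_cpu != dying )
            continue;
        spis[i].target_cpu = (unsigned int)target;
        moved++;
    }
    return moved;
}
```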

>>
>>> 4. Race between the new hypercalls and disable/enable_nonboot_cpus():
>>> disable_nonboot_cpus is called, enable_nonboot_cpus() reads
>>> frozen_cpus, and before it calls cpu_up() a hypercall onlines the CPU.
>>> cpu_up() then fails as "already online", but the CPU_RESUME_FAILED
>>> path may still run for an already-online CPU, risking use-after-free
>>> of per-CPU state (e.g. via free_percpu_area()) and other issues
>>> related to CPU_RESUME_FAILED notification.
>>>
>>
>> There don't seem to be any calls to disable/enable_nonboot_cpus() on
>> Arm.

Yet. There is a patch series that uses these functions as part of
suspend/resume. In fact, this series is a prerequisite for the
suspend/resume series.

>> If we take x86 as an example, then they are called with all domains
>> already paused, and I don't see how paused domains can issue hypercalls.

The Arm version will also freeze all the domains before calling
disable_nonboot_cpus(), so there should be no race on Arm either.

Cheers,

-- 
Julien Grall