From: Mykola Kvach <mykola_kvach@epam.com>
This patch adds support for the PSCI SYSTEM_SUSPEND function in the vPSCI
(virtual PSCI) interface, allowing guests to request suspend via the PSCI
v1.0 SYSTEM_SUSPEND call (both 32-bit and 64-bit variants).
The implementation:
- Adds SYSTEM_SUSPEND function IDs to PSCI definitions
- Implements trapping and handling of SYSTEM_SUSPEND in vPSCI
- Allows only non-hardware domains to invoke SYSTEM_SUSPEND; for the
  hardware domain, PSCI_NOT_SUPPORTED is returned to avoid halting the
  system in hwdom_shutdown() called from domain_shutdown
- Ensures all secondary VCPUs of the calling domain are offline before
  allowing suspend due to PSCI spec
- the domain is shut down with SHUTDOWN_suspend, and resumes execution at
  the address provided during suspend by guest
Usage:
For Linux-based guests, suspend can be initiated with:
    echo mem > /sys/power/state
or via:
    systemctl suspend
Resuming the guest is performed from control domain using:
      xl resume <domain>
Signed-off-by: Mykola Kvach <mykola_kvach@epam.com>
---
Changes in V5:
- don't use standby mode, restore execution in a provided by guest point
- move checking that all CPUs, except current one, are offline to after
  pausing the vCPUs
- provide ret status from arch_domain_shutdown and handle it in
  domain_shutdown
- adjust VPSCI_NR_FUNCS to reflect the number of newly added PSCI functions
Changes in V4:
Dropped all changes related to watchdog, domain is marked as shutting
down in domain_shutdown and watchdog timeout handler won't trigger
because of it.
Previous versions included code to manage Xen watchdog timers during suspend,
but this was removed. When a guest OS starts the Xen watchdog (either via the
kernel driver or xenwatchdogd), it is responsible for managing that state
across suspend/resume. On Linux, the Xen kernel driver properly stops the
watchdog during suspend. However, when xenwatchdogd is used instead, suspend
handling is incomplete, potentially leading to watchdog-triggered resets on
resume. Xen leaves watchdog handling to the guest OS and its services.
Dropped all changes related to VCPU context, because instead domain_shutdown
is used, so we don't need any extra changes for suspending domain.
Changes in V3:
Dropped all domain flags and related code (which touched common functions like
vcpu_unblock), keeping only the necessary changes for Xen suspend/resume, i.e.
suspend/resume is now fully supported only for the hardware domain.
Proper support for domU suspend/resume will be added in a future patch.
This patch does not yet include VCPU context reset or domain context
restoration in VCPU.
---
 xen/arch/arm/include/asm/perfc_defn.h |  1 +
 xen/arch/arm/include/asm/psci.h       |  2 +
 xen/arch/arm/include/asm/vpsci.h      |  2 +-
 xen/arch/arm/mmu/p2m.c                |  6 +-
 xen/arch/arm/vpsci.c                  | 96 +++++++++++++++++++++++----
 5 files changed, 92 insertions(+), 15 deletions(-)
diff --git a/xen/arch/arm/include/asm/perfc_defn.h b/xen/arch/arm/include/asm/perfc_defn.h
index effd25b69e..8dfcac7e3b 100644
--- a/xen/arch/arm/include/asm/perfc_defn.h
+++ b/xen/arch/arm/include/asm/perfc_defn.h
@@ -33,6 +33,7 @@ PERFCOUNTER(vpsci_system_reset,        "vpsci: system_reset")
 PERFCOUNTER(vpsci_cpu_suspend,         "vpsci: cpu_suspend")
 PERFCOUNTER(vpsci_cpu_affinity_info,   "vpsci: cpu_affinity_info")
 PERFCOUNTER(vpsci_features,            "vpsci: features")
+PERFCOUNTER(vpsci_system_suspend,      "vpsci: system_suspend")
 
 PERFCOUNTER(vcpu_kick,                 "vcpu: notify other vcpu")
 
diff --git a/xen/arch/arm/include/asm/psci.h b/xen/arch/arm/include/asm/psci.h
index 4780972621..48a93e6b79 100644
--- a/xen/arch/arm/include/asm/psci.h
+++ b/xen/arch/arm/include/asm/psci.h
@@ -47,10 +47,12 @@ void call_psci_system_reset(void);
 #define PSCI_0_2_FN32_SYSTEM_OFF          PSCI_0_2_FN32(8)
 #define PSCI_0_2_FN32_SYSTEM_RESET        PSCI_0_2_FN32(9)
 #define PSCI_1_0_FN32_PSCI_FEATURES       PSCI_0_2_FN32(10)
+#define PSCI_1_0_FN32_SYSTEM_SUSPEND      PSCI_0_2_FN32(14)
 
 #define PSCI_0_2_FN64_CPU_SUSPEND         PSCI_0_2_FN64(1)
 #define PSCI_0_2_FN64_CPU_ON              PSCI_0_2_FN64(3)
 #define PSCI_0_2_FN64_AFFINITY_INFO       PSCI_0_2_FN64(4)
+#define PSCI_1_0_FN64_SYSTEM_SUSPEND      PSCI_0_2_FN64(14)
 
 /* PSCI v0.2 affinity level state returned by AFFINITY_INFO */
 #define PSCI_0_2_AFFINITY_LEVEL_ON      0
diff --git a/xen/arch/arm/include/asm/vpsci.h b/xen/arch/arm/include/asm/vpsci.h
index 0cca5e6830..69d40f9d7f 100644
--- a/xen/arch/arm/include/asm/vpsci.h
+++ b/xen/arch/arm/include/asm/vpsci.h
@@ -23,7 +23,7 @@
 #include <asm/psci.h>
 
 /* Number of function implemented by virtual PSCI (only 0.2 or later) */
-#define VPSCI_NR_FUNCS  12
+#define VPSCI_NR_FUNCS  14
 
 /* Functions handle PSCI calls from the guests */
 bool do_vpsci_0_1_call(struct cpu_user_regs *regs, uint32_t fid);
diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
index 67296dabb5..f9c09a49e2 100644
--- a/xen/arch/arm/mmu/p2m.c
+++ b/xen/arch/arm/mmu/p2m.c
@@ -6,6 +6,8 @@
 #include <xen/sched.h>
 #include <xen/softirq.h>
 
+#include <public/sched.h>
+
 #include <asm/alternative.h>
 #include <asm/event.h>
 #include <asm/flushtlb.h>
@@ -198,7 +200,9 @@ void dump_p2m_lookup(struct domain *d, paddr_t addr)
  */
 void p2m_save_state(struct vcpu *p)
 {
-    p->arch.sctlr = READ_SYSREG(SCTLR_EL1);
+    if ( !(p->domain->is_shutting_down &&
+           p->domain->shutdown_code == SHUTDOWN_suspend) )
+        p->arch.sctlr = READ_SYSREG(SCTLR_EL1);
 
     if ( cpus_have_const_cap(ARM64_WORKAROUND_AT_SPECULATE) )
     {
diff --git a/xen/arch/arm/vpsci.c b/xen/arch/arm/vpsci.c
index 7ba9ccd94b..c6b9ac1fc8 100644
--- a/xen/arch/arm/vpsci.c
+++ b/xen/arch/arm/vpsci.c
@@ -10,28 +10,18 @@
 
 #include <public/sched.h>
 
-static int do_common_cpu_on(register_t target_cpu, register_t entry_point,
+static int do_setup_vcpu_ctx(struct vcpu *v, register_t entry_point,
                             register_t context_id)
 {
-    struct vcpu *v;
     struct domain *d = current->domain;
     struct vcpu_guest_context *ctxt;
     int rc;
     bool is_thumb = entry_point & 1;
-    register_t vcpuid;
-
-    vcpuid = vaffinity_to_vcpuid(target_cpu);
-
-    if ( (v = domain_vcpu(d, vcpuid)) == NULL )
-        return PSCI_INVALID_PARAMETERS;
 
     /* THUMB set is not allowed with 64-bit domain */
     if ( is_64bit_domain(d) && is_thumb )
         return PSCI_INVALID_ADDRESS;
 
-    if ( !test_bit(_VPF_down, &v->pause_flags) )
-        return PSCI_ALREADY_ON;
-
     if ( (ctxt = alloc_vcpu_guest_context()) == NULL )
         return PSCI_DENIED;
 
@@ -78,11 +68,32 @@ static int do_common_cpu_on(register_t target_cpu, register_t entry_point,
     if ( rc < 0 )
         return PSCI_DENIED;
 
-    vcpu_wake(v);
-
     return PSCI_SUCCESS;
 }
 
+static int do_common_cpu_on(register_t target_cpu, register_t entry_point,
+                            register_t context_id)
+{
+    int rc;
+    struct vcpu *v;
+    struct domain *d = current->domain;
+    register_t vcpuid;
+
+    vcpuid = vaffinity_to_vcpuid(target_cpu);
+
+    if ( (v = domain_vcpu(d, vcpuid)) == NULL )
+        return PSCI_INVALID_PARAMETERS;
+
+    if ( !test_bit(_VPF_down, &v->pause_flags) )
+        return PSCI_ALREADY_ON;
+
+    rc = do_setup_vcpu_ctx(v, entry_point, context_id);
+    if ( rc == PSCI_SUCCESS )
+        vcpu_wake(v);
+
+    return rc;
+}
+
 static int32_t do_psci_cpu_on(uint32_t vcpuid, register_t entry_point)
 {
     int32_t ret;
@@ -197,6 +208,52 @@ static void do_psci_0_2_system_reset(void)
     domain_shutdown(d,SHUTDOWN_reboot);
 }
 
+static void do_resume_on_error(struct domain *d)
+{
+    struct vcpu *v;
+
+    spin_lock(&d->shutdown_lock);
+
+    d->is_shutting_down = d->is_shut_down = 0;
+    d->shutdown_code = SHUTDOWN_CODE_INVALID;
+
+    for_each_vcpu ( d, v )
+    {
+        if ( v->paused_for_shutdown )
+            vcpu_unpause(v);
+        v->paused_for_shutdown = 0;
+    }
+
+    spin_unlock(&d->shutdown_lock);
+}
+
+static int32_t do_psci_1_0_system_suspend(register_t epoint, register_t cid)
+{
+    int ret;
+    struct vcpu *v;
+    struct domain *d = current->domain;
+
+    /* Drop this check once SYSTEM_SUSPEND is supported in hardware domain */
+    if ( is_hardware_domain(d) )
+        return PSCI_NOT_SUPPORTED;
+
+    domain_shutdown(d, SHUTDOWN_suspend);
+
+    /* Ensure that all CPUs other than the calling one are offline */
+    for_each_vcpu ( d, v )
+        if ( v != current && is_vcpu_online(v) )
+        {
+            do_resume_on_error(d);
+            return PSCI_DENIED;
+        }
+
+    ret = do_setup_vcpu_ctx(current, epoint, cid);
+    if ( ret != PSCI_SUCCESS )
+        do_resume_on_error(d);
+
+    return ret;
+}
+
 static int32_t do_psci_1_0_features(uint32_t psci_func_id)
 {
     /* /!\ Ordered by function ID and not name */
@@ -214,6 +271,8 @@ static int32_t do_psci_1_0_features(uint32_t psci_func_id)
     case PSCI_0_2_FN32_SYSTEM_OFF:
     case PSCI_0_2_FN32_SYSTEM_RESET:
     case PSCI_1_0_FN32_PSCI_FEATURES:
+    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
+    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
     case ARM_SMCCC_VERSION_FID:
         return 0;
     default:
@@ -344,6 +403,17 @@ bool do_vpsci_0_2_call(struct cpu_user_regs *regs, uint32_t fid)
         return true;
     }
 
+    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
+    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
+    {
+        register_t epoint = PSCI_ARG(regs,1);
+        register_t cid = PSCI_ARG(regs,2);
+
+        perfc_incr(vpsci_system_suspend);
+        PSCI_SET_RESULT(regs, do_psci_1_0_system_suspend(epoint, cid));
+        return true;
+    }
+
     default:
         return false;
     }
-- 
2.48.1Hi Mykola,
On 27/06/2025 11:51, Mykola Kvach wrote:
> diff --git a/xen/arch/arm/include/asm/perfc_defn.h b/xen/arch/arm/include/asm/perfc_defn.h
> index effd25b69e..8dfcac7e3b 100644
> --- a/xen/arch/arm/include/asm/perfc_defn.h
> +++ b/xen/arch/arm/include/asm/perfc_defn.h
> @@ -33,6 +33,7 @@ PERFCOUNTER(vpsci_system_reset,        "vpsci: system_reset")
>   PERFCOUNTER(vpsci_cpu_suspend,         "vpsci: cpu_suspend")
>   PERFCOUNTER(vpsci_cpu_affinity_info,   "vpsci: cpu_affinity_info")
>   PERFCOUNTER(vpsci_features,            "vpsci: features")
> +PERFCOUNTER(vpsci_system_suspend,      "vpsci: system_suspend")
>   
>   PERFCOUNTER(vcpu_kick,                 "vcpu: notify other vcpu")
>   
> diff --git a/xen/arch/arm/include/asm/psci.h b/xen/arch/arm/include/asm/psci.h
> index 4780972621..48a93e6b79 100644
> --- a/xen/arch/arm/include/asm/psci.h
> +++ b/xen/arch/arm/include/asm/psci.h
> @@ -47,10 +47,12 @@ void call_psci_system_reset(void);
>   #define PSCI_0_2_FN32_SYSTEM_OFF          PSCI_0_2_FN32(8)
>   #define PSCI_0_2_FN32_SYSTEM_RESET        PSCI_0_2_FN32(9)
>   #define PSCI_1_0_FN32_PSCI_FEATURES       PSCI_0_2_FN32(10)
> +#define PSCI_1_0_FN32_SYSTEM_SUSPEND      PSCI_0_2_FN32(14)
>   
>   #define PSCI_0_2_FN64_CPU_SUSPEND         PSCI_0_2_FN64(1)
>   #define PSCI_0_2_FN64_CPU_ON              PSCI_0_2_FN64(3)
>   #define PSCI_0_2_FN64_AFFINITY_INFO       PSCI_0_2_FN64(4)
> +#define PSCI_1_0_FN64_SYSTEM_SUSPEND      PSCI_0_2_FN64(14)
>   
>   /* PSCI v0.2 affinity level state returned by AFFINITY_INFO */
>   #define PSCI_0_2_AFFINITY_LEVEL_ON      0
> diff --git a/xen/arch/arm/include/asm/vpsci.h b/xen/arch/arm/include/asm/vpsci.h
> index 0cca5e6830..69d40f9d7f 100644
> --- a/xen/arch/arm/include/asm/vpsci.h
> +++ b/xen/arch/arm/include/asm/vpsci.h
> @@ -23,7 +23,7 @@
>   #include <asm/psci.h>
>   
>   /* Number of function implemented by virtual PSCI (only 0.2 or later) */
> -#define VPSCI_NR_FUNCS  12
> +#define VPSCI_NR_FUNCS  14
>   
>   /* Functions handle PSCI calls from the guests */
>   bool do_vpsci_0_1_call(struct cpu_user_regs *regs, uint32_t fid);
> diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
> index 67296dabb5..f9c09a49e2 100644
> --- a/xen/arch/arm/mmu/p2m.c
> +++ b/xen/arch/arm/mmu/p2m.c
> @@ -6,6 +6,8 @@
>   #include <xen/sched.h>
>   #include <xen/softirq.h>
>   
> +#include <public/sched.h>
> +
>   #include <asm/alternative.h>
>   #include <asm/event.h>
>   #include <asm/flushtlb.h>
> @@ -198,7 +200,9 @@ void dump_p2m_lookup(struct domain *d, paddr_t addr)
>    */
>   void p2m_save_state(struct vcpu *p)
>   {
> -    p->arch.sctlr = READ_SYSREG(SCTLR_EL1);
> +    if ( !(p->domain->is_shutting_down &&
> +           p->domain->shutdown_code == SHUTDOWN_suspend) )
This is definitely not obvious why you need to change p2m_save_state(). 
AFAIU, you are doing this because when suspending the system, you will 
overwrite p->arch.sctlr. However, this is super fragile. For instance, 
you still seem to overwrite TTBR{0,1} and TTBCR.
TTBR{0,1} are technically unknown at boot. So it is fine to ignore them. 
  But for TTBCR, I am not 100% whether this is supposed to be unknown.
Anyway, adding more "if (...)" is not the right solution because in the 
future we may decide to reset more registers.
It would be better if we stash the value sand then update the registers. 
Another possibility would be to entirely skip the save path for CPUs 
that are turned off (after all this is a bit useless work...).
> +static int do_common_cpu_on(register_t target_cpu, register_t entry_point,
> +                            register_t context_id)
> +{
> +    int rc;
> +    struct vcpu *v;
> +    struct domain *d = current->domain;
> +    register_t vcpuid;
> +
> +    vcpuid = vaffinity_to_vcpuid(target_cpu);
> +
> +    if ( (v = domain_vcpu(d, vcpuid)) == NULL )
> +        return PSCI_INVALID_PARAMETERS;
> +
> +    if ( !test_bit(_VPF_down, &v->pause_flags) )
> +        return PSCI_ALREADY_ON;
> +
> +    rc = do_setup_vcpu_ctx(v, entry_point, context_id);
> +    if ( rc == PSCI_SUCCESS )
> +        vcpu_wake(v);
> +
> +    return rc;
> +}
> +
>   static int32_t do_psci_cpu_on(uint32_t vcpuid, register_t entry_point)
>   {
>       int32_t ret;
> @@ -197,6 +208,52 @@ static void do_psci_0_2_system_reset(void)
>       domain_shutdown(d,SHUTDOWN_reboot);
>   }
>   
> +static void do_resume_on_error(struct domain *d)
This looks like an open-coding version of domain_resume() without the 
domain_{,un}pause(). What worries me is there is a comment on top of
domain_pause() explaining why it is called. I understand, we can't call 
domain_pause() here (it doesn't work on the current domain). However, it 
feels to me there is an explanation necessary why this is fine to diverge.
I am not a scheduler expert, so I am CCing Juergen in the hope he could 
provide some inputs.
> +{
> +    struct vcpu *v;
> +
> +    spin_lock(&d->shutdown_lock);
> +
> +    d->is_shutting_down = d->is_shut_down = 0;
> +    d->shutdown_code = SHUTDOWN_CODE_INVALID;
> +
> +    for_each_vcpu ( d, v )
> +    {
> +        if ( v->paused_for_shutdown )
> +            vcpu_unpause(v);
> +        v->paused_for_shutdown = 0;
> +    }
> +
> +    spin_unlock(&d->shutdown_lock);
> +}
> +
> +static int32_t do_psci_1_0_system_suspend(register_t epoint, register_t cid)
> +{
> +    int ret;
> +    struct vcpu *v;
> +    struct domain *d = current->domain;
> +
> +    /* Drop this check once SYSTEM_SUSPEND is supported in hardware domain */
> +    if ( is_hardware_domain(d) )
> +        return PSCI_NOT_SUPPORTED;
> +
> +    domain_shutdown(d, SHUTDOWN_suspend);
It would be good to add a comment explaining why you need to call 
domain_shutdown() first. Otherwise, one could wonder whether we can get 
rid of the complexity when a vCPU is still online.
> +
> +    /* Ensure that all CPUs other than the calling one are offline */
> +    for_each_vcpu ( d, v )
NIT: Even if this is a single "statement" below, I think adding the 
brace would make the code clearer.
> +        if ( v != current && is_vcpu_online(v) )
> +        {
> +            do_resume_on_error(d);
> +            return PSCI_DENIED;
> +        }
> +
> +    ret = do_setup_vcpu_ctx(current, epoint, cid);
> +    if ( ret != PSCI_SUCCESS )
> +        do_resume_on_error(d);
> +
> +    return ret;
> +}
> +
>   static int32_t do_psci_1_0_features(uint32_t psci_func_id)
>   {
>       /* /!\ Ordered by function ID and not name */
> @@ -214,6 +271,8 @@ static int32_t do_psci_1_0_features(uint32_t psci_func_id)
>       case PSCI_0_2_FN32_SYSTEM_OFF:
>       case PSCI_0_2_FN32_SYSTEM_RESET:
>       case PSCI_1_0_FN32_PSCI_FEATURES:
> +    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
> +    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
>       case ARM_SMCCC_VERSION_FID:
>           return 0;
>       default:
> @@ -344,6 +403,17 @@ bool do_vpsci_0_2_call(struct cpu_user_regs *regs, uint32_t fid)
>           return true;
>       }
>   
> +    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
> +    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
> +    {
> +        register_t epoint = PSCI_ARG(regs,1);
> +        register_t cid = PSCI_ARG(regs,2);
Coding style: For the two lines above, there is a missing space after ",".
Also, if a 64-bit domain is calling the 32-bit version, then we also 
need to ignore the top 32-bits. AFAICT CPU_ON also have the same issue. 
I am not going to ask fixing CPU_ON (it would be good though), but I 
will at least ask we don't spread the mistake further.
Cheers,
-- 
Julien Grall
                
            On 28.06.25 20:17, Julien Grall wrote:
> Hi Mykola,
> 
> On 27/06/2025 11:51, Mykola Kvach wrote:
>> diff --git a/xen/arch/arm/include/asm/perfc_defn.h b/xen/arch/arm/include/asm/ 
>> perfc_defn.h
>> index effd25b69e..8dfcac7e3b 100644
>> --- a/xen/arch/arm/include/asm/perfc_defn.h
>> +++ b/xen/arch/arm/include/asm/perfc_defn.h
>> @@ -33,6 +33,7 @@ PERFCOUNTER(vpsci_system_reset,        "vpsci: system_reset")
>>   PERFCOUNTER(vpsci_cpu_suspend,         "vpsci: cpu_suspend")
>>   PERFCOUNTER(vpsci_cpu_affinity_info,   "vpsci: cpu_affinity_info")
>>   PERFCOUNTER(vpsci_features,            "vpsci: features")
>> +PERFCOUNTER(vpsci_system_suspend,      "vpsci: system_suspend")
>>   PERFCOUNTER(vcpu_kick,                 "vcpu: notify other vcpu")
>> diff --git a/xen/arch/arm/include/asm/psci.h b/xen/arch/arm/include/asm/psci.h
>> index 4780972621..48a93e6b79 100644
>> --- a/xen/arch/arm/include/asm/psci.h
>> +++ b/xen/arch/arm/include/asm/psci.h
>> @@ -47,10 +47,12 @@ void call_psci_system_reset(void);
>>   #define PSCI_0_2_FN32_SYSTEM_OFF          PSCI_0_2_FN32(8)
>>   #define PSCI_0_2_FN32_SYSTEM_RESET        PSCI_0_2_FN32(9)
>>   #define PSCI_1_0_FN32_PSCI_FEATURES       PSCI_0_2_FN32(10)
>> +#define PSCI_1_0_FN32_SYSTEM_SUSPEND      PSCI_0_2_FN32(14)
>>   #define PSCI_0_2_FN64_CPU_SUSPEND         PSCI_0_2_FN64(1)
>>   #define PSCI_0_2_FN64_CPU_ON              PSCI_0_2_FN64(3)
>>   #define PSCI_0_2_FN64_AFFINITY_INFO       PSCI_0_2_FN64(4)
>> +#define PSCI_1_0_FN64_SYSTEM_SUSPEND      PSCI_0_2_FN64(14)
>>   /* PSCI v0.2 affinity level state returned by AFFINITY_INFO */
>>   #define PSCI_0_2_AFFINITY_LEVEL_ON      0
>> diff --git a/xen/arch/arm/include/asm/vpsci.h b/xen/arch/arm/include/asm/vpsci.h
>> index 0cca5e6830..69d40f9d7f 100644
>> --- a/xen/arch/arm/include/asm/vpsci.h
>> +++ b/xen/arch/arm/include/asm/vpsci.h
>> @@ -23,7 +23,7 @@
>>   #include <asm/psci.h>
>>   /* Number of function implemented by virtual PSCI (only 0.2 or later) */
>> -#define VPSCI_NR_FUNCS  12
>> +#define VPSCI_NR_FUNCS  14
>>   /* Functions handle PSCI calls from the guests */
>>   bool do_vpsci_0_1_call(struct cpu_user_regs *regs, uint32_t fid);
>> diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
>> index 67296dabb5..f9c09a49e2 100644
>> --- a/xen/arch/arm/mmu/p2m.c
>> +++ b/xen/arch/arm/mmu/p2m.c
>> @@ -6,6 +6,8 @@
>>   #include <xen/sched.h>
>>   #include <xen/softirq.h>
>> +#include <public/sched.h>
>> +
>>   #include <asm/alternative.h>
>>   #include <asm/event.h>
>>   #include <asm/flushtlb.h>
>> @@ -198,7 +200,9 @@ void dump_p2m_lookup(struct domain *d, paddr_t addr)
>>    */
>>   void p2m_save_state(struct vcpu *p)
>>   {
>> -    p->arch.sctlr = READ_SYSREG(SCTLR_EL1);
>> +    if ( !(p->domain->is_shutting_down &&
>> +           p->domain->shutdown_code == SHUTDOWN_suspend) )
> 
> This is definitely not obvious why you need to change p2m_save_state(). AFAIU, 
> you are doing this because when suspending the system, you will overwrite p- 
>  >arch.sctlr. However, this is super fragile. For instance, you still seem to 
> overwrite TTBR{0,1} and TTBCR.
> 
> TTBR{0,1} are technically unknown at boot. So it is fine to ignore them.  But 
> for TTBCR, I am not 100% whether this is supposed to be unknown.
> 
> Anyway, adding more "if (...)" is not the right solution because in the future 
> we may decide to reset more registers.
> 
> It would be better if we stash the value sand then update the registers. Another 
> possibility would be to entirely skip the save path for CPUs that are turned off 
> (after all this is a bit useless work...).
> 
>> +static int do_common_cpu_on(register_t target_cpu, register_t entry_point,
>> +                            register_t context_id)
>> +{
>> +    int rc;
>> +    struct vcpu *v;
>> +    struct domain *d = current->domain;
>> +    register_t vcpuid;
>> +
>> +    vcpuid = vaffinity_to_vcpuid(target_cpu);
>> +
>> +    if ( (v = domain_vcpu(d, vcpuid)) == NULL )
>> +        return PSCI_INVALID_PARAMETERS;
>> +
>> +    if ( !test_bit(_VPF_down, &v->pause_flags) )
>> +        return PSCI_ALREADY_ON;
>> +
>> +    rc = do_setup_vcpu_ctx(v, entry_point, context_id);
>> +    if ( rc == PSCI_SUCCESS )
>> +        vcpu_wake(v);
>> +
>> +    return rc;
>> +}
>> +
>>   static int32_t do_psci_cpu_on(uint32_t vcpuid, register_t entry_point)
>>   {
>>       int32_t ret;
>> @@ -197,6 +208,52 @@ static void do_psci_0_2_system_reset(void)
>>       domain_shutdown(d,SHUTDOWN_reboot);
>>   }
>> +static void do_resume_on_error(struct domain *d)
> 
> This looks like an open-coding version of domain_resume() without the 
> domain_{,un}pause(). What worries me is there is a comment on top of
> domain_pause() explaining why it is called. I understand, we can't call 
> domain_pause() here (it doesn't work on the current domain). However, it feels 
> to me there is an explanation necessary why this is fine to diverge.
> 
> I am not a scheduler expert, so I am CCing Juergen in the hope he could provide 
> some inputs.
I don't think the reason for domain_[un]pause() is directly scheduling
related.
It seems to be x86 specific for shadow page table handling.
In any case I'd suggest to split domain_resume() into 2 functions, one
of them (e.g. domain_resume_nopause()) replacing do_resume_on_error()
and domain_resume() just pausing the domain, calling the new function,
and then unpausing the domain again.
Another option might be to move the suspend action into a tasklet, enabling
to call domain_resume() directly if needed. I didn't check whether other
constraints even allow this idea, though.
Juergen
                
            Hi Jürgen,
On Mon, Jul 21, 2025 at 11:08 AM Jürgen Groß <jgross@suse.com> wrote:
>
> On 28.06.25 20:17, Julien Grall wrote:
> > Hi Mykola,
> >
> > On 27/06/2025 11:51, Mykola Kvach wrote:
> >> diff --git a/xen/arch/arm/include/asm/perfc_defn.h b/xen/arch/arm/include/asm/
> >> perfc_defn.h
> >> index effd25b69e..8dfcac7e3b 100644
> >> --- a/xen/arch/arm/include/asm/perfc_defn.h
> >> +++ b/xen/arch/arm/include/asm/perfc_defn.h
> >> @@ -33,6 +33,7 @@ PERFCOUNTER(vpsci_system_reset,        "vpsci: system_reset")
> >>   PERFCOUNTER(vpsci_cpu_suspend,         "vpsci: cpu_suspend")
> >>   PERFCOUNTER(vpsci_cpu_affinity_info,   "vpsci: cpu_affinity_info")
> >>   PERFCOUNTER(vpsci_features,            "vpsci: features")
> >> +PERFCOUNTER(vpsci_system_suspend,      "vpsci: system_suspend")
> >>   PERFCOUNTER(vcpu_kick,                 "vcpu: notify other vcpu")
> >> diff --git a/xen/arch/arm/include/asm/psci.h b/xen/arch/arm/include/asm/psci.h
> >> index 4780972621..48a93e6b79 100644
> >> --- a/xen/arch/arm/include/asm/psci.h
> >> +++ b/xen/arch/arm/include/asm/psci.h
> >> @@ -47,10 +47,12 @@ void call_psci_system_reset(void);
> >>   #define PSCI_0_2_FN32_SYSTEM_OFF          PSCI_0_2_FN32(8)
> >>   #define PSCI_0_2_FN32_SYSTEM_RESET        PSCI_0_2_FN32(9)
> >>   #define PSCI_1_0_FN32_PSCI_FEATURES       PSCI_0_2_FN32(10)
> >> +#define PSCI_1_0_FN32_SYSTEM_SUSPEND      PSCI_0_2_FN32(14)
> >>   #define PSCI_0_2_FN64_CPU_SUSPEND         PSCI_0_2_FN64(1)
> >>   #define PSCI_0_2_FN64_CPU_ON              PSCI_0_2_FN64(3)
> >>   #define PSCI_0_2_FN64_AFFINITY_INFO       PSCI_0_2_FN64(4)
> >> +#define PSCI_1_0_FN64_SYSTEM_SUSPEND      PSCI_0_2_FN64(14)
> >>   /* PSCI v0.2 affinity level state returned by AFFINITY_INFO */
> >>   #define PSCI_0_2_AFFINITY_LEVEL_ON      0
> >> diff --git a/xen/arch/arm/include/asm/vpsci.h b/xen/arch/arm/include/asm/vpsci.h
> >> index 0cca5e6830..69d40f9d7f 100644
> >> --- a/xen/arch/arm/include/asm/vpsci.h
> >> +++ b/xen/arch/arm/include/asm/vpsci.h
> >> @@ -23,7 +23,7 @@
> >>   #include <asm/psci.h>
> >>   /* Number of function implemented by virtual PSCI (only 0.2 or later) */
> >> -#define VPSCI_NR_FUNCS  12
> >> +#define VPSCI_NR_FUNCS  14
> >>   /* Functions handle PSCI calls from the guests */
> >>   bool do_vpsci_0_1_call(struct cpu_user_regs *regs, uint32_t fid);
> >> diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
> >> index 67296dabb5..f9c09a49e2 100644
> >> --- a/xen/arch/arm/mmu/p2m.c
> >> +++ b/xen/arch/arm/mmu/p2m.c
> >> @@ -6,6 +6,8 @@
> >>   #include <xen/sched.h>
> >>   #include <xen/softirq.h>
> >> +#include <public/sched.h>
> >> +
> >>   #include <asm/alternative.h>
> >>   #include <asm/event.h>
> >>   #include <asm/flushtlb.h>
> >> @@ -198,7 +200,9 @@ void dump_p2m_lookup(struct domain *d, paddr_t addr)
> >>    */
> >>   void p2m_save_state(struct vcpu *p)
> >>   {
> >> -    p->arch.sctlr = READ_SYSREG(SCTLR_EL1);
> >> +    if ( !(p->domain->is_shutting_down &&
> >> +           p->domain->shutdown_code == SHUTDOWN_suspend) )
> >
> > This is definitely not obvious why you need to change p2m_save_state(). AFAIU,
> > you are doing this because when suspending the system, you will overwrite p-
> >  >arch.sctlr. However, this is super fragile. For instance, you still seem to
> > overwrite TTBR{0,1} and TTBCR.
> >
> > TTBR{0,1} are technically unknown at boot. So it is fine to ignore them.  But
> > for TTBCR, I am not 100% whether this is supposed to be unknown.
> >
> > Anyway, adding more "if (...)" is not the right solution because in the future
> > we may decide to reset more registers.
> >
> > It would be better if we stash the value sand then update the registers. Another
> > possibility would be to entirely skip the save path for CPUs that are turned off
> > (after all this is a bit useless work...).
> >
> >> +static int do_common_cpu_on(register_t target_cpu, register_t entry_point,
> >> +                            register_t context_id)
> >> +{
> >> +    int rc;
> >> +    struct vcpu *v;
> >> +    struct domain *d = current->domain;
> >> +    register_t vcpuid;
> >> +
> >> +    vcpuid = vaffinity_to_vcpuid(target_cpu);
> >> +
> >> +    if ( (v = domain_vcpu(d, vcpuid)) == NULL )
> >> +        return PSCI_INVALID_PARAMETERS;
> >> +
> >> +    if ( !test_bit(_VPF_down, &v->pause_flags) )
> >> +        return PSCI_ALREADY_ON;
> >> +
> >> +    rc = do_setup_vcpu_ctx(v, entry_point, context_id);
> >> +    if ( rc == PSCI_SUCCESS )
> >> +        vcpu_wake(v);
> >> +
> >> +    return rc;
> >> +}
> >> +
> >>   static int32_t do_psci_cpu_on(uint32_t vcpuid, register_t entry_point)
> >>   {
> >>       int32_t ret;
> >> @@ -197,6 +208,52 @@ static void do_psci_0_2_system_reset(void)
> >>       domain_shutdown(d,SHUTDOWN_reboot);
> >>   }
> >> +static void do_resume_on_error(struct domain *d)
> >
> > This looks like an open-coding version of domain_resume() without the
> > domain_{,un}pause(). What worries me is there is a comment on top of
> > domain_pause() explaining why it is called. I understand, we can't call
> > domain_pause() here (it doesn't work on the current domain). However, it feels
> > to me there is an explanation necessary why this is fine to diverge.
> >
> > I am not a scheduler expert, so I am CCing Juergen in the hope he could provide
> > some inputs.
>
> I don't think the reason for domain_[un]pause() is directly scheduling
> related.
>
> It seems to be x86 specific for shadow page table handling.
>
> In any case I'd suggest to split domain_resume() into 2 functions, one
> of them (e.g. domain_resume_nopause()) replacing do_resume_on_error()
> and domain_resume() just pausing the domain, calling the new function,
> and then unpausing the domain again.
Got it — thank you for the review!
>
> Another option might be to move the suspend action into a tasklet, enabling
> to call domain_resume() directly if needed. I didn't check whether other
> constraints even allow this idea, though.
>
>
> Juergen
Best regards,
Mykola
                
            Hi Julien,
On Sat, Jun 28, 2025 at 9:17 PM Julien Grall <julien@xen.org> wrote:
>
> Hi Mykola,
>
> On 27/06/2025 11:51, Mykola Kvach wrote:
> > diff --git a/xen/arch/arm/include/asm/perfc_defn.h b/xen/arch/arm/include/asm/perfc_defn.h
> > index effd25b69e..8dfcac7e3b 100644
> > --- a/xen/arch/arm/include/asm/perfc_defn.h
> > +++ b/xen/arch/arm/include/asm/perfc_defn.h
> > @@ -33,6 +33,7 @@ PERFCOUNTER(vpsci_system_reset,        "vpsci: system_reset")
> >   PERFCOUNTER(vpsci_cpu_suspend,         "vpsci: cpu_suspend")
> >   PERFCOUNTER(vpsci_cpu_affinity_info,   "vpsci: cpu_affinity_info")
> >   PERFCOUNTER(vpsci_features,            "vpsci: features")
> > +PERFCOUNTER(vpsci_system_suspend,      "vpsci: system_suspend")
> >
> >   PERFCOUNTER(vcpu_kick,                 "vcpu: notify other vcpu")
> >
> > diff --git a/xen/arch/arm/include/asm/psci.h b/xen/arch/arm/include/asm/psci.h
> > index 4780972621..48a93e6b79 100644
> > --- a/xen/arch/arm/include/asm/psci.h
> > +++ b/xen/arch/arm/include/asm/psci.h
> > @@ -47,10 +47,12 @@ void call_psci_system_reset(void);
> >   #define PSCI_0_2_FN32_SYSTEM_OFF          PSCI_0_2_FN32(8)
> >   #define PSCI_0_2_FN32_SYSTEM_RESET        PSCI_0_2_FN32(9)
> >   #define PSCI_1_0_FN32_PSCI_FEATURES       PSCI_0_2_FN32(10)
> > +#define PSCI_1_0_FN32_SYSTEM_SUSPEND      PSCI_0_2_FN32(14)
> >
> >   #define PSCI_0_2_FN64_CPU_SUSPEND         PSCI_0_2_FN64(1)
> >   #define PSCI_0_2_FN64_CPU_ON              PSCI_0_2_FN64(3)
> >   #define PSCI_0_2_FN64_AFFINITY_INFO       PSCI_0_2_FN64(4)
> > +#define PSCI_1_0_FN64_SYSTEM_SUSPEND      PSCI_0_2_FN64(14)
> >
> >   /* PSCI v0.2 affinity level state returned by AFFINITY_INFO */
> >   #define PSCI_0_2_AFFINITY_LEVEL_ON      0
> > diff --git a/xen/arch/arm/include/asm/vpsci.h b/xen/arch/arm/include/asm/vpsci.h
> > index 0cca5e6830..69d40f9d7f 100644
> > --- a/xen/arch/arm/include/asm/vpsci.h
> > +++ b/xen/arch/arm/include/asm/vpsci.h
> > @@ -23,7 +23,7 @@
> >   #include <asm/psci.h>
> >
> >   /* Number of function implemented by virtual PSCI (only 0.2 or later) */
> > -#define VPSCI_NR_FUNCS  12
> > +#define VPSCI_NR_FUNCS  14
> >
> >   /* Functions handle PSCI calls from the guests */
> >   bool do_vpsci_0_1_call(struct cpu_user_regs *regs, uint32_t fid);
> > diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
> > index 67296dabb5..f9c09a49e2 100644
> > --- a/xen/arch/arm/mmu/p2m.c
> > +++ b/xen/arch/arm/mmu/p2m.c
> > @@ -6,6 +6,8 @@
> >   #include <xen/sched.h>
> >   #include <xen/softirq.h>
> >
> > +#include <public/sched.h>
> > +
> >   #include <asm/alternative.h>
> >   #include <asm/event.h>
> >   #include <asm/flushtlb.h>
> > @@ -198,7 +200,9 @@ void dump_p2m_lookup(struct domain *d, paddr_t addr)
> >    */
> >   void p2m_save_state(struct vcpu *p)
> >   {
> > -    p->arch.sctlr = READ_SYSREG(SCTLR_EL1);
> > +    if ( !(p->domain->is_shutting_down &&
> > +           p->domain->shutdown_code == SHUTDOWN_suspend) )
>
> This is definitely not obvious why you need to change p2m_save_state().
> AFAIU, you are doing this because when suspending the system, you will
> overwrite p->arch.sctlr. However, this is super fragile. For instance,
> you still seem to overwrite TTBR{0,1} and TTBCR.
>
> TTBR{0,1} are technically unknown at boot. So it is fine to ignore them.
>   But for TTBCR, I am not 100% whether this is supposed to be unknown.
Yes, you are right. According to the PSCI specification, this is fine --
see section "6.4.3 Initial core configuration".
The MMU must be disabled on startup, so it looks like
we don't need to worry about restoring TTBR{0,1}, or TTBCR, see
"6.4.3.4 MMU, cache and branch predictor enables"
>
> Anyway, adding more "if (...)" is not the right solution because in the
> future we may decide to reset more registers.
Agree
>
> It would be better if we stash the value sand then update the registers.
> Another possibility would be to entirely skip the save path for CPUs
> that are turned off (after all this is a bit useless work...).
Do you mean we should avoid calling ctxt_switch_from for a suspended
domain?
>
> > +static int do_common_cpu_on(register_t target_cpu, register_t entry_point,
> > +                            register_t context_id)
> > +{
> > +    int rc;
> > +    struct vcpu *v;
> > +    struct domain *d = current->domain;
> > +    register_t vcpuid;
> > +
> > +    vcpuid = vaffinity_to_vcpuid(target_cpu);
> > +
> > +    if ( (v = domain_vcpu(d, vcpuid)) == NULL )
> > +        return PSCI_INVALID_PARAMETERS;
> > +
> > +    if ( !test_bit(_VPF_down, &v->pause_flags) )
> > +        return PSCI_ALREADY_ON;
> > +
> > +    rc = do_setup_vcpu_ctx(v, entry_point, context_id);
> > +    if ( rc == PSCI_SUCCESS )
> > +        vcpu_wake(v);
> > +
> > +    return rc;
> > +}
> > +
> >   static int32_t do_psci_cpu_on(uint32_t vcpuid, register_t entry_point)
> >   {
> >       int32_t ret;
> > @@ -197,6 +208,52 @@ static void do_psci_0_2_system_reset(void)
> >       domain_shutdown(d,SHUTDOWN_reboot);
> >   }
> >
> > +static void do_resume_on_error(struct domain *d)
>
> This looks like an open-coding version of domain_resume() without the
> domain_{,un}pause(). What worries me is there is a comment on top of
> domain_pause() explaining why it is called. I understand, we can't call
> domain_pause() here (it doesn't work on the current domain). However, it
> feels to me there is an explanation necessary why this is fine to diverge.
>
> I am not a scheduler expert, so I am CCing Juergen in the hope he could
> provide some inputs.
>
> > +{
> > +    struct vcpu *v;
> > +
> > +    spin_lock(&d->shutdown_lock);
> > +
> > +    d->is_shutting_down = d->is_shut_down = 0;
> > +    d->shutdown_code = SHUTDOWN_CODE_INVALID;
> > +
> > +    for_each_vcpu ( d, v )
> > +    {
> > +        if ( v->paused_for_shutdown )
> > +            vcpu_unpause(v);
> > +        v->paused_for_shutdown = 0;
> > +    }
> > +
> > +    spin_unlock(&d->shutdown_lock);
> > +}
> > +
> > +static int32_t do_psci_1_0_system_suspend(register_t epoint, register_t cid)
> > +{
> > +    int ret;
> > +    struct vcpu *v;
> > +    struct domain *d = current->domain;
> > +
> > +    /* Drop this check once SYSTEM_SUSPEND is supported in hardware domain */
> > +    if ( is_hardware_domain(d) )
> > +        return PSCI_NOT_SUPPORTED;
> > +
> > +    domain_shutdown(d, SHUTDOWN_suspend);
>
> It would be good to add a comment explaining why you need to call
> domain_shutdown() first. Otherwise, one could wonder whether we can get
> rid of the complexity when a vCPU is still online.
It's done first here because domain_shutdown() handles pausing of the
vCPUs internally, and this allows us to securely check whether the vCPUs
are online or not without interference from the guest.
But you're probably right — a better solution might be to perform proper
checking of which vCPUs are still online before calling it.
>
> > +
> > +    /* Ensure that all CPUs other than the calling one are offline */
> > +    for_each_vcpu ( d, v )
>
> NIT: Even if this is a single "statement" below, I think adding the
> brace would make the code clearer.
Ok
>
> > +        if ( v != current && is_vcpu_online(v) )
> > +        {
> > +            do_resume_on_error(d);
> > +            return PSCI_DENIED;
> > +        }
> > +
> > +    ret = do_setup_vcpu_ctx(current, epoint, cid);
> > +    if ( ret != PSCI_SUCCESS )
> > +        do_resume_on_error(d);
> > +
> > +    return ret;
> > +}
> > +
> >   static int32_t do_psci_1_0_features(uint32_t psci_func_id)
> >   {
> >       /* /!\ Ordered by function ID and not name */
> > @@ -214,6 +271,8 @@ static int32_t do_psci_1_0_features(uint32_t psci_func_id)
> >       case PSCI_0_2_FN32_SYSTEM_OFF:
> >       case PSCI_0_2_FN32_SYSTEM_RESET:
> >       case PSCI_1_0_FN32_PSCI_FEATURES:
> > +    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
> > +    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
> >       case ARM_SMCCC_VERSION_FID:
> >           return 0;
> >       default:
> > @@ -344,6 +403,17 @@ bool do_vpsci_0_2_call(struct cpu_user_regs *regs, uint32_t fid)
> >           return true;
> >       }
> >
> > +    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
> > +    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
> > +    {
> > +        register_t epoint = PSCI_ARG(regs,1);
> > +        register_t cid = PSCI_ARG(regs,2);
>
> Coding style: For the two lines above, there is a missing space after ",".
Sure, will fix — thanks for pointing that out.
>
> Also, if a 64-bit domain is calling the 32-bit version, then we also
> need to ignore the top 32-bits. AFAICT CPU_ON also have the same issue.
> I am not going to ask fixing CPU_ON (it would be good though), but I
> will at least ask we don't spread the mistake further.
Maybe it would be better to return an error in this case?
Should I also add checks for the case where the guest OS is 32-bit but
tries to call the 64-bit version of SYSTEM_SUSPEND?
Best regards,
Mykola
>
> Cheers,
>
> --
> Julien Grall
>
                
            Hi,
On 02/07/2025 11:01, Mykola Kvach wrote:
> On Sat, Jun 28, 2025 at 9:17 PM Julien Grall <julien@xen.org> wrote:
>> It would be better if we stash the value sand then update the registers.
>> Another possibility would be to entirely skip the save path for CPUs
>> that are turned off (after all this is a bit useless work...).
> 
> Do you mean we should avoid calling ctxt_switch_from for a suspended
> domain?
This would be one way to handle it. I haven't looked in details whether 
there are any other implications.
[...]
>>> +{
>>> +    struct vcpu *v;
>>> +
>>> +    spin_lock(&d->shutdown_lock);
>>> +
>>> +    d->is_shutting_down = d->is_shut_down = 0;
>>> +    d->shutdown_code = SHUTDOWN_CODE_INVALID;
>>> +
>>> +    for_each_vcpu ( d, v )
>>> +    {
>>> +        if ( v->paused_for_shutdown )
>>> +            vcpu_unpause(v);
>>> +        v->paused_for_shutdown = 0;
>>> +    }
>>> +
>>> +    spin_unlock(&d->shutdown_lock);
>>> +}
>>> +
>>> +static int32_t do_psci_1_0_system_suspend(register_t epoint, register_t cid)
>>> +{
>>> +    int ret;
>>> +    struct vcpu *v;
>>> +    struct domain *d = current->domain;
>>> +
>>> +    /* Drop this check once SYSTEM_SUSPEND is supported in hardware domain */
>>> +    if ( is_hardware_domain(d) )
>>> +        return PSCI_NOT_SUPPORTED;
>>> +
>>> +    domain_shutdown(d, SHUTDOWN_suspend);
>>
>> It would be good to add a comment explaining why you need to call
>> domain_shutdown() first. Otherwise, one could wonder whether we can get
>> rid of the complexity when a vCPU is still online.
> 
> It's done first here because domain_shutdown() handles pausing of the
> vCPUs internally, and this allows us to securely check whether the vCPUs
> are online or not without interference from the guest.
> 
> But you're probably right — a better solution might be to perform proper
> checking of which vCPUs are still online before calling it.
To clarify, I think this is right to call domain_shutdown() first to 
avoid any race when checking which vCPUs are still online (see the 
discussion we had in the previous version). My point is it would be good 
to document it in the code because this is not obvious.
>>
>>> +        if ( v != current && is_vcpu_online(v) )
>>> +        {
>>> +            do_resume_on_error(d);
>>> +            return PSCI_DENIED;
>>> +        }
>>> +
>>> +    ret = do_setup_vcpu_ctx(current, epoint, cid);
>>> +    if ( ret != PSCI_SUCCESS )
>>> +        do_resume_on_error(d);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>    static int32_t do_psci_1_0_features(uint32_t psci_func_id)
>>>    {
>>>        /* /!\ Ordered by function ID and not name */
>>> @@ -214,6 +271,8 @@ static int32_t do_psci_1_0_features(uint32_t psci_func_id)
>>>        case PSCI_0_2_FN32_SYSTEM_OFF:
>>>        case PSCI_0_2_FN32_SYSTEM_RESET:
>>>        case PSCI_1_0_FN32_PSCI_FEATURES:
>>> +    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
>>> +    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
>>>        case ARM_SMCCC_VERSION_FID:
>>>            return 0;
>>>        default:
>>> @@ -344,6 +403,17 @@ bool do_vpsci_0_2_call(struct cpu_user_regs *regs, uint32_t fid)
>>>            return true;
>>>        }
>>>
>>> +    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
>>> +    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
>>> +    {
>>> +        register_t epoint = PSCI_ARG(regs,1);
>>> +        register_t cid = PSCI_ARG(regs,2);
>>
>> Coding style: For the two lines above, there is a missing space after ",".
> 
> Sure, will fix — thanks for pointing that out.
> 
>>
>> Also, if a 64-bit domain is calling the 32-bit version, then we also
>> need to ignore the top 32-bits. AFAICT CPU_ON also have the same issue.
>> I am not going to ask fixing CPU_ON (it would be good though), but I
>> will at least ask we don't spread the mistake further.
> 
> Maybe it would be better to return an error in this case?
Why should we return an error? This is valid for a 64-bit domain to use 
SMC32 convention.
> Should I also add checks for the case where the guest OS is 32-bit but
> tries to call the 64-bit version of SYSTEM_SUSPEND?
We already have this generic check at the beginning of vsmccc_handle_call().
Cheers,
-- 
Julien Grall
                
            On Wed, Jul 2, 2025 at 3:28 PM Julien Grall <julien@xen.org> wrote:
>
> Hi,
>
> On 02/07/2025 11:01, Mykola Kvach wrote:
> > On Sat, Jun 28, 2025 at 9:17 PM Julien Grall <julien@xen.org> wrote:
> >> It would be better if we stash the value sand then update the registers.
> >> Another possibility would be to entirely skip the save path for CPUs
> >> that are turned off (after all this is a bit useless work...).
> >
> > Do you mean we should avoid calling ctxt_switch_from for a suspended
> > domain?
>
> This would be one way to handle it. I haven't looked in details whether
> there are any other implications.
>
> [...]
>
> >>> +{
> >>> +    struct vcpu *v;
> >>> +
> >>> +    spin_lock(&d->shutdown_lock);
> >>> +
> >>> +    d->is_shutting_down = d->is_shut_down = 0;
> >>> +    d->shutdown_code = SHUTDOWN_CODE_INVALID;
> >>> +
> >>> +    for_each_vcpu ( d, v )
> >>> +    {
> >>> +        if ( v->paused_for_shutdown )
> >>> +            vcpu_unpause(v);
> >>> +        v->paused_for_shutdown = 0;
> >>> +    }
> >>> +
> >>> +    spin_unlock(&d->shutdown_lock);
> >>> +}
> >>> +
> >>> +static int32_t do_psci_1_0_system_suspend(register_t epoint, register_t cid)
> >>> +{
> >>> +    int ret;
> >>> +    struct vcpu *v;
> >>> +    struct domain *d = current->domain;
> >>> +
> >>> +    /* Drop this check once SYSTEM_SUSPEND is supported in hardware domain */
> >>> +    if ( is_hardware_domain(d) )
> >>> +        return PSCI_NOT_SUPPORTED;
> >>> +
> >>> +    domain_shutdown(d, SHUTDOWN_suspend);
> >>
> >> It would be good to add a comment explaining why you need to call
> >> domain_shutdown() first. Otherwise, one could wonder whether we can get
> >> rid of the complexity when a vCPU is still online.
> >
> > It's done first here because domain_shutdown() handles pausing of the
> > vCPUs internally, and this allows us to securely check whether the vCPUs
> > are online or not without interference from the guest.
> >
> > But you're probably right — a better solution might be to perform proper
> > checking of which vCPUs are still online before calling it.
>
> To clarify, I think this is right to call domain_shutdown() first to
> avoid any race when checking which vCPUs are still online (see the
> discussion we had in the previous version). My point is it would be good
> to document it in the code because this is not obvious.
got it
>
>
> >>
> >>> +        if ( v != current && is_vcpu_online(v) )
> >>> +        {
> >>> +            do_resume_on_error(d);
> >>> +            return PSCI_DENIED;
> >>> +        }
> >>> +
> >>> +    ret = do_setup_vcpu_ctx(current, epoint, cid);
> >>> +    if ( ret != PSCI_SUCCESS )
> >>> +        do_resume_on_error(d);
> >>> +
> >>> +    return ret;
> >>> +}
> >>> +
> >>>    static int32_t do_psci_1_0_features(uint32_t psci_func_id)
> >>>    {
> >>>        /* /!\ Ordered by function ID and not name */
> >>> @@ -214,6 +271,8 @@ static int32_t do_psci_1_0_features(uint32_t psci_func_id)
> >>>        case PSCI_0_2_FN32_SYSTEM_OFF:
> >>>        case PSCI_0_2_FN32_SYSTEM_RESET:
> >>>        case PSCI_1_0_FN32_PSCI_FEATURES:
> >>> +    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
> >>> +    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
> >>>        case ARM_SMCCC_VERSION_FID:
> >>>            return 0;
> >>>        default:
> >>> @@ -344,6 +403,17 @@ bool do_vpsci_0_2_call(struct cpu_user_regs *regs, uint32_t fid)
> >>>            return true;
> >>>        }
> >>>
> >>> +    case PSCI_1_0_FN32_SYSTEM_SUSPEND:
> >>> +    case PSCI_1_0_FN64_SYSTEM_SUSPEND:
> >>> +    {
> >>> +        register_t epoint = PSCI_ARG(regs,1);
> >>> +        register_t cid = PSCI_ARG(regs,2);
> >>
> >> Coding style: For the two lines above, there is a missing space after ",".
> >
> > Sure, will fix — thanks for pointing that out.
> >
> >>
> >> Also, if a 64-bit domain is calling the 32-bit version, then we also
> >> need to ignore the top 32-bits. AFAICT CPU_ON also have the same issue.
> >> I am not going to ask fixing CPU_ON (it would be good though), but I
> >> will at least ask we don't spread the mistake further.
> >
> > Maybe it would be better to return an error in this case?
>
> Why should we return an error? This is valid for a 64-bit domain to use
> SMC32 convention.
I mean — in that case, is it possible that the upper 32 bits are set to
non-zero values without it being an explicit error from the guest?
In my code, the macro used to extract 64-bit values (on 64-bit Xen, of
course) just copies values from the Xn registers directly.
According to the SMC Calling Convention specification:
"System Software on ARM Platforms" (ARM DEN 0028A), we must use Wn
for SMC32 parameters in AArch64.
AFAIK, writing to Wn zeroes the top 32 bits of Xn. So, if the guest
is properly using 32-bit values for arguments, the upper bits will already
be zeroed.
If the guest deliberately writes to the full Xn register --
perhaps to simplify handling of both SMC32 and SMC64, it may set
non-zero upper bits. In that case, I believe this is a guest error.
Finally, from the same document, section 3.1 "Register use in AArch64
SMC calls":
    The same architectural registers, R0–R7, are used for the two AArch64
    calling conventions, SMC32 and SMC64.
    The working size of the register is identified by its name:
        Xn: All 64 bits used.
        Wn: Lower 32 bits used, upper 32 bits are zero.
So, I think we should not ignore the upper 32 bits and should instead
treat non-zero values as an error when handling SMC32 calls from AArch64.
Alternatively, we could simply do nothing and expect the guest to follow
the SMC Calling Convention specification and fill the registers
accordingly.
Best regards,
Mykola
>
> > Should I also add checks for the case where the guest OS is 32-bit but
> > tries to call the 64-bit version of SYSTEM_SUSPEND?
>
> We already have this generic check at the beginning of vsmccc_handle_call().
>
> Cheers,
>
> --
> Julien Grall
>
                
            Hi Mykola, On 02/07/2025 23:27, Mykola Kvach wrote: > On Wed, Jul 2, 2025 at 3:28 PM Julien Grall <julien@xen.org> wrote: >> Why should we return an error? This is valid for a 64-bit domain to use >> SMC32 convention. > > I mean — in that case, is it possible that the upper 32 bits are set to > non-zero values without it being an explicit error from the guest? > > In my code, the macro used to extract 64-bit values (on 64-bit Xen, of > course) just copies values from the Xn registers directly. > > According to the SMC Calling Convention specification: > "System Software on ARM Platforms" (ARM DEN 0028A), we must use Wn > for SMC32 parameters in AArch64. The version A is more than 12 years old. You want to use the latest version. From the SMCCC DEN0028G [1] section 3.1 (Register use in AArch64 SMC and HVC calls): " The working size of the register is identified by its name: • Xn: All 64-bits are used. • Wn: The least significant 32-bits are used, and the most significant 32-bits are zero. Implementations must ignore the most significant bits. " So... > > AFAIK, writing to Wn zeroes the top 32 bits of Xn. So, if the guest > is properly using 32-bit values for arguments, the upper bits will already > be zeroed. ... while the guest should write 0 in the top 32-bit, we should not reject not reject non-zero values nor do nothing. Instead we should ignore the top bits. Also, per the Arm Arm (ARM DDI 0487J.a) page D1-5406, it is implementation defined on whether the top 32-bits are zeroed when the previous exception context was AArch32. Xen will zero them on entry to avoid any surprise (see [2]), but that's only guarantee if this is a 32-bit domain (running either on 64-bit or 32-bit Xen) as SMC can only be called from EL1. As a side note, KVM is also ignoring the top 32-bits (see [3]). Cheers, [1] https://developer.arm.com/documentation/den0028/gbet0/?lang=en [2] https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=32365f3476ac4655f2f26111cd7879912808cd77 [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kvm/psci.c#n223 -- Julien Grall
Hi Julien, On Thu, Jul 3, 2025 at 1:57 AM Julien Grall <julien@xen.org> wrote: > > Hi Mykola, > > On 02/07/2025 23:27, Mykola Kvach wrote: > > On Wed, Jul 2, 2025 at 3:28 PM Julien Grall <julien@xen.org> wrote: > >> Why should we return an error? This is valid for a 64-bit domain to use > >> SMC32 convention. > > > > I mean — in that case, is it possible that the upper 32 bits are set to > > non-zero values without it being an explicit error from the guest? > > > > In my code, the macro used to extract 64-bit values (on 64-bit Xen, of > > course) just copies values from the Xn registers directly. > > > > According to the SMC Calling Convention specification: > > "System Software on ARM Platforms" (ARM DEN 0028A), we must use Wn > > for SMC32 parameters in AArch64. > > The version A is more than 12 years old. You want to use the latest > version. From the SMCCC DEN0028G [1] section 3.1 (Register use in > AArch64 SMC and HVC calls): > > " > The working size of the register is identified by its name: > • Xn: All 64-bits are used. > • Wn: The least significant 32-bits are used, and the most significant > 32-bits are zero. Implementations must > ignore the most significant bits. > " You're right — I should have referred to the latest version of the specification. Thanks for pointing that out, and for the detailed review and explanation. I appreciate your time! > > So... > > > > > AFAIK, writing to Wn zeroes the top 32 bits of Xn. So, if the guest > > is properly using 32-bit values for arguments, the upper bits will already > > be zeroed. > > ... while the guest should write 0 in the top 32-bit, we should not > reject not reject non-zero values nor do nothing. Instead we should > ignore the top bits. > > Also, per the Arm Arm (ARM DDI 0487J.a) page D1-5406, it is > implementation defined on whether the top 32-bits are zeroed when the > previous exception context was AArch32. Xen will zero them on entry to > avoid any surprise (see [2]), but that's only guarantee if this is a > 32-bit domain (running either on 64-bit or 32-bit Xen) as SMC can only > be called from EL1. > > As a side note, KVM is also ignoring the top 32-bits (see [3]). Got it. I'll update the code to ignore the top 32 bits when an AArch64 domain issues SMC32 calls. Thanks for the clarification! > > Cheers, > > [1] https://developer.arm.com/documentation/den0028/gbet0/?lang=en > [2] > https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=32365f3476ac4655f2f26111cd7879912808cd77 > [3] > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kvm/psci.c#n223 > > -- > Julien Grall > Best regards, Mykola
© 2016 - 2025 Red Hat, Inc.