[Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop

Heyi Guo posted 1 patch 5 years, 1 month ago
Test checkpatch passed
Test asan failed
Test docker-mingw@fedora failed
Test docker-clang@ubuntu failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/1552370960-2061-1-git-send-email-guoheyi@huawei.com
Maintainers: Peter Maydell <peter.maydell@linaro.org>
target/arm/cpu.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)
[Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Heyi Guo 5 years, 1 month ago
When we stop a VM for more than 30 seconds and then resume it via the
qemu monitor commands "stop" and "cont", Linux in the VM will complain
of "soft lockup - CPU#x stuck for xxs!" as below:

[ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
[ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
[ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
[ 2783.809563] Modules linked in...

This is because the guest Linux kernel uses the generic timer virtual
counter as its software watchdog clock, and CNTVCT_EL0 does not stop
when the VM is stopped by qemu.

This patch fixes the issue by saving the value of CNTVCT_EL0 when the
VM is stopped and restoring it when the VM is resumed.

Cc: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Heyi Guo <guoheyi@huawei.com>
---
 target/arm/cpu.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 96f0ff0..7bbba3d 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -896,6 +896,60 @@ static void arm_cpu_finalizefn(Object *obj)
 #endif
 }
 
+static int get_vcpu_timer_tick(CPUState *cs, uint64_t *tick_at_pause)
+{
+    int err;
+    struct kvm_one_reg reg;
+
+    reg.id = KVM_REG_ARM_TIMER_CNT;
+    reg.addr = (uintptr_t) tick_at_pause;
+
+    err = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+    return err;
+}
+
+static int set_vcpu_timer_tick(CPUState *cs, uint64_t tick_at_pause)
+{
+    int err;
+    struct kvm_one_reg reg;
+
+    reg.id = KVM_REG_ARM_TIMER_CNT;
+    reg.addr = (uintptr_t) &tick_at_pause;
+
+    err = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+    return err;
+}
+
+static void arch_timer_change_state_handler(void *opaque, int running,
+                                            RunState state)
+{
+    static uint64_t hw_ticks_at_paused;
+    static RunState pre_state = RUN_STATE__MAX;
+    int err;
+    CPUState *cs = (CPUState *)opaque;
+
+    switch (state) {
+    case RUN_STATE_PAUSED:
+        err = get_vcpu_timer_tick(cs, &hw_ticks_at_paused);
+        if (err) {
+            error_report("Get vcpu timer tick failed: %d", err);
+        }
+        break;
+    case RUN_STATE_RUNNING:
+        if (pre_state == RUN_STATE_PAUSED) {
+            err = set_vcpu_timer_tick(cs, hw_ticks_at_paused);
+            if (err) {
+                error_report("Resume vcpu timer tick failed: %d", err);
+            }
+        }
+        break;
+    default:
+        break;
+    }
+
+    pre_state = state;
+}
+
 static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
 {
     CPUState *cs = CPU(dev);
@@ -906,6 +960,12 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
     Error *local_err = NULL;
     bool no_aa32 = false;
 
+    /*
+     * Register the arch timer change state handler only once; KVM keeps
+     * the virtual timers of all VCPUs synchronized.
+     */
+    static bool arch_timer_change_state_handler_added;
+
     /* If we needed to query the host kernel for the CPU features
      * then it's possible that might have failed in the initfn, but
      * this is the first point where we can report it.
@@ -1181,6 +1241,11 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
 
     init_cpreg_list(cpu);
 
+    if (!arch_timer_change_state_handler_added && kvm_enabled()) {
+        qemu_add_vm_change_state_handler(arch_timer_change_state_handler, cs);
+        arch_timer_change_state_handler_added = true;
+    }
+
 #ifndef CONFIG_USER_ONLY
     if (cpu->has_el3 || arm_feature(env, ARM_FEATURE_M_SECURITY)) {
         cs->num_ases = 2;
-- 
1.8.3.1


Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by no-reply@patchew.org 5 years, 1 month ago
Patchew URL: https://patchew.org/QEMU/1552370960-2061-1-git-send-email-guoheyi@huawei.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
time make docker-test-mingw@fedora SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      aarch64-softmmu/trace/generated-helpers.o
  CC      aarch64-softmmu/target/arm/translate-sve.o
/tmp/qemu-test/src/target/arm/cpu.c: In function 'get_vcpu_timer_tick':
/tmp/qemu-test/src/target/arm/cpu.c:902:24: error: storage size of 'reg' isn't known
     struct kvm_one_reg reg;
                        ^~~
/tmp/qemu-test/src/target/arm/cpu.c:904:14: error: 'KVM_REG_ARM_TIMER_CNT' undeclared (first use in this function)
     reg.id = KVM_REG_ARM_TIMER_CNT;
              ^~~~~~~~~~~~~~~~~~~~~
/tmp/qemu-test/src/target/arm/cpu.c:904:14: note: each undeclared identifier is reported only once for each function it appears in
/tmp/qemu-test/src/target/arm/cpu.c:907:30: error: 'KVM_GET_ONE_REG' undeclared (first use in this function)
     err = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
                              ^~~~~~~~~~~~~~~
/tmp/qemu-test/src/target/arm/cpu.c:902:24: error: unused variable 'reg' [-Werror=unused-variable]
     struct kvm_one_reg reg;
                        ^~~
/tmp/qemu-test/src/target/arm/cpu.c: In function 'set_vcpu_timer_tick':
/tmp/qemu-test/src/target/arm/cpu.c:914:24: error: storage size of 'reg' isn't known
     struct kvm_one_reg reg;
                        ^~~
/tmp/qemu-test/src/target/arm/cpu.c:916:14: error: 'KVM_REG_ARM_TIMER_CNT' undeclared (first use in this function)
     reg.id = KVM_REG_ARM_TIMER_CNT;
              ^~~~~~~~~~~~~~~~~~~~~
/tmp/qemu-test/src/target/arm/cpu.c:919:30: error: 'KVM_SET_ONE_REG' undeclared (first use in this function)
     err = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
                              ^~~~~~~~~~~~~~~
/tmp/qemu-test/src/target/arm/cpu.c:914:24: error: unused variable 'reg' [-Werror=unused-variable]
     struct kvm_one_reg reg;
                        ^~~
cc1: all warnings being treated as errors


The full log is available at
http://patchew.org/logs/1552370960-2061-1-git-send-email-guoheyi@huawei.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Heyi Guo 5 years, 1 month ago
Hi all,

I'm sorry this patch failed the docker-mingw@fedora build test. I'm going to move the code to target/arm/kvm.c.

Please ignore this one.
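
For context, the mingw failure is because target/arm/cpu.c is compiled
for every host, including Windows, where the Linux KVM UAPI headers
(struct kvm_one_reg, KVM_REG_ARM_TIMER_CNT, KVM_GET/SET_ONE_REG) do not
exist; target/arm/kvm.c is only built when CONFIG_KVM is set, so those
definitions are always available there. A minimal sketch of what the
relocated getter might look like (placement and function name are
illustrative, not the actual follow-up patch):

    /* target/arm/kvm.c -- only built on Linux hosts with CONFIG_KVM,
     * so <linux/kvm.h> and the arm64 KVM UAPI are available. */
    #include "qemu/osdep.h"
    #include <linux/kvm.h>
    #include "sysemu/kvm.h"

    static int kvm_arm_get_virtual_counter(CPUState *cs, uint64_t *value)
    {
        struct kvm_one_reg reg = {
            .id = KVM_REG_ARM_TIMER_CNT,
            .addr = (uintptr_t)value,
        };

        /* Returns 0 on success, negative errno on failure. */
        return kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
    }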

Thanks,

Heyi


On 2019/3/12 14:09, Heyi Guo wrote:
> When we stop a VM for more than 30 seconds and then resume it, by qemu
> monitor command "stop" and "cont", Linux on VM will complain of "soft
> lockup - CPU#x stuck for xxs!" as below:
>
> [ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
> [ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
> [ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
> [ 2783.809563] Modules linked in...
>
> This is because Guest Linux uses generic timer virtual counter as
> a software watchdog, and CNTVCT_EL0 does not stop when VM is stopped
> by qemu.
>
> This patch is to fix this issue by saving the value of CNTVCT_EL0 when
> stopping and restoring it when resuming.
>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Signed-off-by: Heyi Guo <guoheyi@huawei.com>
> ---
>   target/arm/cpu.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 65 insertions(+)
>
> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> index 96f0ff0..7bbba3d 100644
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -896,6 +896,60 @@ static void arm_cpu_finalizefn(Object *obj)
>   #endif
>   }
>   
> +static int get_vcpu_timer_tick(CPUState *cs, uint64_t *tick_at_pause)
> +{
> +    int err;
> +    struct kvm_one_reg reg;
> +
> +    reg.id = KVM_REG_ARM_TIMER_CNT;
> +    reg.addr = (uintptr_t) tick_at_pause;
> +
> +    err = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
> +    return err;
> +}
> +
> +static int set_vcpu_timer_tick(CPUState *cs, uint64_t tick_at_pause)
> +{
> +    int err;
> +    struct kvm_one_reg reg;
> +
> +    reg.id = KVM_REG_ARM_TIMER_CNT;
> +    reg.addr = (uintptr_t) &tick_at_pause;
> +
> +    err = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
> +    return err;
> +}
> +
> +static void arch_timer_change_state_handler(void *opaque, int running,
> +                                            RunState state)
> +{
> +    static uint64_t hw_ticks_at_paused;
> +    static RunState pre_state = RUN_STATE__MAX;
> +    int err;
> +    CPUState *cs = (CPUState *)opaque;
> +
> +    switch (state) {
> +    case RUN_STATE_PAUSED:
> +        err = get_vcpu_timer_tick(cs, &hw_ticks_at_paused);
> +        if (err) {
> +            error_report("Get vcpu timer tick failed: %d", err);
> +        }
> +        break;
> +    case RUN_STATE_RUNNING:
> +        if (pre_state == RUN_STATE_PAUSED) {
> +            err = set_vcpu_timer_tick(cs, hw_ticks_at_paused);
> +            if (err) {
> +                error_report("Resume vcpu timer tick failed: %d", err);
> +            }
> +        }
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    pre_state = state;
> +}
> +
>   static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
>   {
>       CPUState *cs = CPU(dev);
> @@ -906,6 +960,12 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
>       Error *local_err = NULL;
>       bool no_aa32 = false;
>   
> +    /*
> +     * Only add change state handler for arch timer once, for KVM will help to
> +     * synchronize virtual timer of all VCPUs.
> +     */
> +    static bool arch_timer_change_state_handler_added;
> +
>       /* If we needed to query the host kernel for the CPU features
>        * then it's possible that might have failed in the initfn, but
>        * this is the first point where we can report it.
> @@ -1181,6 +1241,11 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
>   
>       init_cpreg_list(cpu);
>   
> +    if (!arch_timer_change_state_handler_added && kvm_enabled()) {
> +        qemu_add_vm_change_state_handler(arch_timer_change_state_handler, cs);
> +        arch_timer_change_state_handler_added = true;
> +    }
> +
>   #ifndef CONFIG_USER_ONLY
>       if (cpu->has_el3 || arm_feature(env, ARM_FEATURE_M_SECURITY)) {
>           cs->num_ases = 2;



Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Peter Maydell 5 years, 1 month ago
On Tue, 12 Mar 2019 at 06:10, Heyi Guo <guoheyi@huawei.com> wrote:
>
> When we stop a VM for more than 30 seconds and then resume it, by qemu
> monitor command "stop" and "cont", Linux on VM will complain of "soft
> lockup - CPU#x stuck for xxs!" as below:
>
> [ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
> [ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
> [ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
> [ 2783.809563] Modules linked in...
>
> This is because Guest Linux uses generic timer virtual counter as
> a software watchdog, and CNTVCT_EL0 does not stop when VM is stopped
> by qemu.
>
> This patch is to fix this issue by saving the value of CNTVCT_EL0 when
> stopping and restoring it when resuming.

Hi -- I know we have issues with the passage of time in Arm VMs
running under KVM when the VM is suspended, but the topic is
a tricky one, and it's not clear to me that this is the correct
way to fix it. I would prefer to see us start with a discussion
on the kvm-arm mailing list about the best approach to the problem.

I've cc'd that list and a couple of the Arm KVM maintainers
for their opinion.

QEMU patch left below for context -- the brief summary is that
it uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register
to save it on VM pause and write that value back on VM resume.

thanks
-- PMM



> Cc: Peter Maydell <peter.maydell@linaro.org>
> Signed-off-by: Heyi Guo <guoheyi@huawei.com>
> ---
>  target/arm/cpu.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
>
> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> index 96f0ff0..7bbba3d 100644
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -896,6 +896,60 @@ static void arm_cpu_finalizefn(Object *obj)
>  #endif
>  }
>
> +static int get_vcpu_timer_tick(CPUState *cs, uint64_t *tick_at_pause)
> +{
> +    int err;
> +    struct kvm_one_reg reg;
> +
> +    reg.id = KVM_REG_ARM_TIMER_CNT;
> +    reg.addr = (uintptr_t) tick_at_pause;
> +
> +    err = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
> +    return err;
> +}
> +
> +static int set_vcpu_timer_tick(CPUState *cs, uint64_t tick_at_pause)
> +{
> +    int err;
> +    struct kvm_one_reg reg;
> +
> +    reg.id = KVM_REG_ARM_TIMER_CNT;
> +    reg.addr = (uintptr_t) &tick_at_pause;
> +
> +    err = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
> +    return err;
> +}
> +
> +static void arch_timer_change_state_handler(void *opaque, int running,
> +                                            RunState state)
> +{
> +    static uint64_t hw_ticks_at_paused;
> +    static RunState pre_state = RUN_STATE__MAX;
> +    int err;
> +    CPUState *cs = (CPUState *)opaque;
> +
> +    switch (state) {
> +    case RUN_STATE_PAUSED:
> +        err = get_vcpu_timer_tick(cs, &hw_ticks_at_paused);
> +        if (err) {
> +            error_report("Get vcpu timer tick failed: %d", err);
> +        }
> +        break;
> +    case RUN_STATE_RUNNING:
> +        if (pre_state == RUN_STATE_PAUSED) {
> +            err = set_vcpu_timer_tick(cs, hw_ticks_at_paused);
> +            if (err) {
> +                error_report("Resume vcpu timer tick failed: %d", err);
> +            }
> +        }
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    pre_state = state;
> +}
> +
>  static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
>  {
>      CPUState *cs = CPU(dev);
> @@ -906,6 +960,12 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
>      Error *local_err = NULL;
>      bool no_aa32 = false;
>
> +    /*
> +     * Only add change state handler for arch timer once, for KVM will help to
> +     * synchronize virtual timer of all VCPUs.
> +     */
> +    static bool arch_timer_change_state_handler_added;
> +
>      /* If we needed to query the host kernel for the CPU features
>       * then it's possible that might have failed in the initfn, but
>       * this is the first point where we can report it.
> @@ -1181,6 +1241,11 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
>
>      init_cpreg_list(cpu);
>
> +    if (!arch_timer_change_state_handler_added && kvm_enabled()) {
> +        qemu_add_vm_change_state_handler(arch_timer_change_state_handler, cs);
> +        arch_timer_change_state_handler_added = true;
> +    }
> +
>  #ifndef CONFIG_USER_ONLY
>      if (cpu->has_el3 || arm_feature(env, ARM_FEATURE_M_SECURITY)) {
>          cs->num_ases = 2;
> --
> 1.8.3.1

Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Christoffer Dall 5 years, 1 month ago
[Adding Steven Price, who has recently looked at this, in cc]

On Tue, Mar 12, 2019 at 10:08:47AM +0000, Peter Maydell wrote:
> On Tue, 12 Mar 2019 at 06:10, Heyi Guo <guoheyi@huawei.com> wrote:
> >
> > When we stop a VM for more than 30 seconds and then resume it, by qemu
> > monitor command "stop" and "cont", Linux on VM will complain of "soft
> > lockup - CPU#x stuck for xxs!" as below:
> >
> > [ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
> > [ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
> > [ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
> > [ 2783.809563] Modules linked in...
> >
> > This is because Guest Linux uses generic timer virtual counter as
> > a software watchdog, and CNTVCT_EL0 does not stop when VM is stopped
> > by qemu.
> >
> > This patch is to fix this issue by saving the value of CNTVCT_EL0 when
> > stopping and restoring it when resuming.
> 
> Hi -- I know we have issues with the passage of time in Arm VMs
> running under KVM when the VM is suspended, but the topic is
> a tricky one, and it's not clear to me that this is the correct
> way to fix it. I would prefer to see us start with a discussion
> on the kvm-arm mailing list about the best approach to the problem.
> 
> I've cc'd that list and a couple of the Arm KVM maintainers
> for their opinion.
> 
> QEMU patch left below for context -- the brief summary is that
> it uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register
> to save it on VM pause and write that value back on VM resume.
> 
> thanks
> -- PMM
> 
> 
> 
> > Cc: Peter Maydell <peter.maydell@linaro.org>
> > Signed-off-by: Heyi Guo <guoheyi@huawei.com>
> > ---
> >  target/arm/cpu.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 65 insertions(+)
> >
> > diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> > index 96f0ff0..7bbba3d 100644
> > --- a/target/arm/cpu.c
> > +++ b/target/arm/cpu.c
> > @@ -896,6 +896,60 @@ static void arm_cpu_finalizefn(Object *obj)
> >  #endif
> >  }
> >
> > +static int get_vcpu_timer_tick(CPUState *cs, uint64_t *tick_at_pause)
> > +{
> > +    int err;
> > +    struct kvm_one_reg reg;
> > +
> > +    reg.id = KVM_REG_ARM_TIMER_CNT;
> > +    reg.addr = (uintptr_t) tick_at_pause;
> > +
> > +    err = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
> > +    return err;
> > +}
> > +
> > +static int set_vcpu_timer_tick(CPUState *cs, uint64_t tick_at_pause)
> > +{
> > +    int err;
> > +    struct kvm_one_reg reg;
> > +
> > +    reg.id = KVM_REG_ARM_TIMER_CNT;
> > +    reg.addr = (uintptr_t) &tick_at_pause;
> > +
> > +    err = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
> > +    return err;
> > +}
> > +
> > +static void arch_timer_change_state_handler(void *opaque, int running,
> > +                                            RunState state)
> > +{
> > +    static uint64_t hw_ticks_at_paused;
> > +    static RunState pre_state = RUN_STATE__MAX;
> > +    int err;
> > +    CPUState *cs = (CPUState *)opaque;
> > +
> > +    switch (state) {
> > +    case RUN_STATE_PAUSED:
> > +        err = get_vcpu_timer_tick(cs, &hw_ticks_at_paused);
> > +        if (err) {
> > +            error_report("Get vcpu timer tick failed: %d", err);
> > +        }
> > +        break;
> > +    case RUN_STATE_RUNNING:
> > +        if (pre_state == RUN_STATE_PAUSED) {
> > +            err = set_vcpu_timer_tick(cs, hw_ticks_at_paused);
> > +            if (err) {
> > +                error_report("Resume vcpu timer tick failed: %d", err);
> > +            }
> > +        }
> > +        break;
> > +    default:
> > +        break;
> > +    }
> > +
> > +    pre_state = state;
> > +}
> > +
> >  static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
> >  {
> >      CPUState *cs = CPU(dev);
> > @@ -906,6 +960,12 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
> >      Error *local_err = NULL;
> >      bool no_aa32 = false;
> >
> > +    /*
> > +     * Only add change state handler for arch timer once, for KVM will help to
> > +     * synchronize virtual timer of all VCPUs.
> > +     */
> > +    static bool arch_timer_change_state_handler_added;
> > +
> >      /* If we needed to query the host kernel for the CPU features
> >       * then it's possible that might have failed in the initfn, but
> >       * this is the first point where we can report it.
> > @@ -1181,6 +1241,11 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
> >
> >      init_cpreg_list(cpu);
> >
> > +    if (!arch_timer_change_state_handler_added && kvm_enabled()) {
> > +        qemu_add_vm_change_state_handler(arch_timer_change_state_handler, cs);
> > +        arch_timer_change_state_handler_added = true;
> > +    }
> > +
> >  #ifndef CONFIG_USER_ONLY
> >      if (cpu->has_el3 || arm_feature(env, ARM_FEATURE_M_SECURITY)) {
> >          cs->num_ases = 2;
> > --
> > 1.8.3.1

Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Marc Zyngier 5 years, 1 month ago
Hi Peter,

On 12/03/2019 10:08, Peter Maydell wrote:
> On Tue, 12 Mar 2019 at 06:10, Heyi Guo <guoheyi@huawei.com> wrote:
>>
>> When we stop a VM for more than 30 seconds and then resume it, by qemu
>> monitor command "stop" and "cont", Linux on VM will complain of "soft
>> lockup - CPU#x stuck for xxs!" as below:
>>
>> [ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
>> [ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
>> [ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
>> [ 2783.809563] Modules linked in...
>>
>> This is because Guest Linux uses generic timer virtual counter as
>> a software watchdog, and CNTVCT_EL0 does not stop when VM is stopped
>> by qemu.
>>
>> This patch is to fix this issue by saving the value of CNTVCT_EL0 when
>> stopping and restoring it when resuming.
> 
> Hi -- I know we have issues with the passage of time in Arm VMs
> running under KVM when the VM is suspended, but the topic is
> a tricky one, and it's not clear to me that this is the correct
> way to fix it. I would prefer to see us start with a discussion
> on the kvm-arm mailing list about the best approach to the problem.
> 
> I've cc'd that list and a couple of the Arm KVM maintainers
> for their opinion.
> 
> QEMU patch left below for context -- the brief summary is that
> it uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register
> to save it on VM pause and write that value back on VM resume.

That's probably good enough for this particular use case, but I think
there is more. I can get into similar trouble if I suspend my laptop, or
suspend QEMU. It also has the slightly bizarre effect of skewing time,
and this will affect timekeeping in the guest in ways that are much more
subtle than just shouty CPUs.

Christoffer and Steve had some stuff regarding Live Physical Time, which
should cover that, and other cases such as host system suspend, and QEMU
being suspended.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Heyi Guo 5 years, 1 month ago
Dear all,

Really appreciate your comments and information. For "Live Physical Time", is there any document that tells the whole story? And may I know the timeline for making it happen?

Thanks,

Heyi


On 2019/3/12 22:12, Marc Zyngier wrote:
> Hi Peter,
>
> On 12/03/2019 10:08, Peter Maydell wrote:
>> On Tue, 12 Mar 2019 at 06:10, Heyi Guo <guoheyi@huawei.com> wrote:
>>> When we stop a VM for more than 30 seconds and then resume it, by qemu
>>> monitor command "stop" and "cont", Linux on VM will complain of "soft
>>> lockup - CPU#x stuck for xxs!" as below:
>>>
>>> [ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
>>> [ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
>>> [ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
>>> [ 2783.809563] Modules linked in...
>>>
>>> This is because Guest Linux uses generic timer virtual counter as
>>> a software watchdog, and CNTVCT_EL0 does not stop when VM is stopped
>>> by qemu.
>>>
>>> This patch is to fix this issue by saving the value of CNTVCT_EL0 when
>>> stopping and restoring it when resuming.
>> Hi -- I know we have issues with the passage of time in Arm VMs
>> running under KVM when the VM is suspended, but the topic is
>> a tricky one, and it's not clear to me that this is the correct
>> way to fix it. I would prefer to see us start with a discussion
>> on the kvm-arm mailing list about the best approach to the problem.
>>
>> I've cc'd that list and a couple of the Arm KVM maintainers
>> for their opinion.
>>
>> QEMU patch left below for context -- the brief summary is that
>> it uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register
>> to save it on VM pause and write that value back on VM resume.
> That's probably good enough for this particular use case, but I think
> there is more. I can get into similar trouble if I suspend my laptop, or
> suspend QEMU. It also has the slightly bizarre effect of skewing time,
> and this will affect timekeeping in the guest in ways that are much more
> subtle than just shouty CPUs.
>
> Christoffer and Steve had some stuff regarding Live Physical Time, which
> should cover that, and other cases such as host system suspend, and QEMU
> being suspended.
>
> Thanks,
>
> 	M.



Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Steven Price 5 years, 1 month ago
On 13/03/2019 01:57, Heyi Guo wrote:
> Dear all,
> 
> Really appreciate your comments and information. For "Live Physical
> Time", is there any document to tell the whole story? And may I know the
> timeline to make it happen?

The documentation is available here:
https://developer.arm.com/docs/den0057/a

I also posted an RFC for Linux support last year:
http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620338.html

In terms of timeline I don't have anything definite. We need to get the
spec nailed down before much progress can be made on the code. And at
the moment Live Physical Time (LPT) isn't that useful on its own
because there are a number of other issues with migrating an arm64 guest
from one host to another which has different hardware (e.g. handling
different CPU errata).

Steve

Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Steven Price 5 years, 1 month ago
On 12/03/2019 14:12, Marc Zyngier wrote:
> Hi Peter,
> 
> On 12/03/2019 10:08, Peter Maydell wrote:
>> On Tue, 12 Mar 2019 at 06:10, Heyi Guo <guoheyi@huawei.com> wrote:
>>>
>>> When we stop a VM for more than 30 seconds and then resume it, by qemu
>>> monitor command "stop" and "cont", Linux on VM will complain of "soft
>>> lockup - CPU#x stuck for xxs!" as below:
>>>
>>> [ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
>>> [ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
>>> [ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
>>> [ 2783.809563] Modules linked in...
>>>
>>> This is because Guest Linux uses generic timer virtual counter as
>>> a software watchdog, and CNTVCT_EL0 does not stop when VM is stopped
>>> by qemu.
>>>
>>> This patch is to fix this issue by saving the value of CNTVCT_EL0 when
>>> stopping and restoring it when resuming.

An alternative way of fixing this particular issue ("stop"/"cont"
commands in QEMU) would be to wire up KVM_KVMCLOCK_CTRL for arm to allow
QEMU to signal to the guest that it was forcibly stopped for a while
(and so the watchdog timeout can be ignored by the guest).
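
On x86 this is driven from a vm-change-state handler in QEMU when the
VM is resumed; a rough sketch of that existing pattern (cf.
hw/i386/kvm/clock.c -- the ioctl and the guest-side flag would both
need wiring up for arm before anything like this could work):

    /* Sketch of the x86 pattern: on "cont", mark every vCPU as having
     * been deliberately stopped, so the guest's softlockup detector
     * sees PVCLOCK_GUEST_STOPPED and skips the apparent stall. */
    static void kvmclock_signal_guest_stopped(void)
    {
        CPUState *cpu;

        CPU_FOREACH(cpu) {
            /* The third argument is unused by this ioctl. */
            kvm_vcpu_ioctl(cpu, KVM_KVMCLOCK_CTRL, 0);
        }
    }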

>> Hi -- I know we have issues with the passage of time in Arm VMs
>> running under KVM when the VM is suspended, but the topic is
>> a tricky one, and it's not clear to me that this is the correct
>> way to fix it. I would prefer to see us start with a discussion
>> on the kvm-arm mailing list about the best approach to the problem.
>>
>> I've cc'd that list and a couple of the Arm KVM maintainers
>> for their opinion.
>>
>> QEMU patch left below for context -- the brief summary is that
>> it uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register
>> to save it on VM pause and write that value back on VM resume.
> 
> That's probably good enough for this particular use case, but I think
> there is more. I can get into similar trouble if I suspend my laptop, or
> suspend QEMU. It also has the slightly bizarre effect of skewing time,
> and this will affect timekeeping in the guest in ways that are much more
> subtle than just shouty CPUs.

Indeed this is the bigger issue - user space doesn't get an opportunity
to be involved when suspending/resuming, so saving/restoring (or using
KVM_KVMCLOCK_CTRL) in user space won't fix these cases.

> Christoffer and Steve had some stuff regarding Live Physical Time, which
> should cover that, and other cases such as host system suspend, and QEMU
> being suspended.

Live Physical Time (LPT) is only part of the solution - this handles the
mess that otherwise would occur when moving to a new host with a
different clock frequency.

Personally I think what we need is:

* Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0) or
alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
firing when user space explicitly stops scheduling the guest for a while.

* KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
guest doesn't see time pass during a suspend.

* Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
the guest to query the wall clock time from the host and provides an
offset between CNTVCT_EL0 and wall clock time which KVM can update
during suspend/resume. This means that during a suspend/resume the guest
can observe that wall clock time has passed, without having to be
bothered about CNTVCT_EL0 jumping forwards.
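
For reference, the x86 ABI being used as the model here looks roughly
like this on the guest side (sketch; see
arch/x86/include/asm/pvclock-abi.h for the real definition):

    /* The guest writes the guest-physical address of this structure to
     * MSR_KVM_WALL_CLOCK_NEW; the host fills in the wall-clock time at
     * the moment of the write. The guest derives current wall-clock
     * time by adding elapsed pvclock time, and the host can refresh
     * the offset across suspend/resume. */
    struct pvclock_wall_clock {
        uint32_t version;   /* bumped while the host updates the data */
        uint32_t sec;
        uint32_t nsec;
    } __attribute__((__packed__));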

Steve

Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Heyi Guo 5 years, 1 month ago
Hi Peter and Steven,


On 2019/3/13 18:11, Steven Price wrote:
> On 12/03/2019 14:12, Marc Zyngier wrote:
>> Hi Peter,
>>
>> On 12/03/2019 10:08, Peter Maydell wrote:
>>> On Tue, 12 Mar 2019 at 06:10, Heyi Guo <guoheyi@huawei.com> wrote:
>>>> When we stop a VM for more than 30 seconds and then resume it, by qemu
>>>> monitor command "stop" and "cont", Linux on VM will complain of "soft
>>>> lockup - CPU#x stuck for xxs!" as below:
>>>>
>>>> [ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
>>>> [ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
>>>> [ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
>>>> [ 2783.809563] Modules linked in...
>>>>
>>>> This is because Guest Linux uses generic timer virtual counter as
>>>> a software watchdog, and CNTVCT_EL0 does not stop when VM is stopped
>>>> by qemu.
>>>>
>>>> This patch is to fix this issue by saving the value of CNTVCT_EL0 when
>>>> stopping and restoring it when resuming.
> An alternative way of fixing this particular issue ("stop"/"cont"
> commands in QEMU) would be to wire up KVM_KVMCLOCK_CTRL for arm to allow
> QEMU to signal to the guest that it was forcibly stopped for a while
> (and so the watchdog timeout can be ignored by the guest).
>
>>> Hi -- I know we have issues with the passage of time in Arm VMs
>>> running under KVM when the VM is suspended, but the topic is
>>> a tricky one, and it's not clear to me that this is the correct
>>> way to fix it. I would prefer to see us start with a discussion
>>> on the kvm-arm mailing list about the best approach to the problem.
>>>
>>> I've cc'd that list and a couple of the Arm KVM maintainers
>>> for their opinion.
>>>
>>> QEMU patch left below for context -- the brief summary is that
>>> it uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register
>>> to save it on VM pause and write that value back on VM resume.
>> That's probably good enough for this particular use case, but I think
>> there is more. I can get into similar trouble if I suspend my laptop, or
>> suspend QEMU. It also has the slightly bizarre effect of skewing time,
>> and this will affect timekeeping in the guest in ways that are much more
>> subtle than just shouty CPUs.
> Indeed this is the bigger issue - user space doesn't get an opportunity
> to be involved when suspending/resuming, so saving/restoring (or using
> KVM_KVMCLOCK_CTRL) in user space won't fix these cases.
>
>> Christoffer and Steve had some stuff regarding Live Physical Time, which
>> should cover that, and other cases such as host system suspend, and QEMU
>> being suspended.
> Live Physical Time (LPT) is only part of the solution - this handles the
> mess that otherwise would occur when moving to a new host with a
> different clock frequency.
>
> Personally I think what we need is:
>
> * Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0) or
> alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
> firing when user space explicitly stops scheduling the guest for a while.
Per Steven's comments, the above change is still needed even when LPT is implemented, so does it make sense for us to apply this patch first? At least it fixes something and does not conflict with the future solution. Or do you think it is better to use KVM_KVMCLOCK_CTRL?

Thanks,
Heyi

>
> * KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
> guest doesn't see time pass during a suspend.
>
> * Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
> the guest to query the wall clock time from the host and provides an
> offset between CNTVCT_EL0 to wall clock time which the KVM can update
> during suspend/resume. This means that during a suspend/resume the guest
> can observe that wall clock time has passed, without having to be
> bothered about CNTVCT_EL0 jumping forwards.
>
> Steve
>
> .
>



Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Christoffer Dall 5 years, 1 month ago
Hi Steve,

On Wed, Mar 13, 2019 at 10:11:30AM +0000, Steven Price wrote:
> 
> Personally I think what we need is:
> 
> * Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0) or
> alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
> firing when user space explicitly stops scheduling the guest for a while.

If we save/restore CNTVCT_EL0 and the warning goes away, does the guest
wall clock timekeeping get all confused and does it figure this out
automagically somehow?

Does KVM_KVMCLOCK_CTRL solve that problem?

> 
> * KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
> guest doesn't see time pass during a suspend.

This smells like policy to me so I'd much prefer keeping as much
functionality in user space as possible.  If we already have the APIs we
need from KVM, let's use them.

> 
> * Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
> the guest to query the wall clock time from the host and provides an
> offset between CNTVCT_EL0 to wall clock time which the KVM can update
> during suspend/resume. This means that during a suspend/resume the guest
> can observe that wall clock time has passed, without having to be
> bothered about CNTVCT_EL0 jumping forwards.
> 

Isn't the proper Arm architectural solution for this to read the
physical counter for wall clock timekeeping?

(Yes that will require a trap on physical counter reads after migration
on current systems, but migration sucks in terms of timekeeping
already.)


Thanks,

    Christoffer

Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Heyi Guo 5 years, 1 month ago
Hi Christoffer and Steve,


On 2019/3/15 16:59, Christoffer Dall wrote:
> Hi Steve,
>
> On Wed, Mar 13, 2019 at 10:11:30AM +0000, Steven Price wrote:
>> Personally I think what we need is:
>>
>> * Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0) or
>> alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
>> firing when user space explicitly stops scheduling the guest for a while.
> If we save/restore CNTVCT_EL0 and the warning goes away, does the guest
> wall clock timekeeping get all confused and does it figure this out
> automagically somehow?
What's the source for guest wall clock timekeeping in the current ARM64 implementation? Is it the value from CNTP_TVAL_EL0? Does the guest OS keep track of this? Or is the wall clock simply the platform RTC?

I tested the RFC change and did not see anything unusual. Did I miss something?

>
> Does KVM_KVMCLOCK_CTRL solve that problem?
It seems KVM_KVMCLOCK_CTRL is dedicated to the guest pause/resume scenario, but it relies on pvclock in both KVM and the guest, and right now only x86 supports it :(

Could Steve provide more insights about the whole thing?

Thanks,
Heyi

>
>> * KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
>> guest doesn't see time pass during a suspend.
> This smells like policy to me so I'd much prefer keeping as much
> functionality in user space as possible.  If we already have the APIs we
> need from KVM, let's use them.
>
>> * Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
>> the guest to query the wall clock time from the host and provides an
>> offset between CNTVCT_EL0 to wall clock time which the KVM can update
>> during suspend/resume. This means that during a suspend/resume the guest
>> can observe that wall clock time has passed, without having to be
>> bothered about CNTVCT_EL0 jumping forwards.
>>
> Isn't the proper Arm architectural solution for this to read the
> physical counter for wall clock time keeping ?
>
> (Yes that will require a trap on physical counter reads after migration
> on current systems, but migration sucks in terms of timekeeping
> already.)
>
>
> Thanks,
>
>      Christoffer
>
> .
>



Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Steven Price 5 years, 1 month ago
On 19/03/2019 14:39, Heyi Guo wrote:
> Hi Christoffer and Steve,
> 
> 
> On 2019/3/15 16:59, Christoffer Dall wrote:
>> Hi Steve,
>>
>> On Wed, Mar 13, 2019 at 10:11:30AM +0000, Steven Price wrote:
>>> Personally I think what we need is:
>>>
>>> * Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0) or
>>> alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
>>> firing when user space explicitly stops scheduling the guest for a
>>> while.
>> If we save/restore CNTVCT_EL0 and the warning goes away, does the guest
>> wall clock timekeeping get all confused and does it figure this out
>> automagically somehow?
> What's the source for guest wall clock timekeeping in current ARM64
> implementation? Is it the value from CNTP_TVAL_EL0? Will guest OS keep
> track of this? Or is the wall clock simply platform RTC?
> 
> I tested the RFC change and did not see anything unusual. Did I miss
> something?

Are you running ARM or ARM64? I'm assuming ARM64 here...

Unless I'm mistaken you should see the time reported by the guest not
progress when you have stopped it using the QEMU monitor console.

Running something like "while /bin/true; do date; sleep 1; done" should
allow you to see that without the patch time will jump in the guest
(i.e. the time reflects wall-clock time). With the patch I believe it
will not jump (i.e. the clock will be behind wall-clock time after the
pause).

However, this behaviour does depend on the exact system being emulated.
From drivers/clocksource/arm_arch_timer.c:

> static void __init arch_counter_register(unsigned type)
> {
> 	u64 start_count;
> 
> 	/* Register the CP15 based counter if we have one */
> 	if (type & ARCH_TIMER_TYPE_CP15) {
> 		if ((IS_ENABLED(CONFIG_ARM64) && !is_hyp_mode_available()) ||
> 		    arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI)
> 			arch_timer_read_counter = arch_counter_get_cntvct;
> 		else
> 			arch_timer_read_counter = arch_counter_get_cntpct;
> 
> 		clocksource_counter.archdata.vdso_direct = vdso_default;
> 	} else {
> 		arch_timer_read_counter = arch_counter_get_cntvct_mem;
> 	}

So we can see here that there are three functions for reading the
counter - there's the memory interface for CNTVCT, the "CP15" interface
also for CNTVCT, and an interface for CNTPCT. CNTPCT is only used for
!ARM64 or if hyp mode is available (not relevant until nested
virtualisation is added).

The upshot is that on arm64 we're using the virtual counter (CNTVCT).

>>
>> Does KVM_KVMCLOCK_CTRL solve that problem?
> It seems KVM_KVMCLOCK_CTRL is dedicated for guest pause/resume scenario,
> but it relies on pvclock both in KVM and Guest and right now only X86
> supports it :(

KVM_KVMCLOCK_CTRL simply provides a mechanism for the host to inform the
guest that a vCPU hasn't been scheduled for "a while". The guest can
then use that information to avoid triggering the watchdog timeout. As
you note it is only currently implemented for X86.
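
The guest-side consumption is straightforward; a condensed sketch of
the x86 flow (cf. kernel/watchdog.c; kvm_check_and_clear_guest_paused()
and touch_softlockup_watchdog() are the real kernel functions, while
report_lockup() is a stand-in for the actual warning path):

    static void softlockup_check(unsigned long now, unsigned long touch_ts,
                                 unsigned long thresh)
    {
        /* Reads and clears the PVCLOCK_GUEST_STOPPED flag that
         * KVM_KVMCLOCK_CTRL set from the host side. */
        if (kvm_check_and_clear_guest_paused()) {
            touch_softlockup_watchdog();    /* host paused us: re-arm */
            return;
        }
        if (time_after(now, touch_ts + thresh)) {
            report_lockup();                /* genuine stall */
        }
    }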

> Could Steve provide more insights about the whole thing?

I'll try - see below.

> Thanks,
> Heyi
> 
>>
>>> * KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
>>> guest doesn't see time pass during a suspend.
>> This smells like policy to me so I'd much prefer keeping as much
>> functionality in user space as possible.  If we already have the APIs we
>> need from KVM, let's use them.

The problem with suspend/resume is that user space doesn't get a
look-in. There's no way (generically) for a user space application to
save/restore registers during the suspend. User space simply sees time
jump forwards - which is precisely what we're trying to hide from the
guest (as it otherwise gets upset that it appears a vCPU has deadlocked
for a while).

So while I agree this is "policy", it's something we need the kernel to
do on our behalf. Potentially we can put it behind an ioctl so that user
space can opt in to it.

>>> * Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
>>> the guest to query the wall clock time from the host and provides an
>>> offset between CNTVCT_EL0 to wall clock time which the KVM can update
>>> during suspend/resume. This means that during a suspend/resume the guest
>>> can observe that wall clock time has passed, without having to be
>>> bothered about CNTVCT_EL0 jumping forwards.
>>>
>> Isn't the proper Arm architectural solution for this to read the
>> physical counter for wall clock time keeping ?
>>
>> (Yes that will require a trap on physical counter reads after migration
>> on current systems, but migration sucks in terms of timekeeping
>> already.)

The physical counter is certainly tempting as it largely does what we
want as a secondary counter. However I'm wary of using it because it
starts to get "really interesting" when dealing with nested
virtualisation. And by "really interesting" I of course mean horribly
broken :)

Basically the problem is that with nested virtualisation the offset
between the physical counter and the virtual counter is controlled by
the virtual-EL2. Because we want certain behaviours of the virtual
counter (pausing when the guest pauses) we have to replicate that onto
the physical counter to maintain the offset between the two that the
guest expects.
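
(Architecturally the relation is CNTVCT_EL0 = CNTPCT_EL0 - CNTVOFF_EL2,
and with nested virtualisation CNTVOFF_EL2 is owned by the virtual EL2,
which is exactly why the host can't freely adjust one counter without
disturbing the offset the guest hypervisor has programmed.)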

Given that it seems to be a non-starter when you introduce nested virt I
think we should just bite the bullet and implement a PV wall clock
mechanism for arm64 (similar to MSR_KVM_WALL_CLOCK_NEW).

This also has the advantage that it is going to be faster than the
physical counter when that has to be trapped (e.g. after migration).

Steve

Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Heyi Guo 5 years ago
Hi Steven,

Thanks for your explanation. Please see my comments below.


On 2019/3/21 0:29, Steven Price wrote:
> On 19/03/2019 14:39, Heyi Guo wrote:
>> Hi Christoffer and Steve,
>>
>>
>> On 2019/3/15 16:59, Christoffer Dall wrote:
>>> Hi Steve,
>>>
>>> On Wed, Mar 13, 2019 at 10:11:30AM +0000, Steven Price wrote:
>>>> Personally I think what we need is:
>>>>
>>>> * Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0) or
>>>> alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
>>>> firing when user space explicitly stops scheduling the guest for a
>>>> while.
>>> If we save/restore CNTVCT_EL0 and the warning goes away, does the guest
>>> wall clock timekeeping get all confused and does it figure this out
>>> automagically somehow?
>> What's the source for guest wall clock timekeeping in current ARM64
>> implementation? Is it the value from CNTP_TVAL_EL0? Will guest OS keep
>> track of this? Or is the wall clock simply platform RTC?
>>
>> I tested the RFC change and did not see anything unusual. Did I miss
>>> something?
> Are you running ARM or ARM64? I'm assuming ARM64 here...
Yes, definitely :)

>
> Unless I'm mistaken you should see the time reported by the guest not
> progress when you have stopped it using the QEMU monitor console.
>
> Running something like "while /bin/true; do date; sleep 1; done" should
> allow you to see that without the patch time will jump in the guest
> (i.e. the time reflects wall-clock time). With the patch I believe it
> will not jump (i.e. the clock will be behind wall-clock time after the
> pause).
I did the test and the result was exactly like you said.
>
> However, this behaviour does depend on the exact system being emulated.
> From drivers/clocksource/arm_arch_timer.c:
>
>> static void __init arch_counter_register(unsigned type)
>> {
>> 	u64 start_count;
>>
>> 	/* Register the CP15 based counter if we have one */
>> 	if (type & ARCH_TIMER_TYPE_CP15) {
>> 		if ((IS_ENABLED(CONFIG_ARM64) && !is_hyp_mode_available()) ||
>> 		    arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI)
>> 			arch_timer_read_counter = arch_counter_get_cntvct;
>> 		else
>> 			arch_timer_read_counter = arch_counter_get_cntpct;
>>
>> 		clocksource_counter.archdata.vdso_direct = vdso_default;
>> 	} else {
>> 		arch_timer_read_counter = arch_counter_get_cntvct_mem;
>> 	}
> So we can see here that there are three functions for reading the
> counter - there's the memory interface for CNTVCT, the "CP15" interface
> also for CNTVCT, and an interface for CNTPCT. CNTPCT is only used for
> !ARM64 or if hyp mode is available (not relevant until nested
> virtualisation is added).
>
> The upshot is that on arm64 we're using the virtual counter (CNTVCT).
>
>>> Does KVM_KVMCLOCK_CTRL solve that problem?
>> It seems KVM_KVMCLOCK_CTRL is dedicated for guest pause/resume scenario,
>> but it relies on pvclock both in KVM and Guest and right now only X86
>> supports it :(
> KVM_KVMCLOCK_CTRL simply provides a mechanism for the host to inform the
> guest that a vCPU hasn't been scheduled for "a while". The guest can
> then use that information to avoid triggering the watchdog timeout. As
> you note it is only currently implemented for X86.
>
>> Could Steve provide more insights about the whole thing?
> I'll try - see below.
>
>> Thanks,
>> Heyi
>>
>>>> * KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
>>>> guest doesn't see time pass during a suspend.
>>> This smells like policy to me so I'd much prefer keeping as much
>>> functionality in user space as possible.  If we already have the APIs we
>>> need from KVM, let's use them.
> The problem with suspend/resume is that user space doesn't get a
> look-in. There's no way (generically) for a user space application to
> save/restore registers during the suspend. User space simply sees time
> jump forwards - which is precisely what we're trying to hide from the
> guest (as it otherwise gets upset that it appears a vCPU has deadlocked
> for a while).
>
> So while I agree this is "policy", it's something we need the kernel to
> do on our behalf. Potentially we can put it behind an ioctl so that user
> space can opt-in to it.
>
>>>> * Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
>>>> the guest to query the wall clock time from the host and provides an
>>>> offset between CNTVCT_EL0 to wall clock time which the KVM can update
>>>> during suspend/resume. This means that during a suspend/resume the guest
>>>> can observe that wall clock time has passed, without having to be
>>>> bothered about CNTVCT_EL0 jumping forwards.
>>>>
>>> Isn't the proper Arm architectural solution for this to read the
>>> physical counter for wall clock time keeping ?
>>>
>>> (Yes that will require a trap on physical counter reads after migration
>>> on current systems, but migration sucks in terms of timekeeping
>>> already.)
> The physical counter is certainly tempting as it largely does what we
> want as a secondary counter. However I'm wary of using it because it
> starts to get "really interesting" when dealing with nested
> virtualisation. And by "really interesting" I of course mean horribly
> broken :)
>
> Basically the problem is that with nested virtualisation the offset
> between the physical counter and the virtual counter is controlled by
> the virtual-EL2. Because we want certain behaviours of the virtual
> counter (pausing when the guest pauses) we have to replicate that onto
> the physical counter to maintain the offset between the two that the
> guest expects.
>
> Given that it seems to be a non-starter when you introduce nested virt I
> think we should just bite the bullet and implement a PV wall clock
> mechanism for arm64 (similar to MSR_KVM_WALL_CLOCK_NEW).

If we decide to implement something like pvclock, shall we increase the priority of this work? It seems there is much to do, including the spec, KVM, qemu and guest kernel driver...

Thanks,

Heyi

>
> This also has the advantage that it is going to be faster than the
> physical counter when that has to be trapped (e.g. after migration).
>
> Steve
>
> .
>



Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Heyi Guo 5 years ago
I also tested save/restore operations, and observed that the clock in the guest would not jump after restoring either. If we consider the guest clock not being synchronized with the real wall clock an issue, does it mean the save/restore function has the same issue?

Thanks,

Heyi


On 2019/3/26 19:54, Heyi Guo wrote:
> Hi Steven,
>
> Thanks for your explanation. Please see my comments below.
>
>
> On 2019/3/21 0:29, Steven Price wrote:
>> On 19/03/2019 14:39, Heyi Guo wrote:
>>> Hi Christoffer and Steve,
>>>
>>>
>>> On 2019/3/15 16:59, Christoffer Dall wrote:
>>>> Hi Steve,
>>>>
>>>> On Wed, Mar 13, 2019 at 10:11:30AM +0000, Steven Price wrote:
>>>>> Personally I think what we need is:
>>>>>
>>>>> * Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0) or
>>>>> alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
>>>>> firing when user space explicitly stops scheduling the guest for a
>>>>> while.
>>>> If we save/restore CNTVCT_EL0 and the warning goes away, does the guest
>>>> wall clock timekeeping get all confused and does it figure this out
>>>> automagically somehow?
>>> What's the source for guest wall clock timekeeping in current ARM64
>>> implementation? Is it the value from CNTP_TVAL_EL0? Will guest OS keep
>>> track of this? Or is the wall clock simply platform RTC?
>>>
>>> I tested the RFC change and did not see anything unusual. Did I miss
>>> something?
>> Are you running ARM or ARM64? I'm assuming ARM64 here...
> Yes, definitely :)
>
>>
>> Unless I'm mistaken you should see the time reported by the guest not
>> progress when you have stopped it using the QEMU monitor console.
>>
>> Running something like "while /bin/true; do date; sleep 1; done" should
>> allow you to see that without the patch time will jump in the guest
>> (i.e. the time reflects wall-clock time). With the patch I believe it
>> will not jump (i.e. the clock will be behind wall-clock time after the
>> pause).
> I did the test and the result was exactly like you said.
>>
>> However, this behaviour does depend on the exact system being emulated.
>> From drivers/clocksource/arm_arch_timer.c:
>>
>>> static void __init arch_counter_register(unsigned type)
>>> {
>>>     u64 start_count;
>>>
>>>     /* Register the CP15 based counter if we have one */
>>>     if (type & ARCH_TIMER_TYPE_CP15) {
>>>         if ((IS_ENABLED(CONFIG_ARM64) && !is_hyp_mode_available()) ||
>>>             arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI)
>>>             arch_timer_read_counter = arch_counter_get_cntvct;
>>>         else
>>>             arch_timer_read_counter = arch_counter_get_cntpct;
>>>
>>>         clocksource_counter.archdata.vdso_direct = vdso_default;
>>>     } else {
>>>         arch_timer_read_counter = arch_counter_get_cntvct_mem;
>>>     }
>> So we can see here that there are three functions for reading the
>> counter - there's the memory interface for CNTVCT, the "CP15" interface
>> also for CNTVCT, and an interface for CNTPCT. CNTPCT is only used for
>> !ARM64 or if hyp mode is available (not relevant until nested
>> virtualisation is added).
>>
>> The upshot is that on arm64 we're using the virtual counter (CNTVCT).
>>
>>>> Does KVM_KVMCLOCK_CTRL solve that problem?
>>> It seems KVM_KVMCLOCK_CTRL is dedicated for guest pause/resume scenario,
>>> but it relies on pvclock both in KVM and Guest and right now only X86
>>> supports it :(
>> KVM_KVMCLOCK_CTRL simply provides a mechanism for the host to inform the
>> guest that a vCPU hasn't been scheduled for "a while". The guest can
>> then use that information to avoid triggering the watchdog timeout. As
>> you note it is only currently implemented for X86.
>>
>>> Could Steve provide more insights about the whole thing?
>> I'll try - see below.
>>
>>> Thanks,
>>> Heyi
>>>
>>>>> * KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
>>>>> guest doesn't see time pass during a suspend.
>>>> This smells like policy to me so I'd much prefer keeping as much
>>>> functionality in user space as possible.  If we already have the APIs we
>>>> need from KVM, let's use them.
>> The problem with suspend/resume is that user space doesn't get a
>> look-in. There's no way (generically) for a user space application to
>> save/restore registers during the suspend. User space simply sees time
>> jump forwards - which is precisely what we're trying to hide from the
>> guest (as it otherwise gets upset that it appears a vCPU has deadlocked
>> for a while).
>>
>> So while I agree this is "policy", it's something we need the kernel to
>> do on our behalf. Potentially we can put it behind an ioctl so that user
>> space can opt-in to it.
>>
>>>>> * Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
>>>>> the guest to query the wall clock time from the host and provides an
>>>>> offset between CNTVCT_EL0 to wall clock time which the KVM can update
>>>>> during suspend/resume. This means that during a suspend/resume the guest
>>>>> can observe that wall clock time has passed, without having to be
>>>>> bothered about CNTVCT_EL0 jumping forwards.
>>>>>
>>>> Isn't the proper Arm architectural solution for this to read the
>>>> physical counter for wall clock time keeping ?
>>>>
>>>> (Yes that will require a trap on physical counter reads after migration
>>>> on current systems, but migration sucks in terms of timekeeping
>>>> already.)
>> The physical counter is certainly tempting as it largely does what we
>> want as a secondary counter. However I'm wary of using it because it
>> starts to get "really interesting" when dealing with nested
>> virtualisation. And by "really interesting" I of course mean horribly
>> broken :)
>>
>> Basically the problem is that with nested virtualisation the offset
>> between the physical counter and the virtual counter is controlled by
>> the virtual-EL2. Because we want certain behaviours of the virtual
>> counter (pausing when the guest pauses) we have to replicate that onto
>> the physical counter to maintain the offset between the two that the
>> guest expects.
>>
>> Given that it seems to be a non-starter when you introduce nested virt I
>> think we should just bite the bullet and implement a PV wall clock
>> mechanism for arm64 (similar to MSR_KVM_WALL_CLOCK_NEW).
>
> If we decide to implement something like pvclock, shall we increase the priority of this work? It seems there is much to do, including the spec, KVM, qemu and guest kernel driver...
>
> Thanks,
>
> Heyi
>
>>
>> This also has the advantage that it is going to be faster than the
>> physical counter when that has to be trapped (e.g. after migration).
>>
>> Steve
>>
>> .
>>
>



Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Steven Price 5 years ago
Hi Heyi,

On 26/03/2019 13:53, Heyi Guo wrote:
> I also tested save/restore operations, and observed that clock in guest
> would not jump after restoring either. If we consider guest clock not
> being synchronized with real wall clock as an issue, does it mean
> save/restore function has the same issue?

Basically at the moment when the guest isn't running you have a choice
of two behaviours:

 1. Stop (i.e. save/restore) CNTVCT - this means that the guest sees no
time occur. If the guest needs to have a concept of wall-clock time
(e.g. it communicates with other systems over a network) then this can
cause problems (e.g. timeouts might be wrong, certificates might appear
to start in the future, etc.).

 2. Leave CNTVCT running - the guest sees the time pass but interprets
the vCPUs as effectively having locked up. Linux will trigger the soft
lockup warning.

There are two ways of solving this, which match the two behaviours above:

 1. Provide the guest with a view of wall-clock time. The obvious way of
doing this is with a pvclock implementation like MSR_KVM_WALL_CLOCK_NEW
for x86.

 2. Inform the guest to ignore the apparent "soft-lockup". There's
already an ioctl for x86 for this: KVM_KVMCLOCK_CTRL
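
For illustration, here is a minimal sketch of how a VMM could use that
ioctl when restarting a paused guest (this is not QEMU's actual code;
KVM_KVMCLOCK_CTRL is a per-vcpu ioctl that takes no argument and sets
the PVCLOCK_GUEST_STOPPED flag in the shared pvclock page):

    #include <linux/kvm.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    /* Call on each vcpu fd of a guest that was paused, before letting
     * it run again, so the guest kernel can tell the apparent stall
     * was host-induced rather than a real lockup. */
    static int mark_guest_paused(int vcpu_fd)
    {
        int err = ioctl(vcpu_fd, KVM_KVMCLOCK_CTRL, 0);
        if (err)
            perror("KVM_KVMCLOCK_CTRL");
        return err;
    }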

My preference is for option 1 - as this gives the guest a good view of
both the time that it is actually executing (useful for internal
watchdog timers like the soft-lockup one in Linux) and maintains a view
of wall-clock time (useful when communicating with other external
services - e.g. a server on the internet). Your patch to QEMU
provides the first step of that, but as you mention there's much more to do.

One thing I haven't investigated in great detail is how KVM handles the
timer during various forms of suspend. In particular for suspend types
like full hibernation the host's physical counter will jump (quite
possibly backwards) - I haven't looked in detail how KVM presents this
to the guest. Hopefully not by making it go backwards!

I'm not sure how much time I'm going to have to look at this in the near
future, but please keep me in the loop if you decide to tackle any of this.

Thanks,

Steve

Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Heyi Guo 5 years ago
Hi Steve,

After reading the kernel code about timekeeping and related areas, I still don't have a clear picture of how we can use MSR_KVM_WALL_CLOCK_NEW to keep the wall clock updated in a guest VM.

1. On x86, MSR_KVM_WALL_CLOCK_NEW is only used by the system suspend and resume callbacks; I didn't find it used for runtime wall clock reading.
2. To use the MSR for wall clock synchronization, shall we register the KVM PV-clock as a higher-rated clocksource, so that it is bound to tk_core.timekeeper and read each time update_wall_time() runs on a timer tick?
3. If the above is true, how can we keep the paravirtualized wall clock always updated? Is it always trapped to the hypervisor? I'm afraid this may cause performance loss. If there is no trap and the data is updated by the hypervisor periodically, how can we guarantee the accuracy?
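
(For context on question 2: this is roughly how the x86 guest registers
kvm-clock today - an excerpt-style sketch paraphrased from memory of
arch/x86/kernel/kvmclock.c, so details may vary between kernel versions.
A rating above the other available clocksources is what makes the
timekeeper prefer it:)

    /* not buildable standalone - these are kernel-internal symbols */
    static struct clocksource kvm_clock = {
        .name   = "kvm-clock",
        .read   = kvm_clock_get_cycles,  /* reads the shared pvclock page */
        .rating = 400,                   /* higher than the raw TSC's 300 */
        .mask   = CLOCKSOURCE_MASK(64),
        .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
    };

    /* registered during early boot, roughly: */
    clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);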

Meanwhile it seems easier to use KVM_KVMCLOCK_CTRL to get rid of the false-positive soft lockup panic: the guest can keep relying on CNTVCT for wall clock updates as it does now, and it seems not difficult for the hypervisor to keep CNTVCT "always on" and "monotonic".
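
(To make that concrete: on x86 the soft-lockup watchdog consults a flag
in the shared pvclock page before declaring a lockup. Below is a minimal
sketch of that check, modelled from memory on the kernel's
kvm_check_and_clear_guest_paused(); the flags pointer is assumed to
address the per-cpu pvclock flags byte:)

    #include <stdbool.h>

    #define PVCLOCK_GUEST_STOPPED (1 << 1)

    /* Returns true once after the VMM issued KVM_KVMCLOCK_CTRL while
     * the guest was stopped; the watchdog then resets its bookkeeping
     * instead of reporting a soft lockup. */
    static bool check_and_clear_guest_paused(volatile unsigned char *flags)
    {
        if (*flags & PVCLOCK_GUEST_STOPPED) {
            *flags &= ~PVCLOCK_GUEST_STOPPED;
            return true;
        }
        return false;
    }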

Please let me know if I've missed something.

Thanks,
Heyi

On 2019/3/27 1:12, Steven Price wrote:
> Hi Heyi,
>
> On 26/03/2019 13:53, Heyi Guo wrote:
>> I also tested save/restore operations, and observed that clock in guest
>> would not jump after restoring either. If we consider guest clock not
>> being synchronized with real wall clock as an issue, does it mean
>> save/restore function has the same issue?
> Basically at the moment when the guest isn't running you have a choice
> of two behaviours:
>
>   1. Stop (i.e. save/restore) CNTVCT - this means that the guest sees no
> time pass. If the guest needs to have a concept of wall-clock time
> (e.g. it communicates with other systems over a network) then this can
> cause problems (e.g. timeouts might be wrong, certificates might appear
> to start in the future, etc.).
>
>   2. Leave CNTVCT running - the guest sees the time pass but interprets
> the vCPUs as effectively having locked up. Linux will trigger the soft
> lockup warning.
>
> There are two ways of solving this, which match the two behaviours above:
>
>   1. Provide the guest with a view of wall-clock time. The obvious way of
> doing this is with a pvclock implementation like MSR_KVM_WALL_CLOCK_NEW
> for x86.
>
>   2. Inform the guest to ignore the apparent "soft-lockup". There's
> already an ioctl for x86 for this: KVM_KVMCLOCK_CTRL
>
> My preference is for option 1 - as this gives the guest a good view of
> both the time that it is actually executing (useful for internal
> watchdog timers like the soft-lockup one in Linux) and maintains a view
> of wall-clock time (useful when communicating with other external
> services - e.g. a server on the internet). Your patch to QEMU
> provides the first step of that, but as you mention there's much more to do.
>
> One thing I haven't investigated in great detail is how KVM handles the
> timer during various forms of suspend. In particular for suspend types
> like full hibernation the host's physical counter will jump (quite
> possibly backwards) - I haven't looked in detail how KVM presents this
> to the guest. Hopefully not by making it go backwards!
>
> I'm not sure how much time I'm going to have to look at this in the near
> future, but please keep me in the loop if you decide to tackle any of this.
>
> Thanks,
>
> Steve
>
>



Re: [Qemu-devel] [RFC] arm/cpu: fix soft lockup panic after resuming from stop
Posted by Steven Price 5 years ago
On 11/04/2019 08:27, Heyi Guo wrote:
> Hi Steve,
> 
> After reading the kernel code about timekeeping and related areas, I
> still don't have a clear picture of how we can use MSR_KVM_WALL_CLOCK_NEW
> to keep the wall clock updated in a guest VM.
> 
> 1. On x86, MSR_KVM_WALL_CLOCK_NEW is only used by the system suspend and
> resume callbacks; I didn't find it used for runtime wall clock reading.

The MSR is only to provide an offset from the guest's view of system
time to the actual wall-clock time. As you mention, it's used on resume
to resynchronise with wall-clock time.

Looking at this more carefully, I think I may have got a bit confused
with how this works on x86. As you say MSR_KVM_WALL_CLOCK_NEW is only
used on boot (and suspend/resume) to provide the difference between the
TSC and wall-clock.

MSR_KVM_SYSTEM_TIME_NEW seems to be the MSR that provides the offset
between the host's TSC and the guest's view of time. In particular the
KVM request KVM_REQ_CLOCK_UPDATE causes this to be updated.
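
For reference, the two shared structures behind those MSRs look roughly
like this (reproduced from memory of x86's pvclock ABI header, so treat
the details as approximate):

    #include <stdint.h>

    typedef uint32_t u32;  /* kernel-style fixed-width names */
    typedef uint64_t u64;
    typedef int8_t   s8;
    typedef uint8_t  u8;

    struct pvclock_wall_clock {      /* set up via MSR_KVM_WALL_CLOCK_NEW */
        u32 version;
        u32 sec;                     /* wall clock at system_time == 0 */
        u32 nsec;
    } __attribute__((__packed__));

    struct pvclock_vcpu_time_info {  /* set up via MSR_KVM_SYSTEM_TIME_NEW */
        u32 version;                 /* odd while the host is updating it */
        u32 pad0;
        u64 tsc_timestamp;           /* host TSC when system_time was sampled */
        u64 system_time;             /* guest "system time" in ns */
        u32 tsc_to_system_mul;       /* scales a TSC delta down to ns */
        s8  tsc_shift;
        u8  flags;                   /* e.g. PVCLOCK_GUEST_STOPPED */
        u8  pad[2];
    } __attribute__((__packed__));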

There's a lot of code in there to deal with e.g. the host's TSC going
backwards (due to buggy hardware, but also due to suspend/hibernate).

> 2. To use the MSR for wall clock synchronization, shall we register the
> KVM PV-clock as a higher-rated clocksource, so that it is bound to
> tk_core.timekeeper and read each time update_wall_time() runs on a timer
> tick?
>
> 3. If the above is true, how can we keep the paravirtualized wall clock
> always updated? Is it always trapped to the hypervisor? I'm afraid this
> may cause performance loss. If there is no trap and the data is updated
> by the hypervisor periodically, how can we guarantee the accuracy?

We obviously don't want to be making frequent hypercalls (or other
traps) - so the idea is to provide the guest with a structure which is
updated when the host is aware something has changed. The guest then
reads this structure (using a version field to avoid races with the
host) and uses it to compute its own version of time.
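
A minimal sketch of that race-free read, modelled on the x86 pvclock
reader (it assumes the pvclock_vcpu_time_info layout sketched earlier
and a hypothetical read_counter() returning the raw counter; the real
kernel also widens the multiply to 128 bits, which this sketch skips):

    static u64 pvclock_read_ns(volatile struct pvclock_vcpu_time_info *ti)
    {
        u32 version;
        u64 delta, ns;

        do {
            version = ti->version;    /* odd means host mid-update */
            __sync_synchronize();     /* read barrier */
            delta = read_counter() - ti->tsc_timestamp;
            if (ti->tsc_shift >= 0)
                delta <<= ti->tsc_shift;
            else
                delta >>= -ti->tsc_shift;
            ns = ti->system_time + ((delta * ti->tsc_to_system_mul) >> 32);
            __sync_synchronize();
        } while ((version & 1) || version != ti->version);

        return ns;
    }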

> Meanwhile it seems easier to use KVM_KVMCLOCK_CTRL to get rid of the
> false-positive soft lockup panic: the guest can keep relying on CNTVCT
> for wall clock updates as it does now, and it seems not difficult for
> the hypervisor to keep CNTVCT "always on" and "monotonic".
> 
> Please let me know if I've missed something.

Yes I'm coming around to that way of thinking - as long as the
hypervisor provides the "always on"/"monotonic" properties then really
the only missing bit is informing the guest when it was paused so it can
"do the right thing". This does appear to be what the x86 code is doing,
but I have to admit I struggle to fully understand it.

Steve

> Thanks,
> Heyi
> 
> On 2019/3/27 1:12, Steven Price wrote:
>> Hi Heyi,
>>
>> On 26/03/2019 13:53, Heyi Guo wrote:
>>> I also tested save/restore operations, and observed that clock in guest
>>> would not jump after restoring either. If we consider guest clock not
>>> being synchronized with real wall clock as an issue, does it mean
>>> save/restore function has the same issue?
>> Basically at the moment when the guest isn't running you have a choice
>> of two behaviours:
>>
>>   1. Stop (i.e. save/restore) CNTVCT - this means that the guest sees no
>> time pass. If the guest needs to have a concept of wall-clock time
>> (e.g. it communicates with other systems over a network) then this can
>> cause problems (e.g. timeouts might be wrong, certificates might appear
>> to start in the future, etc.).
>>
>>   2. Leave CNTVCT running - the guest sees the time pass but interprets
>> the vCPUs as effectively having locked up. Linux will trigger the soft
>> lockup warning.
>>
>> There are two ways of solving this, which match the two behaviours above:
>>
>>   1. Provide the guest with a view of wall-clock time. The obvious way of
>> doing this is with a pvclock implementation like MSR_KVM_WALL_CLOCK_NEW
>> for x86.
>>
>>   2. Inform the guest to ignore the apparent "soft-lockup". There's
>> already an ioctl for x86 for this: KVM_KVMCLOCK_CTRL
>>
>> My preference is for option 1 - as this gives the guest a good view of
>> both the time that it is actually executing (useful for internal
>> watchdog timers like the soft-lockup one in Linux) and maintains a view
>> of wall-clock time (useful when communicating with other external
>> services - e.g. a server on the internet). Your patch to QEMU
>> provides the first step of that, but as you mention there's much more
>> to do.
>>
>> One thing I haven't investigated in great detail is how KVM handles the
>> timer during various forms of suspend. In particular for suspend types
>> like full hibernation the host's physical counter will jump (quite
>> possibly backwards) - I haven't looked in detail how KVM presents this
>> to the guest. Hopefully not by making it go backwards!
>>
>> I'm not sure how much time I'm going to have to look at this in the near
>> future, but please keep me in the loop if you decide to tackle any of
>> this.
>>
>> Thanks,
>>
>> Steve
>>
>>
> 
> 