arch/arm64/kvm/arch_timer.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-)
During automated testing of Nested Virtualization using avocado-vt,
it has been observed that during some boot test iterations,
the Guest-Hypervisor boot was getting crashed with a
synchronous exception while it is still booting EDK2.
The test is launching Multiple instances of Guest-Hypervisor boot
and while booting, QEMU monitor issued the command "info register"
at regular intervals to take a register dump. To execute this
command, QEMU stops the run and does the register read of various
registers. While resuming the run, the function kvm_arm_timer_write()
writes back the saved CNTV_CTL_EL0 register with ISTATUS cleared always
and resulting in the loss of pending interrupt for emulated timers.
In hardware based timers, ISTATUS is a RO/WI bit and gets set by the
h/w, if the condition is still met. However, in Nested-Virtualization
case, the Guest-Hypervisor's EDK2 is using an emulated virtual timer
and losing ISTATUS state and the interrupt forever.
Adding fix in kvm_arm_timer_write to set ISTATUS for emulated
timers, if the timer expired already.
Fixes: 81dc9504a700 ("KVM: arm64: nv: timers: Support hyp timer emulation")
Co-developed-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com>
Signed-off-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com>
Signed-off-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
---
arch/arm64/kvm/arch_timer.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 1215df590418..aca58113d790 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -1199,7 +1199,16 @@ static void kvm_arm_timer_write(struct kvm_vcpu *vcpu,
break;
case TIMER_REG_CTL:
- timer_set_ctl(timer, val & ~ARCH_TIMER_CTRL_IT_STAT);
+ struct timer_map map;
+
+ val &= ~ARCH_TIMER_CTRL_IT_STAT;
+ get_timer_map(vcpu, &map);
+ /* Set ISTATUS bit for emulated timers, if timer expired. */
+ if (timer == map.emul_vtimer || timer == map.emul_ptimer) {
+ if (!kvm_timer_compute_delta(timer))
+ val |= ARCH_TIMER_CTRL_IT_STAT;
+ }
+ timer_set_ctl(timer, val);
break;
case TIMER_REG_CVAL:
--
2.47.0
Hi Ganapatrao,
kernel test robot noticed the following build warnings:
[auto build test WARNING on kvmarm/next]
[also build test WARNING on arm64/for-next/core soc/for-next linus/master arm/for-next arm/fixes v6.13-rc2 next-20241209]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ganapatrao-Kulkarni/KVM-arm64-nv-Set-ISTATUS-for-emulated-timers-If-timer-expired/20241209-133651
base: https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next
patch link: https://lore.kernel.org/r/20241209053201.339939-1-gankulkarni%40os.amperecomputing.com
patch subject: [PATCH] KVM: arm64: nv: Set ISTATUS for emulated timers, If timer expired
config: arm64-allmodconfig (https://download.01.org/0day-ci/archive/20241210/202412100311.HEYAT0bx-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241210/202412100311.HEYAT0bx-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412100311.HEYAT0bx-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> arch/arm64/kvm/arch_timer.c:1202:3: warning: label followed by a declaration is a C23 extension [-Wc23-extensions]
1202 | struct timer_map map;
| ^
1 warning generated.
vim +1202 arch/arm64/kvm/arch_timer.c
1190
1191 static void kvm_arm_timer_write(struct kvm_vcpu *vcpu,
1192 struct arch_timer_context *timer,
1193 enum kvm_arch_timer_regs treg,
1194 u64 val)
1195 {
1196 switch (treg) {
1197 case TIMER_REG_TVAL:
1198 timer_set_cval(timer, kvm_phys_timer_read() - timer_get_offset(timer) + (s32)val);
1199 break;
1200
1201 case TIMER_REG_CTL:
> 1202 struct timer_map map;
1203
1204 val &= ~ARCH_TIMER_CTRL_IT_STAT;
1205 get_timer_map(vcpu, &map);
1206 /* Set ISTATUS bit for emulated timers, if timer expired. */
1207 if (timer == map.emul_vtimer || timer == map.emul_ptimer) {
1208 if (!kvm_timer_compute_delta(timer))
1209 val |= ARCH_TIMER_CTRL_IT_STAT;
1210 }
1211 timer_set_ctl(timer, val);
1212 break;
1213
1214 case TIMER_REG_CVAL:
1215 timer_set_cval(timer, val);
1216 break;
1217
1218 case TIMER_REG_VOFF:
1219 *timer->offset.vcpu_offset = val;
1220 break;
1221
1222 default:
1223 BUG();
1224 }
1225 }
1226
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Ganapatrao,
Did you notice that the Columbia list was killed over two years ago,
as per ac107abef1976 ("KVM: arm64: Advertise new kvmarm mailing
list")? Consider that script/get_maintainer.pl is your friend.
Cc'ing the correct list instead,
On Mon, 09 Dec 2024 05:32:01 +0000,
Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>
> During automated testing of Nested Virtualization using avocado-vt,
Which is not merged upstream. So what branch are you using? Based on
what kernel version? On what HW? With which virtualisation features?
> it has been observed that during some boot test iterations,
> the Guest-Hypervisor boot was getting crashed with a
> synchronous exception while it is still booting EDK2.
>
> The test is launching Multiple instances of Guest-Hypervisor boot
Is the multiple instance aspect really relevant to the reproduction of
the problem?
> and while booting, QEMU monitor issued the command "info register"
> at regular intervals to take a register dump. To execute this
> command, QEMU stops the run and does the register read of various
> registers. While resuming the run, the function kvm_arm_timer_write()
> writes back the saved CNTV_CTL_EL0 register with ISTATUS cleared always
It is userspace that causes this write-back, right? AFAICT, KVM never
does that on its own.
> and resulting in the loss of pending interrupt for emulated timers.
How does a missing interrupt result in a synchronous exception in
EDK2? In my experience, EDK2 panics if it sees spurious interrupts,
not when it is missing interrupts (it just locks up, which is
expected).
> In hardware based timers, ISTATUS is a RO/WI bit and gets set by the
> h/w, if the condition is still met. However, in Nested-Virtualization
> case, the Guest-Hypervisor's EDK2 is using an emulated virtual timer
> and losing ISTATUS state and the interrupt forever.
Why is this specific to NV? Can't the same thing happen to the
physical timer in a non-VHE configuration?
>
> Adding fix in kvm_arm_timer_write to set ISTATUS for emulated
> timers, if the timer expired already.
>
> Fixes: 81dc9504a700 ("KVM: arm64: nv: timers: Support hyp timer emulation")
Where is this coming from? This patch doesn't touch the code you are
changing, so how can it be the source of your problems? As far as I
can tell, this has been the case since 5c5196da4e966 ("KVM: arm/arm64:
Support EL1 phys timer register access in set/get reg").
> Co-developed-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com>
> Signed-off-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com>
> Signed-off-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
> ---
> arch/arm64/kvm/arch_timer.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 1215df590418..aca58113d790 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -1199,7 +1199,16 @@ static void kvm_arm_timer_write(struct kvm_vcpu *vcpu,
> break;
>
> case TIMER_REG_CTL:
> - timer_set_ctl(timer, val & ~ARCH_TIMER_CTRL_IT_STAT);
> + struct timer_map map;
> +
> + val &= ~ARCH_TIMER_CTRL_IT_STAT;
> + get_timer_map(vcpu, &map);
> + /* Set ISTATUS bit for emulated timers, if timer expired. */
> + if (timer == map.emul_vtimer || timer == map.emul_ptimer) {
> + if (!kvm_timer_compute_delta(timer))
> + val |= ARCH_TIMER_CTRL_IT_STAT;
> + }
> + timer_set_ctl(timer, val);
> break;
This really looks awfully complicated, when it is only a matter of
recomputing the interrupt state, something that is at the core of the
timer emulation. Why can't the following work:
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 895f09658ef83..bd6efafbe7109 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -1298,13 +1298,17 @@ static void kvm_arm_timer_write(struct kvm_vcpu *vcpu,
enum kvm_arch_timer_regs treg,
u64 val)
{
+ unsigned long tmp = val;
+
switch (treg) {
case TIMER_REG_TVAL:
timer_set_cval(timer, kvm_phys_timer_read() - timer_get_offset(timer) + (s32)val);
break;
case TIMER_REG_CTL:
- timer_set_ctl(timer, val & ~ARCH_TIMER_CTRL_IT_STAT);
+ __assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &tmp,
+ kvm_timer_should_fire(timer));
+ timer_set_ctl(timer, tmp);
break;
case TIMER_REG_CVAL:
But overall, this very much looks like it is only papering over the
real issue, which is that the *emulation* should regenerate the
pending bit, and not rely on the userspace interface.
As far as I can tell, we already correctly compute the status bit on
read (read_timer_ctl()), so the guest should always observe something
consistent when it traps. We also *never* use the status bit as an
input to the emulation, and always recompute it from scratch (it is
only there for the benefit of the guest or userspace).
So I can't see how upstream is broken in at the moment, and you need
to explain how this actually triggers. Ideally, with a standalone
reproducer or a selftest.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
Hi Marc,
On 09-12-2024 03:24 pm, Marc Zyngier wrote:
> Ganapatrao,
>
> Did you notice that the Columbia list was killed over two years ago,
> as per ac107abef1976 ("KVM: arm64: Advertise new kvmarm mailing
> list")? Consider that script/get_maintainer.pl is your friend.
My bad, copy paste issue!.
>
> Cc'ing the correct list instead,
Thanks.
>
> On Mon, 09 Dec 2024 05:32:01 +0000,
> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>>
>> During automated testing of Nested Virtualization using avocado-vt,
>
> Which is not merged upstream. So what branch are you using? Based on
> what kernel version? On what HW? With which virtualisation features?
>
Testing is done on Ampere's AmpereOne platform using 6.10 based kernel
with NV patches from your repo.
>> it has been observed that during some boot test iterations,
>> the Guest-Hypervisor boot was getting crashed with a
>> synchronous exception while it is still booting EDK2.
>>
>> The test is launching Multiple instances of Guest-Hypervisor boot
>
> Is the multiple instance aspect really relevant to the reproduction of
> the problem?
Not really, but it requires multiple attempts/iterations to hit the
issue. Even with automated test, it was seen at some iteration out of 10
to 15 iterations.
>
>> and while booting, QEMU monitor issued the command "info register"
>> at regular intervals to take a register dump. To execute this
>> command, QEMU stops the run and does the register read of various
>> registers. While resuming the run, the function kvm_arm_timer_write()
>> writes back the saved CNTV_CTL_EL0 register with ISTATUS cleared always
>
> It is userspace that causes this write-back, right? AFAICT, KVM never
> does that on its own.
>
>> and resulting in the loss of pending interrupt for emulated timers.
>
> How does a missing interrupt result in a synchronous exception in
> EDK2? In my experience, EDK2 panics if it sees spurious interrupts,
> not when it is missing interrupts (it just locks up, which is
> expected).
Not sure, why it is hitting exception, rather than hang at EDK2.
However, EDK2 timer handler code is ignoring the interrupt since ISTATUS
is not set and not moving CVAL forward.
>
>> In hardware based timers, ISTATUS is a RO/WI bit and gets set by the
>> h/w, if the condition is still met. However, in Nested-Virtualization
>> case, the Guest-Hypervisor's EDK2 is using an emulated virtual timer
>> and losing ISTATUS state and the interrupt forever.
>
> Why is this specific to NV? Can't the same thing happen to the
> physical timer in a non-VHE configuration?
>
You mean, emulated v-timer in non-VHE boot?
It might impact non-VHE case as well, not tried though.
>>
>> Adding fix in kvm_arm_timer_write to set ISTATUS for emulated
>> timers, if the timer expired already.
>>
>> Fixes: 81dc9504a700 ("KVM: arm64: nv: timers: Support hyp timer emulation")
>
> Where is this coming from? This patch doesn't touch the code you are
> changing, so how can it be the source of your problems? As far as I
> can tell, this has been the case since 5c5196da4e966 ("KVM: arm/arm64:
> Support EL1 phys timer register access in set/get reg").
Thanks, 5c5196da4e966 seems more relevant!.
>> Co-developed-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com>
>> Signed-off-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com>
>> Signed-off-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
>> ---
>> arch/arm64/kvm/arch_timer.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
>> index 1215df590418..aca58113d790 100644
>> --- a/arch/arm64/kvm/arch_timer.c
>> +++ b/arch/arm64/kvm/arch_timer.c
>> @@ -1199,7 +1199,16 @@ static void kvm_arm_timer_write(struct kvm_vcpu *vcpu,
>> break;
>>
>> case TIMER_REG_CTL:
>> - timer_set_ctl(timer, val & ~ARCH_TIMER_CTRL_IT_STAT);
>> + struct timer_map map;
>> +
>> + val &= ~ARCH_TIMER_CTRL_IT_STAT;
>> + get_timer_map(vcpu, &map);
>> + /* Set ISTATUS bit for emulated timers, if timer expired. */
>> + if (timer == map.emul_vtimer || timer == map.emul_ptimer) {
>> + if (!kvm_timer_compute_delta(timer))
>> + val |= ARCH_TIMER_CTRL_IT_STAT;
>> + }
>> + timer_set_ctl(timer, val);
>> break;
>
> This really looks awfully complicated, when it is only a matter of
> recomputing the interrupt state, something that is at the core of the
> timer emulation. Why can't the following work:
Added check to handle ISTATUS for emulated timers only, since it is RO
for hardware timers, thought it might confuse others(non-nv case).
Otherwise below diff from you should work as well. I will try it out.
>
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 895f09658ef83..bd6efafbe7109 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -1298,13 +1298,17 @@ static void kvm_arm_timer_write(struct kvm_vcpu *vcpu,
> enum kvm_arch_timer_regs treg,
> u64 val)
> {
> + unsigned long tmp = val;
> +
> switch (treg) {
> case TIMER_REG_TVAL:
> timer_set_cval(timer, kvm_phys_timer_read() - timer_get_offset(timer) + (s32)val);
> break;
>
> case TIMER_REG_CTL:
> - timer_set_ctl(timer, val & ~ARCH_TIMER_CTRL_IT_STAT);
> + __assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &tmp,
> + kvm_timer_should_fire(timer));
> + timer_set_ctl(timer, tmp);
> break;
>
> case TIMER_REG_CVAL:
>
> But overall, this very much looks like it is only papering over the
> real issue, which is that the *emulation* should regenerate the
> pending bit, and not rely on the userspace interface.
>
> As far as I can tell, we already correctly compute the status bit on
> read (read_timer_ctl()), so the guest should always observe something
> consistent when it traps. We also *never* use the status bit as an
> input to the emulation, and always recompute it from scratch (it is
> only there for the benefit of the guest or userspace).
>
For emulated timers, we are not asserting again by calling
kvm_timer_update_irq in timer_emulate() until the level is down and
ready for trigger again. This was done to fix high rate of spurious
interrupts getting generated to V-Timer. Hence we are not able to
recover, if once ISTATUS is lost.
> So I can't see how upstream is broken in at the moment, and you need
> to explain how this actually triggers. Ideally, with a standalone
> reproducer or a selftest.
We could reproduce the issue with the simple test/script.
On one shell, launch L1 using qemu with add-on option
"-monitor unix:gh_monitor,server,nowait
On another shell, while L1 boots and still in UEFI, run repeatedly the
command (or put in a while 1 loop script)
"echo "info registers" | socat - unix-connect:gh_monitor >
/tmp/info_registers"
With above steps we were able to hit the issue within few attempts.
--
Thanks,
Ganapat/GK
On Mon, 09 Dec 2024 12:25:34 +0000,
Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
> >>
> >> During automated testing of Nested Virtualization using avocado-vt,
> >
> > Which is not merged upstream. So what branch are you using? Based on
> > what kernel version? On what HW? With which virtualisation features?
> >
>
> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel
> with NV patches from your repo.
Grmbl... *Which* patches? At least give me the SHA1 of the branch,
because I have no idea what you are running. And 6.10 is definitely
not something I care about. If you're using the NV patches, the
*minimum* you should run is 6.13-rc1, because that's what the current
code is based on.
Also, does this machine have FEAT_ECV?
>
> >> it has been observed that during some boot test iterations,
> >> the Guest-Hypervisor boot was getting crashed with a
> >> synchronous exception while it is still booting EDK2.
> >>
> >> The test is launching Multiple instances of Guest-Hypervisor boot
> >
> > Is the multiple instance aspect really relevant to the reproduction of
> > the problem?
>
> Not really, but it requires multiple attempts/iterations to hit the
> issue. Even with automated test, it was seen at some iteration out of
> 10 to 15 iterations.
>
> >
> >> and while booting, QEMU monitor issued the command "info register"
> >> at regular intervals to take a register dump. To execute this
> >> command, QEMU stops the run and does the register read of various
> >> registers. While resuming the run, the function kvm_arm_timer_write()
> >> writes back the saved CNTV_CTL_EL0 register with ISTATUS cleared always
> >
> > It is userspace that causes this write-back, right? AFAICT, KVM never
> > does that on its own.
> >
> >> and resulting in the loss of pending interrupt for emulated timers.
> >
> > How does a missing interrupt result in a synchronous exception in
> > EDK2? In my experience, EDK2 panics if it sees spurious interrupts,
> > not when it is missing interrupts (it just locks up, which is
> > expected).
>
> Not sure, why it is hitting exception, rather than hang at EDK2.
> However, EDK2 timer handler code is ignoring the interrupt since
> ISTATUS is not set and not moving CVAL forward.
How is EDK2 getting this exception? Is this injected by KVM? Or is
that some EDK2 bug?
>
> >
> >> In hardware based timers, ISTATUS is a RO/WI bit and gets set by the
> >> h/w, if the condition is still met. However, in Nested-Virtualization
> >> case, the Guest-Hypervisor's EDK2 is using an emulated virtual timer
> >> and losing ISTATUS state and the interrupt forever.
> >
> > Why is this specific to NV? Can't the same thing happen to the
> > physical timer in a non-VHE configuration?
> >
>
> You mean, emulated v-timer in non-VHE boot?
Emulated *physical* timer.
> It might impact non-VHE case as well, not tried though.
Can you please try?
[...]
> > But overall, this very much looks like it is only papering over the
> > real issue, which is that the *emulation* should regenerate the
> > pending bit, and not rely on the userspace interface.
> >
> > As far as I can tell, we already correctly compute the status bit on
> > read (read_timer_ctl()), so the guest should always observe something
> > consistent when it traps. We also *never* use the status bit as an
> > input to the emulation, and always recompute it from scratch (it is
> > only there for the benefit of the guest or userspace).
> >
>
> For emulated timers, we are not asserting again by calling
> kvm_timer_update_irq in timer_emulate() until the level is down and
> ready for trigger again. This was done to fix high rate of spurious
> interrupts getting generated to V-Timer. Hence we are not able to
> recover, if once ISTATUS is lost.
Again, a trapping read should see the correct value, since we populate
that bit at read time.
> > So I can't see how upstream is broken in at the moment, and you need
> > to explain how this actually triggers. Ideally, with a standalone
> > reproducer or a selftest.
>
> We could reproduce the issue with the simple test/script.
> On one shell, launch L1 using qemu with add-on option
>
> "-monitor unix:gh_monitor,server,nowait
>
> On another shell, while L1 boots and still in UEFI, run repeatedly the
> command (or put in a while 1 loop script)
>
> "echo "info registers" | socat - unix-connect:gh_monitor >
> /tmp/info_registers"
>
> With above steps we were able to hit the issue within few attempts.
That's not a standalone reproducer. QEMU doesn't support NV, and
kvmtool doesn't have this sort of interface. I was asking for a bit of
C code that I could run directly, not something that requires me to
drag even more experimental code.
So here's my current guess, since you don't give me the needed
information. For what you describe to happen, I can only see two
possibilities:
- either your HW doesn't have FEAT_ECV, in which case the guest
directly reads from memory
- or you are running with something like this patch [1], and we serve
the guest by reading from memory very early, without returning to
the bulk of the emulation code
In either case, we only publish the updated status if the current IRQ
state is different from the computed output of the timer while
performing the emulation.
So if you were writing back a status bit set to 0 while the interrupt
was already pending, we'd deliver an interrupt, but not recompute the
status. The guest would consider the interrupt as spurious, not touch
the timer, and we'd never make forward progress. Rinse, repeat.
Assuming I got the analysis right, it would only be a matter of
hoisting the publication of the status into timer_emulate(), so that
it is made up to date on load.
Please give the fixup below a go.
M.
[1] https://lore.kernel.org/all/20241202172134.384923-6-maz@kernel.org/
From 2bbd6f9b41a20ad573376c20c158ff3c12db5009 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz@kernel.org>
Date: Mon, 9 Dec 2024 10:58:08 +0000
Subject: [PATCH] fixup! KVM: arm64: nv: Publish emulated timer interrupt state
in the in-memory state
---
arch/arm64/kvm/arch_timer.c | 32 +++++++++++++-------------------
1 file changed, 13 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 895f09658ef83..91bda986c344b 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -432,25 +432,6 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
{
int ret;
- /*
- * Paper over NV2 brokenness by publishing the interrupt status
- * bit. This still results in a poor quality of emulation (guest
- * writes will have no effect until the next exit).
- *
- * But hey, it's fast, right?
- */
- if (is_hyp_ctxt(vcpu) &&
- (timer_ctx == vcpu_vtimer(vcpu) || timer_ctx == vcpu_ptimer(vcpu))) {
- u32 ctl = timer_get_ctl(timer_ctx);
-
- if (new_level)
- ctl |= ARCH_TIMER_CTRL_IT_STAT;
- else
- ctl &= ~ARCH_TIMER_CTRL_IT_STAT;
-
- timer_set_ctl(timer_ctx, ctl);
- }
-
timer_ctx->irq.level = new_level;
trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_irq(timer_ctx),
timer_ctx->irq.level);
@@ -471,6 +452,19 @@ static void timer_emulate(struct arch_timer_context *ctx)
trace_kvm_timer_emulate(ctx, should_fire);
+ /*
+ * Paper over NV2 brokenness by publishing the interrupt status
+ * bit. This still results in a poor quality of emulation (guest
+ * writes will have no effect until the next exit).
+ *
+ * But hey, it's fast, right?
+ */
+ if (is_hyp_ctxt(ctx->vcpu)) {
+ unsigned long val = timer_get_ctl(ctx);
+ __assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &val, should_fire);
+ timer_set_ctl(ctx, val);
+ }
+
if (should_fire != ctx->irq.level) {
kvm_timer_update_irq(ctx->vcpu, should_fire, ctx);
return;
--
2.39.2
--
Without deviation from the norm, progress is not possible.
> Please give the fixup below a go.
>
> M.
>
> [1] https://lore.kernel.org/all/20241202172134.384923-6-maz@kernel.org/
>
> From 2bbd6f9b41a20ad573376c20c158ff3c12db5009 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier <maz@kernel.org>
> Date: Mon, 9 Dec 2024 10:58:08 +0000
> Subject: [PATCH] fixup! KVM: arm64: nv: Publish emulated timer interrupt state
> in the in-memory state
>
> ---
> arch/arm64/kvm/arch_timer.c | 32 +++++++++++++-------------------
> 1 file changed, 13 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 895f09658ef83..91bda986c344b 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -432,25 +432,6 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
> {
> int ret;
>
> - /*
> - * Paper over NV2 brokenness by publishing the interrupt status
> - * bit. This still results in a poor quality of emulation (guest
> - * writes will have no effect until the next exit).
> - *
> - * But hey, it's fast, right?
> - */
> - if (is_hyp_ctxt(vcpu) &&
> - (timer_ctx == vcpu_vtimer(vcpu) || timer_ctx == vcpu_ptimer(vcpu))) {
> - u32 ctl = timer_get_ctl(timer_ctx);
> -
> - if (new_level)
> - ctl |= ARCH_TIMER_CTRL_IT_STAT;
> - else
> - ctl &= ~ARCH_TIMER_CTRL_IT_STAT;
> -
> - timer_set_ctl(timer_ctx, ctl);
> - }
> -
> timer_ctx->irq.level = new_level;
> trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_irq(timer_ctx),
> timer_ctx->irq.level);
> @@ -471,6 +452,19 @@ static void timer_emulate(struct arch_timer_context *ctx)
>
> trace_kvm_timer_emulate(ctx, should_fire);
>
> + /*
> + * Paper over NV2 brokenness by publishing the interrupt status
> + * bit. This still results in a poor quality of emulation (guest
> + * writes will have no effect until the next exit).
> + *
> + * But hey, it's fast, right?
> + */
> + if (is_hyp_ctxt(ctx->vcpu)) {
> + unsigned long val = timer_get_ctl(ctx);
> + __assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &val, should_fire);
> + timer_set_ctl(ctx, val);
> + }
> +
> if (should_fire != ctx->irq.level) {
> kvm_timer_update_irq(ctx->vcpu, should_fire, ctx);
> return;
I tried this patch and it did not work, looks like there are some
exit/entry paths which updates timer interrupt without going though
timer_emulate (like kvm_hrtimer_expire). If I undo the deleted lines in
above diff in the function kvm_timer_update_irq, then it is working.
below is the diff which is working.
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index dd038d62e99b..a54bdb6dc566 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -432,23 +432,11 @@ static void kvm_timer_update_irq(struct kvm_vcpu
*vcpu, bool new_level,
{
int ret;
- /*
- * Paper over NV2 brokenness by publishing the interrupt status
- * bit. This still results in a poor quality of emulation (guest
- * writes will have no effect until the next exit).
- *
- * But hey, it's fast, right?
- */
if (is_hyp_ctxt(vcpu) &&
(timer_ctx == vcpu_vtimer(vcpu) || timer_ctx ==
vcpu_ptimer(vcpu))) {
- u32 ctl = timer_get_ctl(timer_ctx);
-
- if (new_level)
- ctl |= ARCH_TIMER_CTRL_IT_STAT;
- else
- ctl &= ~ARCH_TIMER_CTRL_IT_STAT;
-
- timer_set_ctl(timer_ctx, ctl);
+ unsigned long val = timer_get_ctl(timer_ctx);
+ __assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &val,
new_level);
+ timer_set_ctl(timer_ctx, val);
}
timer_ctx->irq.level = new_level;
@@ -471,6 +459,19 @@ static void timer_emulate(struct arch_timer_context
*ctx)
trace_kvm_timer_emulate(ctx, should_fire);
+ /*
+ * Paper over NV2 brokenness by publishing the interrupt status
+ * bit. This still results in a poor quality of emulation (guest
+ * writes will have no effect until the next exit).
+ *
+ * But hey, it's fast, right?
+ */
+ if (is_hyp_ctxt(ctx->vcpu)) {
+ unsigned long val = timer_get_ctl(ctx);
+ __assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &val,
should_fire);
+ timer_set_ctl(ctx, val);
+ }
+
if (should_fire != ctx->irq.level) {
kvm_timer_update_irq(ctx->vcpu, should_fire, ctx);
return;
@@ -948,9 +949,6 @@ void kvm_timer_sync_nested(struct kvm_vcpu *vcpu)
* which allows trapping of the timer registers even with NV2.
* Still, this is still worse than FEAT_NV on its own. Meh.
*/
- if (cpus_have_final_cap(ARM64_HAS_ECV) || !is_hyp_ctxt(vcpu))
- return;
-
if (!vcpu_el2_e2h_is_set(vcpu)) {
/*
* A non-VHE guest hypervisor doesn't have any direct
access
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2c35e1f8193e..52a263e7dcca 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1238,7 +1238,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
if (static_branch_unlikely(&userspace_irqchip_in_use))
kvm_timer_sync_user(vcpu);
- if (vcpu_has_nv(vcpu))
+ if (is_hyp_ctxt(vcpu))
kvm_timer_sync_nested(vcpu);
kvm_arch_vcpu_ctxsync_fp(vcpu);
--
Thanks,
Ganapat/GK
On 09-12-2024 06:50 pm, Marc Zyngier wrote:
> On Mon, 09 Dec 2024 12:25:34 +0000,
> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>>>>
>>>> During automated testing of Nested Virtualization using avocado-vt,
>>>
>>> Which is not merged upstream. So what branch are you using? Based on
>>> what kernel version? On what HW? With which virtualisation features?
>>>
>>
>> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel
>> with NV patches from your repo.
>
> Grmbl... *Which* patches? At least give me the SHA1 of the branch,
> because I have no idea what you are running. And 6.10 is definitely
> not something I care about. If you're using the NV patches, the
> *minimum* you should run is 6.13-rc1, because that's what the current
> code is based on.
>
I tried 6.13-rc1 based nv-next branch today, which failed to boot
UEFI as L1. Yet to debug this.
QEMU from Eric's repo is used for the testing.
https://github.com/eauger/qemu/tree/v9.0-nv-rfcv3
> Also, does this machine have FEAT_ECV?
Yes!
>
>>
>>>> it has been observed that during some boot test iterations,
>>>> the Guest-Hypervisor boot was getting crashed with a
>>>> synchronous exception while it is still booting EDK2.
>>>>
>>>> The test is launching Multiple instances of Guest-Hypervisor boot
>>>
>>> Is the multiple instance aspect really relevant to the reproduction of
>>> the problem?
>>
>> Not really, but it requires multiple attempts/iterations to hit the
>> issue. Even with automated test, it was seen at some iteration out of
>> 10 to 15 iterations.
>>
>>>
>>>> and while booting, QEMU monitor issued the command "info register"
>>>> at regular intervals to take a register dump. To execute this
>>>> command, QEMU stops the run and does the register read of various
>>>> registers. While resuming the run, the function kvm_arm_timer_write()
>>>> writes back the saved CNTV_CTL_EL0 register with ISTATUS cleared always
>>>
>>> It is userspace that causes this write-back, right? AFAICT, KVM never
>>> does that on its own.
>>>
>>>> and resulting in the loss of pending interrupt for emulated timers.
>>>
>>> How does a missing interrupt result in a synchronous exception in
>>> EDK2? In my experience, EDK2 panics if it sees spurious interrupts,
>>> not when it is missing interrupts (it just locks up, which is
>>> expected).
>>
>> Not sure, why it is hitting exception, rather than hang at EDK2.
>> However, EDK2 timer handler code is ignoring the interrupt since
>> ISTATUS is not set and not moving CVAL forward.
>
> How is EDK2 getting this exception? Is this injected by KVM? Or is
> that some EDK2 bug?
>
>>
>>>
>>>> In hardware based timers, ISTATUS is a RO/WI bit and gets set by the
>>>> h/w, if the condition is still met. However, in Nested-Virtualization
>>>> case, the Guest-Hypervisor's EDK2 is using an emulated virtual timer
>>>> and losing ISTATUS state and the interrupt forever.
>>>
>>> Why is this specific to NV? Can't the same thing happen to the
>>> physical timer in a non-VHE configuration?
>>>
>>
>> You mean, emulated v-timer in non-VHE boot?
>
> Emulated *physical* timer.
>
>> It might impact non-VHE case as well, not tried though.
>
> Can you please try?
Sure, I will try non-VHE as well.
>
> [...]
>
>>> But overall, this very much looks like it is only papering over the
>>> real issue, which is that the *emulation* should regenerate the
>>> pending bit, and not rely on the userspace interface.
>>>
>>> As far as I can tell, we already correctly compute the status bit on
>>> read (read_timer_ctl()), so the guest should always observe something
>>> consistent when it traps. We also *never* use the status bit as an
>>> input to the emulation, and always recompute it from scratch (it is
>>> only there for the benefit of the guest or userspace).
>>>
>>
>> For emulated timers, we are not asserting again by calling
>> kvm_timer_update_irq in timer_emulate() until the level is down and
>> ready for trigger again. This was done to fix high rate of spurious
>> interrupts getting generated to V-Timer. Hence we are not able to
>> recover, if once ISTATUS is lost.
>
> Again, a trapping read should see the correct value, since we populate
> that bit at read time.
>
>>> So I can't see how upstream is broken in at the moment, and you need
>>> to explain how this actually triggers. Ideally, with a standalone
>>> reproducer or a selftest.
>>
>> We could reproduce the issue with the simple test/script.
>> On one shell, launch L1 using qemu with add-on option
>>
>> "-monitor unix:gh_monitor,server,nowait
>>
>> On another shell, while L1 boots and still in UEFI, run repeatedly the
>> command (or put in a while 1 loop script)
>>
>> "echo "info registers" | socat - unix-connect:gh_monitor >
>> /tmp/info_registers"
>>
>> With above steps we were able to hit the issue within few attempts.
>
> That's not a standalone reproducer. QEMU doesn't support NV, and
> kvmtool doesn't have this sort of interface. I was asking for a bit of
> C code that I could run directly, not something that requires me to
> drag even more experimental code.
>
> So here's my current guess, since you don't give me the needed
> information. For what you describe to happen, I can only see two
> possibilities:
>
> - either your HW doesn't have FEAT_ECV, in which case the guest
> directly reads from memory
>
We do have the FEAT_ECV on AmpereOne, I was the one reported/fixed bug
with FEAT_ECV(CNTPOFF offset issue) in the past.
> - or you are running with something like this patch [1], and we serve
> the guest by reading from memory very early, without returning to
> the bulk of the emulation code
I see the kernel I am testing has this patch[1].
>
> In either case, we only publish the updated status if the current IRQ
> state is different from the computed output of the timer while
> performing the emulation.
>
> So if you were writing back a status bit set to 0 while the interrupt
> was already pending, we'd deliver an interrupt, but not recompute the
> status. The guest would consider the interrupt as spurious, not touch
> the timer, and we'd never make forward progress. Rinse, repeat.
>
> Assuming I got the analysis right,
Yes, this is what I tried to explain. LR shows pending, but UEFI is not
consuming and treating it as spurious since ISTATUS is not set.
it would only be a matter of
> hoisting the publication of the status into timer_emulate(), so that
> it is made up to date on load.
>
> Please give the fixup below a go.
>
Sure, I will give a try with below diff and let you know tomorrow.
This should work, I remember, this was the one of the option/fix that we
tried as fix while debugging.
>
> [1] https://lore.kernel.org/all/20241202172134.384923-6-maz@kernel.org/
>
> From 2bbd6f9b41a20ad573376c20c158ff3c12db5009 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier <maz@kernel.org>
> Date: Mon, 9 Dec 2024 10:58:08 +0000
> Subject: [PATCH] fixup! KVM: arm64: nv: Publish emulated timer interrupt state
> in the in-memory state
>
> ---
> arch/arm64/kvm/arch_timer.c | 32 +++++++++++++-------------------
> 1 file changed, 13 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 895f09658ef83..91bda986c344b 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -432,25 +432,6 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
> {
> int ret;
>
> - /*
> - * Paper over NV2 brokenness by publishing the interrupt status
> - * bit. This still results in a poor quality of emulation (guest
> - * writes will have no effect until the next exit).
> - *
> - * But hey, it's fast, right?
> - */
> - if (is_hyp_ctxt(vcpu) &&
> - (timer_ctx == vcpu_vtimer(vcpu) || timer_ctx == vcpu_ptimer(vcpu))) {
> - u32 ctl = timer_get_ctl(timer_ctx);
> -
> - if (new_level)
> - ctl |= ARCH_TIMER_CTRL_IT_STAT;
> - else
> - ctl &= ~ARCH_TIMER_CTRL_IT_STAT;
> -
> - timer_set_ctl(timer_ctx, ctl);
> - }
> -
> timer_ctx->irq.level = new_level;
> trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_irq(timer_ctx),
> timer_ctx->irq.level);
> @@ -471,6 +452,19 @@ static void timer_emulate(struct arch_timer_context *ctx)
>
> trace_kvm_timer_emulate(ctx, should_fire);
>
> + /*
> + * Paper over NV2 brokenness by publishing the interrupt status
> + * bit. This still results in a poor quality of emulation (guest
> + * writes will have no effect until the next exit).
> + *
> + * But hey, it's fast, right?
> + */
> + if (is_hyp_ctxt(ctx->vcpu)) {
> + unsigned long val = timer_get_ctl(ctx);
> + __assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &val, should_fire);
> + timer_set_ctl(ctx, val);
> + }
> +
> if (should_fire != ctx->irq.level) {
> kvm_timer_update_irq(ctx->vcpu, should_fire, ctx);
> return;
--
Thanks,
Ganapat/GK
Hi,
On 12/9/24 16:39, Ganapatrao Kulkarni wrote:
>
>
> On 09-12-2024 06:50 pm, Marc Zyngier wrote:
>> On Mon, 09 Dec 2024 12:25:34 +0000,
>> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>>>>>
>>>>> During automated testing of Nested Virtualization using avocado-vt,
>>>>
>>>> Which is not merged upstream. So what branch are you using? Based on
>>>> what kernel version? On what HW? With which virtualisation features?
>>>>
>>>
>>> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel
>>> with NV patches from your repo.
>>
>> Grmbl... *Which* patches? At least give me the SHA1 of the branch,
>> because I have no idea what you are running. And 6.10 is definitely
>> not something I care about. If you're using the NV patches, the
>> *minimum* you should run is 6.13-rc1, because that's what the current
>> code is based on.
>>
>
> I tried 6.13-rc1 based nv-next branch today, which failed to boot
> UEFI as L1. Yet to debug this.
I confirm am stuck with the same issue with nv-next on AmpereOne.
Thanks
Eric
>
>
> QEMU from Eric's repo is used for the testing.
> https://github.com/eauger/qemu/tree/v9.0-nv-rfcv3
>
>> Also, does this machine have FEAT_ECV?
>
> Yes!
>>
>>>
>>>>> it has been observed that during some boot test iterations,
>>>>> the Guest-Hypervisor boot was getting crashed with a
>>>>> synchronous exception while it is still booting EDK2.
>>>>>
>>>>> The test is launching Multiple instances of Guest-Hypervisor boot
>>>>
>>>> Is the multiple instance aspect really relevant to the reproduction of
>>>> the problem?
>>>
>>> Not really, but it requires multiple attempts/iterations to hit the
>>> issue. Even with automated test, it was seen at some iteration out of
>>> 10 to 15 iterations.
>>>
>>>>
>>>>> and while booting, QEMU monitor issued the command "info register"
>>>>> at regular intervals to take a register dump. To execute this
>>>>> command, QEMU stops the run and does the register read of various
>>>>> registers. While resuming the run, the function kvm_arm_timer_write()
>>>>> writes back the saved CNTV_CTL_EL0 register with ISTATUS cleared
>>>>> always
>>>>
>>>> It is userspace that causes this write-back, right? AFAICT, KVM never
>>>> does that on its own.
>>>>
>>>>> and resulting in the loss of pending interrupt for emulated timers.
>>>>
>>>> How does a missing interrupt result in a synchronous exception in
>>>> EDK2? In my experience, EDK2 panics if it sees spurious interrupts,
>>>> not when it is missing interrupts (it just locks up, which is
>>>> expected).
>>>
>>> Not sure, why it is hitting exception, rather than hang at EDK2.
>>> However, EDK2 timer handler code is ignoring the interrupt since
>>> ISTATUS is not set and not moving CVAL forward.
>>
>> How is EDK2 getting this exception? Is this injected by KVM? Or is
>> that some EDK2 bug?
>>
>>>
>>>>
>>>>> In hardware based timers, ISTATUS is a RO/WI bit and gets set by the
>>>>> h/w, if the condition is still met. However, in Nested-Virtualization
>>>>> case, the Guest-Hypervisor's EDK2 is using an emulated virtual timer
>>>>> and losing ISTATUS state and the interrupt forever.
>>>>
>>>> Why is this specific to NV? Can't the same thing happen to the
>>>> physical timer in a non-VHE configuration?
>>>>
>>>
>>> You mean, emulated v-timer in non-VHE boot?
>>
>> Emulated *physical* timer.
>>
>>> It might impact non-VHE case as well, not tried though.
>>
>> Can you please try?
>
> Sure, I will try non-VHE as well.
>>
>> [...]
>>
>>>> But overall, this very much looks like it is only papering over the
>>>> real issue, which is that the *emulation* should regenerate the
>>>> pending bit, and not rely on the userspace interface.
>>>>
>>>> As far as I can tell, we already correctly compute the status bit on
>>>> read (read_timer_ctl()), so the guest should always observe something
>>>> consistent when it traps. We also *never* use the status bit as an
>>>> input to the emulation, and always recompute it from scratch (it is
>>>> only there for the benefit of the guest or userspace).
>>>>
>>>
>>> For emulated timers, we are not asserting again by calling
>>> kvm_timer_update_irq in timer_emulate() until the level is down and
>>> ready for trigger again. This was done to fix high rate of spurious
>>> interrupts getting generated to V-Timer. Hence we are not able to
>>> recover, if once ISTATUS is lost.
>>
>> Again, a trapping read should see the correct value, since we populate
>> that bit at read time.
>>
>>>> So I can't see how upstream is broken in at the moment, and you need
>>>> to explain how this actually triggers. Ideally, with a standalone
>>>> reproducer or a selftest.
>>>
>>> We could reproduce the issue with the simple test/script.
>>> On one shell, launch L1 using qemu with add-on option
>>>
>>> "-monitor unix:gh_monitor,server,nowait
>>>
>>> On another shell, while L1 boots and still in UEFI, run repeatedly the
>>> command (or put in a while 1 loop script)
>>>
>>> "echo "info registers" | socat - unix-connect:gh_monitor >
>>> /tmp/info_registers"
>>>
>>> With above steps we were able to hit the issue within few attempts.
>>
>> That's not a standalone reproducer. QEMU doesn't support NV, and
>> kvmtool doesn't have this sort of interface. I was asking for a bit of
>> C code that I could run directly, not something that requires me to
>> drag even more experimental code.
>>
>> So here's my current guess, since you don't give me the needed
>> information. For what you describe to happen, I can only see two
>> possibilities:
>>
>> - either your HW doesn't have FEAT_ECV, in which case the guest
>> directly reads from memory
>>
>
> We do have the FEAT_ECV on AmpereOne, I was the one reported/fixed bug
> with FEAT_ECV(CNTPOFF offset issue) in the past.
>
>> - or you are running with something like this patch [1], and we serve
>> the guest by reading from memory very early, without returning to
>> the bulk of the emulation code
>
> I see the kernel I am testing has this patch[1].
>>
>> In either case, we only publish the updated status if the current IRQ
>> state is different from the computed output of the timer while
>> performing the emulation.
>>
>> So if you were writing back a status bit set to 0 while the interrupt
>> was already pending, we'd deliver an interrupt, but not recompute the
>> status. The guest would consider the interrupt as spurious, not touch
>> the timer, and we'd never make forward progress. Rinse, repeat.
>>
>> Assuming I got the analysis right,
>
> Yes, this is what I tried to explain. LR shows pending, but UEFI is not
> consuming and treating it as spurious since ISTATUS is not set.
>
> it would only be a matter of
>> hoisting the publication of the status into timer_emulate(), so that
>> it is made up to date on load.
>>
>> Please give the fixup below a go.
>>
>
> Sure, I will give a try with below diff and let you know tomorrow.
> This should work, I remember, this was the one of the option/fix that we
> tried as fix while debugging.
>
>>
>> [1] https://lore.kernel.org/all/20241202172134.384923-6-maz@kernel.org/
>>
>> From 2bbd6f9b41a20ad573376c20c158ff3c12db5009 Mon Sep 17 00:00:00 2001
>> From: Marc Zyngier <maz@kernel.org>
>> Date: Mon, 9 Dec 2024 10:58:08 +0000
>> Subject: [PATCH] fixup! KVM: arm64: nv: Publish emulated timer
>> interrupt state
>> in the in-memory state
>>
>> ---
>> arch/arm64/kvm/arch_timer.c | 32 +++++++++++++-------------------
>> 1 file changed, 13 insertions(+), 19 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
>> index 895f09658ef83..91bda986c344b 100644
>> --- a/arch/arm64/kvm/arch_timer.c
>> +++ b/arch/arm64/kvm/arch_timer.c
>> @@ -432,25 +432,6 @@ static void kvm_timer_update_irq(struct kvm_vcpu
>> *vcpu, bool new_level,
>> {
>> int ret;
>> - /*
>> - * Paper over NV2 brokenness by publishing the interrupt status
>> - * bit. This still results in a poor quality of emulation (guest
>> - * writes will have no effect until the next exit).
>> - *
>> - * But hey, it's fast, right?
>> - */
>> - if (is_hyp_ctxt(vcpu) &&
>> - (timer_ctx == vcpu_vtimer(vcpu) || timer_ctx ==
>> vcpu_ptimer(vcpu))) {
>> - u32 ctl = timer_get_ctl(timer_ctx);
>> -
>> - if (new_level)
>> - ctl |= ARCH_TIMER_CTRL_IT_STAT;
>> - else
>> - ctl &= ~ARCH_TIMER_CTRL_IT_STAT;
>> -
>> - timer_set_ctl(timer_ctx, ctl);
>> - }
>> -
>> timer_ctx->irq.level = new_level;
>> trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_irq(timer_ctx),
>> timer_ctx->irq.level);
>> @@ -471,6 +452,19 @@ static void timer_emulate(struct
>> arch_timer_context *ctx)
>> trace_kvm_timer_emulate(ctx, should_fire);
>> + /*
>> + * Paper over NV2 brokenness by publishing the interrupt status
>> + * bit. This still results in a poor quality of emulation (guest
>> + * writes will have no effect until the next exit).
>> + *
>> + * But hey, it's fast, right?
>> + */
>> + if (is_hyp_ctxt(ctx->vcpu)) {
>> + unsigned long val = timer_get_ctl(ctx);
>> + __assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &val, should_fire);
>> + timer_set_ctl(ctx, val);
>> + }
>> +
>> if (should_fire != ctx->irq.level) {
>> kvm_timer_update_irq(ctx->vcpu, should_fire, ctx);
>> return;
>
On Mon, 09 Dec 2024 16:46:30 +0000, Eric Auger <eauger@redhat.com> wrote: > > Hi, > > On 12/9/24 16:39, Ganapatrao Kulkarni wrote: > > > > > > On 09-12-2024 06:50 pm, Marc Zyngier wrote: > >> On Mon, 09 Dec 2024 12:25:34 +0000, > >> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: > >>>>> > >>>>> During automated testing of Nested Virtualization using avocado-vt, > >>>> > >>>> Which is not merged upstream. So what branch are you using? Based on > >>>> what kernel version? On what HW? With which virtualisation features? > >>>> > >>> > >>> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel > >>> with NV patches from your repo. > >> > >> Grmbl... *Which* patches? At least give me the SHA1 of the branch, > >> because I have no idea what you are running. And 6.10 is definitely > >> not something I care about. If you're using the NV patches, the > >> *minimum* you should run is 6.13-rc1, because that's what the current > >> code is based on. > >> > > > > I tried 6.13-rc1 based nv-next branch today, which failed to boot > > UEFI as L1. Yet to debug this. > > I confirm am stuck with the same issue with nv-next on AmpereOne. All I can say is that it Works For Me (TM) on M2 and Snapdragon, using kvmtool, a fresh EDK2 built by Ard, and with the guest running purely VHE (FEAT_E2H0 not implemented). I haven't tried QEMU, I don't have an AmpereOne. I have also spent way too much time on this recently. So I'm afraid people interested in making this particular combination work will have to debug it. M. -- Without deviation from the norm, progress is not possible.
Hi Marc, On 12/9/24 18:30, Marc Zyngier wrote: > On Mon, 09 Dec 2024 16:46:30 +0000, > Eric Auger <eauger@redhat.com> wrote: >> >> Hi, >> >> On 12/9/24 16:39, Ganapatrao Kulkarni wrote: >>> >>> >>> On 09-12-2024 06:50 pm, Marc Zyngier wrote: >>>> On Mon, 09 Dec 2024 12:25:34 +0000, >>>> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: >>>>>>> >>>>>>> During automated testing of Nested Virtualization using avocado-vt, >>>>>> >>>>>> Which is not merged upstream. So what branch are you using? Based on >>>>>> what kernel version? On what HW? With which virtualisation features? >>>>>> >>>>> >>>>> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel >>>>> with NV patches from your repo. >>>> >>>> Grmbl... *Which* patches? At least give me the SHA1 of the branch, >>>> because I have no idea what you are running. And 6.10 is definitely >>>> not something I care about. If you're using the NV patches, the >>>> *minimum* you should run is 6.13-rc1, because that's what the current >>>> code is based on. >>>> >>> >>> I tried 6.13-rc1 based nv-next branch today, which failed to boot >>> UEFI as L1. Yet to debug this. >> >> I confirm am stuck with the same issue with nv-next on AmpereOne. > > All I can say is that it Works For Me (TM) on M2 and Snapdragon, using > kvmtool, a fresh EDK2 built by Ard, and with the guest running purely > VHE (FEAT_E2H0 not implemented). > > I haven't tried QEMU, I don't have an AmpereOne. I have also spent way > too much time on this recently. So I'm afraid people interested in > making this particular combination work will have to debug it. yes I do agree. I am currently setting up kvmtool to double check and I will debug the qemu issue. Thanks Eric > > M. >
On 09-12-2024 11:04 pm, Eric Auger wrote: > Hi Marc, > > On 12/9/24 18:30, Marc Zyngier wrote: >> On Mon, 09 Dec 2024 16:46:30 +0000, >> Eric Auger <eauger@redhat.com> wrote: >>> >>> Hi, >>> >>> On 12/9/24 16:39, Ganapatrao Kulkarni wrote: >>>> >>>> >>>> On 09-12-2024 06:50 pm, Marc Zyngier wrote: >>>>> On Mon, 09 Dec 2024 12:25:34 +0000, >>>>> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: >>>>>>>> >>>>>>>> During automated testing of Nested Virtualization using avocado-vt, >>>>>>> >>>>>>> Which is not merged upstream. So what branch are you using? Based on >>>>>>> what kernel version? On what HW? With which virtualisation features? >>>>>>> >>>>>> >>>>>> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel >>>>>> with NV patches from your repo. >>>>> >>>>> Grmbl... *Which* patches? At least give me the SHA1 of the branch, >>>>> because I have no idea what you are running. And 6.10 is definitely >>>>> not something I care about. If you're using the NV patches, the >>>>> *minimum* you should run is 6.13-rc1, because that's what the current >>>>> code is based on. >>>>> >>>> >>>> I tried 6.13-rc1 based nv-next branch today, which failed to boot >>>> UEFI as L1. Yet to debug this. >>> >>> I confirm am stuck with the same issue with nv-next on AmpereOne. >> >> All I can say is that it Works For Me (TM) on M2 and Snapdragon, using >> kvmtool, a fresh EDK2 built by Ard, and with the guest running purely >> VHE (FEAT_E2H0 not implemented). >> >> I haven't tried QEMU, I don't have an AmpereOne. I have also spent way >> too much time on this recently. So I'm afraid people interested in >> making this particular combination work will have to debug it. > > yes I do agree. I am currently setting up kvmtool to double check and I > will debug the qemu issue. I could boot Guest-hypervisor using QEMU with -bios and nv-next as L0 with edk2 (QEMU_EFI.fd) built from upstream tree and fedora-41 as Guest-Hypervisor. The issue was disappeared when upgraded from fc39 to fc41. Same was experienced with the kvmtool boot as well(KVMTOOL_EFI.fd). -- Thanks, Ganapat/GK
Hi Ganapatrao, Marc, On 12/19/24 10:30 AM, Ganapatrao Kulkarni wrote: > > > On 09-12-2024 11:04 pm, Eric Auger wrote: >> Hi Marc, >> >> On 12/9/24 18:30, Marc Zyngier wrote: >>> On Mon, 09 Dec 2024 16:46:30 +0000, >>> Eric Auger <eauger@redhat.com> wrote: >>>> >>>> Hi, >>>> >>>> On 12/9/24 16:39, Ganapatrao Kulkarni wrote: >>>>> >>>>> >>>>> On 09-12-2024 06:50 pm, Marc Zyngier wrote: >>>>>> On Mon, 09 Dec 2024 12:25:34 +0000, >>>>>> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: >>>>>>>>> >>>>>>>>> During automated testing of Nested Virtualization using >>>>>>>>> avocado-vt, >>>>>>>> >>>>>>>> Which is not merged upstream. So what branch are you using? >>>>>>>> Based on >>>>>>>> what kernel version? On what HW? With which virtualisation >>>>>>>> features? >>>>>>>> >>>>>>> >>>>>>> Testing is done on Ampere's AmpereOne platform using 6.10 based >>>>>>> kernel >>>>>>> with NV patches from your repo. >>>>>> >>>>>> Grmbl... *Which* patches? At least give me the SHA1 of the branch, >>>>>> because I have no idea what you are running. And 6.10 is definitely >>>>>> not something I care about. If you're using the NV patches, the >>>>>> *minimum* you should run is 6.13-rc1, because that's what the current >>>>>> code is based on. >>>>>> >>>>> >>>>> I tried 6.13-rc1 based nv-next branch today, which failed to boot >>>>> UEFI as L1. Yet to debug this. >>>> >>>> I confirm am stuck with the same issue with nv-next on AmpereOne. >>> >>> All I can say is that it Works For Me (TM) on M2 and Snapdragon, using >>> kvmtool, a fresh EDK2 built by Ard, and with the guest running purely >>> VHE (FEAT_E2H0 not implemented). >>> >>> I haven't tried QEMU, I don't have an AmpereOne. I have also spent way >>> too much time on this recently. So I'm afraid people interested in >>> making this particular combination work will have to debug it. >> >> yes I do agree. I am currently setting up kvmtool to double check and I >> will debug the qemu issue. > > I could boot Guest-hypervisor using QEMU with -bios and nv-next as L0 > with edk2 (QEMU_EFI.fd) built from upstream tree and fedora-41 as Guest- > Hypervisor. > > The issue was disappeared when upgraded from fc39 to fc41. Same was > experienced with the kvmtool boot as well(KVMTOOL_EFI.fd). I also confirm that using a mailine edk2 fixed the issues I faced previously (fed/rhel L1 guest not booting). I used Marc's nv-next branch and qemu rebase. With a rhel L1 guest I can now boot buildroot, debian and rhel guest as L2 (feat mainline edk2). For rhel, I tested different kinds of page size combinations for L1/L2 (4k and 64kB) and it worked. I tested on AmpereOne and Grace-Hopper systems Eric >
Hi Eric, On Tue, 14 Jan 2025 13:12:21 +0000, Eric Auger <eauger@redhat.com> wrote: > > I also confirm that using a mailine edk2 fixed the issues I faced > previously (fed/rhel L1 guest not booting). > > I used Marc's nv-next branch and qemu rebase. When did you sample this branch? It is almost daily rebased at the moment, given that we keep piling up things in kvmarm/next. > With a rhel L1 guest I can now boot buildroot, debian and rhel guest as > L2 (feat mainline edk2). For rhel, I tested different kinds of page size > combinations for L1/L2 (4k and 64kB) and it worked. > > I tested on AmpereOne and Grace-Hopper systems Thanks for the confirmation. I haven't had a chance to try QEMU yet, but I expect that save/restore will not work. Something to look into, I guess. M. -- Without deviation from the norm, progress is not possible.
Hi Marc, On 1/14/25 3:38 PM, Marc Zyngier wrote: > Hi Eric, > > On Tue, 14 Jan 2025 13:12:21 +0000, > Eric Auger <eauger@redhat.com> wrote: >> >> I also confirm that using a mailine edk2 fixed the issues I faced >> previously (fed/rhel L1 guest not booting). >> >> I used Marc's nv-next branch and qemu rebase. > > When did you sample this branch? It is almost daily rebased at the > moment, given that we keep piling up things in kvmarm/next. 5 days ago. If you want me to test a specific branch, please let me know, esp. in the context of latest series including [PATCH v2 00/17] KVM: arm64: Add NV GICv3 support > >> With a rhel L1 guest I can now boot buildroot, debian and rhel guest as >> L2 (feat mainline edk2). For rhel, I tested different kinds of page size >> combinations for L1/L2 (4k and 64kB) and it worked. >> >> I tested on AmpereOne and Grace-Hopper systems > > Thanks for the confirmation. I haven't had a chance to try QEMU yet, > but I expect that save/restore will not work. Something to look into, I have not tested this yet.I will give it a try. Eric > I guess. > > M. >
On Tue, 14 Jan 2025 14:57:43 +0000, Eric Auger <eauger@redhat.com> wrote: > > Hi Marc, > > On 1/14/25 3:38 PM, Marc Zyngier wrote: > > Hi Eric, > > > > On Tue, 14 Jan 2025 13:12:21 +0000, > > Eric Auger <eauger@redhat.com> wrote: > >> > >> I also confirm that using a mailine edk2 fixed the issues I faced > >> previously (fed/rhel L1 guest not booting). > >> > >> I used Marc's nv-next branch and qemu rebase. > > > > When did you sample this branch? It is almost daily rebased at the > > moment, given that we keep piling up things in kvmarm/next. > 5 days ago. OK, so probably before I reapplied everything on top of kvmarm/next. Not necessarily a bad thing, just slightly older. The core NV code isn't rapidly changing anymore, with the exception of the recursive nesting stuff that I am currently rewriting. > > If you want me to test a specific branch, please let me know, esp. in > the context of latest series including > > [PATCH v2 00/17] KVM: arm64: Add NV GICv3 support That'd be the current nv-next then, which has all the series currently on the list, and a few more. > > > >> With a rhel L1 guest I can now boot buildroot, debian and rhel guest as > >> L2 (feat mainline edk2). For rhel, I tested different kinds of page size > >> combinations for L1/L2 (4k and 64kB) and it worked. > >> > >> I tested on AmpereOne and Grace-Hopper systems > > > > Thanks for the confirmation. I haven't had a chance to try QEMU yet, > > but I expect that save/restore will not work. Something to look into, > I have not tested this yet.I will give it a try. Thanks, that'd be super helpful. M. -- Without deviation from the norm, progress is not possible.
Hi Marc, On 1/14/25 4:52 PM, Marc Zyngier wrote: > On Tue, 14 Jan 2025 14:57:43 +0000, > Eric Auger <eauger@redhat.com> wrote: >> >> Hi Marc, >> >> On 1/14/25 3:38 PM, Marc Zyngier wrote: >>> Hi Eric, >>> >>> On Tue, 14 Jan 2025 13:12:21 +0000, >>> Eric Auger <eauger@redhat.com> wrote: >>>> >>>> I also confirm that using a mailine edk2 fixed the issues I faced >>>> previously (fed/rhel L1 guest not booting). >>>> >>>> I used Marc's nv-next branch and qemu rebase. >>> >>> When did you sample this branch? It is almost daily rebased at the >>> moment, given that we keep piling up things in kvmarm/next. >> 5 days ago. > > OK, so probably before I reapplied everything on top of kvmarm/next. > Not necessarily a bad thing, just slightly older. The core NV code > isn't rapidly changing anymore, with the exception of the recursive > nesting stuff that I am currently rewriting. > >> >> If you want me to test a specific branch, please let me know, esp. in >> the context of latest series including >> >> [PATCH v2 00/17] KVM: arm64: Add NV GICv3 support > > That'd be the current nv-next then, which has all the series currently > on the list, and a few more. > >>> >>>> With a rhel L1 guest I can now boot buildroot, debian and rhel guest as >>>> L2 (feat mainline edk2). For rhel, I tested different kinds of page size >>>> combinations for L1/L2 (4k and 64kB) and it worked. >>>> >>>> I tested on AmpereOne and Grace-Hopper systems >>> >>> Thanks for the confirmation. I haven't had a chance to try QEMU yet, >>> but I expect that save/restore will not work. Something to look into, >> I have not tested this yet.I will give it a try. > > Thanks, that'd be super helpful. I confirm the migration fails when putting some registers on the dest side. I will further investigate. Eric > > M. >
On Thu, 16 Jan 2025 17:52:10 +0000, Eric Auger <eauger@redhat.com> wrote: > > Hi Marc, > > On 1/14/25 4:52 PM, Marc Zyngier wrote: > > On Tue, 14 Jan 2025 14:57:43 +0000, > > Eric Auger <eauger@redhat.com> wrote: > >> > >> Hi Marc, > >> > >> On 1/14/25 3:38 PM, Marc Zyngier wrote: > >>> Hi Eric, > >>> > >>> On Tue, 14 Jan 2025 13:12:21 +0000, > >>> Eric Auger <eauger@redhat.com> wrote: > >>>> > >>>> I also confirm that using a mailine edk2 fixed the issues I faced > >>>> previously (fed/rhel L1 guest not booting). > >>>> > >>>> I used Marc's nv-next branch and qemu rebase. > >>> > >>> When did you sample this branch? It is almost daily rebased at the > >>> moment, given that we keep piling up things in kvmarm/next. > >> 5 days ago. > > > > OK, so probably before I reapplied everything on top of kvmarm/next. > > Not necessarily a bad thing, just slightly older. The core NV code > > isn't rapidly changing anymore, with the exception of the recursive > > nesting stuff that I am currently rewriting. > > > >> > >> If you want me to test a specific branch, please let me know, esp. in > >> the context of latest series including > >> > >> [PATCH v2 00/17] KVM: arm64: Add NV GICv3 support > > > > That'd be the current nv-next then, which has all the series currently > > on the list, and a few more. > > > >>> > >>>> With a rhel L1 guest I can now boot buildroot, debian and rhel guest as > >>>> L2 (feat mainline edk2). For rhel, I tested different kinds of page size > >>>> combinations for L1/L2 (4k and 64kB) and it worked. > >>>> > >>>> I tested on AmpereOne and Grace-Hopper systems > >>> > >>> Thanks for the confirmation. I haven't had a chance to try QEMU yet, > >>> but I expect that save/restore will not work. Something to look into, > >> I have not tested this yet.I will give it a try. > > > > Thanks, that'd be super helpful. > > I confirm the migration fails when putting some registers on the dest > side. I will further investigate. I found at least one issue that could fail the migration. Before the VM starts running, we limit the feature set to the subset we actually support with NV. By doing this, we also change the value of IDreg fields that are not writable, because they describe features that we don't support. Obviously, that fails on restore. I need to have a think... M. -- Without deviation from the norm, progress is not possible.
Hi Marc, On 2/7/25 6:45 PM, Marc Zyngier wrote: > On Thu, 16 Jan 2025 17:52:10 +0000, > Eric Auger <eauger@redhat.com> wrote: >> >> Hi Marc, >> >> On 1/14/25 4:52 PM, Marc Zyngier wrote: >>> On Tue, 14 Jan 2025 14:57:43 +0000, >>> Eric Auger <eauger@redhat.com> wrote: >>>> >>>> Hi Marc, >>>> >>>> On 1/14/25 3:38 PM, Marc Zyngier wrote: >>>>> Hi Eric, >>>>> >>>>> On Tue, 14 Jan 2025 13:12:21 +0000, >>>>> Eric Auger <eauger@redhat.com> wrote: >>>>>> >>>>>> I also confirm that using a mailine edk2 fixed the issues I faced >>>>>> previously (fed/rhel L1 guest not booting). >>>>>> >>>>>> I used Marc's nv-next branch and qemu rebase. >>>>> >>>>> When did you sample this branch? It is almost daily rebased at the >>>>> moment, given that we keep piling up things in kvmarm/next. >>>> 5 days ago. >>> >>> OK, so probably before I reapplied everything on top of kvmarm/next. >>> Not necessarily a bad thing, just slightly older. The core NV code >>> isn't rapidly changing anymore, with the exception of the recursive >>> nesting stuff that I am currently rewriting. >>> >>>> >>>> If you want me to test a specific branch, please let me know, esp. in >>>> the context of latest series including >>>> >>>> [PATCH v2 00/17] KVM: arm64: Add NV GICv3 support >>> >>> That'd be the current nv-next then, which has all the series currently >>> on the list, and a few more. >>> >>>>> >>>>>> With a rhel L1 guest I can now boot buildroot, debian and rhel guest as >>>>>> L2 (feat mainline edk2). For rhel, I tested different kinds of page size >>>>>> combinations for L1/L2 (4k and 64kB) and it worked. >>>>>> >>>>>> I tested on AmpereOne and Grace-Hopper systems >>>>> >>>>> Thanks for the confirmation. I haven't had a chance to try QEMU yet, >>>>> but I expect that save/restore will not work. Something to look into, >>>> I have not tested this yet.I will give it a try. >>> >>> Thanks, that'd be super helpful. >> >> I confirm the migration fails when putting some registers on the dest >> side. I will further investigate. > > I found at least one issue that could fail the migration. Before the > VM starts running, we limit the feature set to the subset we actually > support with NV. > > By doing this, we also change the value of IDreg fields that are not > writable, because they describe features that we don't support. > Obviously, that fails on restore. At first sight the registers we fail to put are - ID_AA64PFR0_EL1 - ID_AA64DFR0_EL1 - ID_AA64MMFR1_EL1 I will add some traces in the kernel to figure out which fields cause issues Eric > > I need to have a think... > > M. >
Hey, On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote: > I found at least one issue that could fail the migration. Before the > VM starts running, we limit the feature set to the subset we actually > support with NV. > > By doing this, we also change the value of IDreg fields that are not > writable, because they describe features that we don't support. > Obviously, that fails on restore. > > I need to have a think... We spoke about this a while ago (and I forgot til now), but I was wondering if we could use vCPU feature flags to describe NV, including the selection between FEAT_E2H0 and FEAT_VHE. I think this might match userspace expectations a bit more closely where the state of the ID registers after init gives the actual feature set supported by the VM. -- Thanks, Oliver
On Fri, 07 Feb 2025 18:09:58 +0000, Oliver Upton <oliver.upton@linux.dev> wrote: > > Hey, > > On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote: > > I found at least one issue that could fail the migration. Before the > > VM starts running, we limit the feature set to the subset we actually > > support with NV. > > > > By doing this, we also change the value of IDreg fields that are not > > writable, because they describe features that we don't support. > > Obviously, that fails on restore. > > > > I need to have a think... > > We spoke about this a while ago (and I forgot til now), but I was > wondering if we could use vCPU feature flags to describe NV, including > the selection between FEAT_E2H0 and FEAT_VHE. > > I think this might match userspace expectations a bit more closely where > the state of the ID registers after init gives the actual feature set > supported by the VM. I'm not sure that's enough. Let me give you an example: My host has FEAT_XNX, described in ID_AA64MMFR1_EL1.XNX. For whatever reason, we don't allow this field to be written to, even out of NV context. This is odd, because for an EL1 VM, this field means nothing at all. However, we don't yet support FEAT_XNX with NV (it requires some extra surgery in the S2 walker, in the S2 shadowing code and in AT). How would you manage this field if you had a vcpu flag saying E2H0 or not? I don't think it helps, at least not in that particular case. It may help for some things, but not all. It feels that we need to define the field limit (and what is writable or not) based on the NV/!NV support, and maybe E2H0/VHE as well. This is getting complicated... M. -- Without deviation from the norm, progress is not possible.
Hi Marc, On 2/7/25 7:38 PM, Marc Zyngier wrote: > On Fri, 07 Feb 2025 18:09:58 +0000, > Oliver Upton <oliver.upton@linux.dev> wrote: >> >> Hey, >> >> On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote: >>> I found at least one issue that could fail the migration. Before the >>> VM starts running, we limit the feature set to the subset we actually >>> support with NV. >>> >>> By doing this, we also change the value of IDreg fields that are not >>> writable, because they describe features that we don't support. >>> Obviously, that fails on restore. >>> >>> I need to have a think... >> >> We spoke about this a while ago (and I forgot til now), but I was >> wondering if we could use vCPU feature flags to describe NV, including >> the selection between FEAT_E2H0 and FEAT_VHE. >> >> I think this might match userspace expectations a bit more closely where >> the state of the ID registers after init gives the actual feature set >> supported by the VM. > > I'm not sure that's enough. Let me give you an example: > > My host has FEAT_XNX, described in ID_AA64MMFR1_EL1.XNX. For whatever > reason, we don't allow this field to be written to, even out of NV > context. This is odd, because for an EL1 VM, this field means nothing > at all. So the curprit fields for me look like - ID_AA64MMFR1_EL1.XNX - ID_AA64DFR0_EL1.DoubleLock - ID_AA64PFR0_EL1.RAS This is still based on your nv-next branch from Jan 9 https://github.com/eauger/linux/tree/nv_next_jan9_2025 Eric > > However, we don't yet support FEAT_XNX with NV (it requires some extra > surgery in the S2 walker, in the S2 shadowing code and in AT). > > How would you manage this field if you had a vcpu flag saying E2H0 or > not? I don't think it helps, at least not in that particular case. It > may help for some things, but not all. > > It feels that we need to define the field limit (and what is writable > or not) based on the NV/!NV support, and maybe E2H0/VHE as well. > > This is getting complicated... > > M. >
On Mon, 10 Feb 2025 18:26:48 +0000, Eric Auger <eauger@redhat.com> wrote: > > Hi Marc, > > On 2/7/25 7:38 PM, Marc Zyngier wrote: > > On Fri, 07 Feb 2025 18:09:58 +0000, > > Oliver Upton <oliver.upton@linux.dev> wrote: > >> > >> Hey, > >> > >> On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote: > >>> I found at least one issue that could fail the migration. Before the > >>> VM starts running, we limit the feature set to the subset we actually > >>> support with NV. > >>> > >>> By doing this, we also change the value of IDreg fields that are not > >>> writable, because they describe features that we don't support. > >>> Obviously, that fails on restore. > >>> > >>> I need to have a think... > >> > >> We spoke about this a while ago (and I forgot til now), but I was > >> wondering if we could use vCPU feature flags to describe NV, including > >> the selection between FEAT_E2H0 and FEAT_VHE. > >> > >> I think this might match userspace expectations a bit more closely where > >> the state of the ID registers after init gives the actual feature set > >> supported by the VM. > > > > I'm not sure that's enough. Let me give you an example: > > > > My host has FEAT_XNX, described in ID_AA64MMFR1_EL1.XNX. For whatever > > reason, we don't allow this field to be written to, even out of NV > > context. This is odd, because for an EL1 VM, this field means nothing > > at all. > So the curprit fields for me look like > > - ID_AA64MMFR1_EL1.XNX > - ID_AA64DFR0_EL1.DoubleLock > - ID_AA64PFR0_EL1.RAS > > This is still based on your nv-next branch from Jan 9 > https://github.com/eauger/linux/tree/nv_next_jan9_2025 I have now pushed out a new nv-next branch with the new and improved UAPI. I expect migration to work a bit better, or at least not to explode on ID register restore. You will notice that things have changed a bit (extra flag and cap for FEAT_E2H0), but nothing really major. Thanks, M. -- Without deviation from the norm, progress is not possible.
Hi Marc,
On 15-02-2025 11:20 pm, Marc Zyngier wrote:
> On Mon, 10 Feb 2025 18:26:48 +0000,
> Eric Auger <eauger@redhat.com> wrote:
>>
>> Hi Marc,
>>
>> On 2/7/25 7:38 PM, Marc Zyngier wrote:
>>> On Fri, 07 Feb 2025 18:09:58 +0000,
>>> Oliver Upton <oliver.upton@linux.dev> wrote:
>>>>
>>>> Hey,
>>>>
>>>> On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote:
>>>>> I found at least one issue that could fail the migration. Before the
>>>>> VM starts running, we limit the feature set to the subset we actually
>>>>> support with NV.
>>>>>
>>>>> By doing this, we also change the value of IDreg fields that are not
>>>>> writable, because they describe features that we don't support.
>>>>> Obviously, that fails on restore.
>>>>>
>>>>> I need to have a think...
>>>>
>>>> We spoke about this a while ago (and I forgot til now), but I was
>>>> wondering if we could use vCPU feature flags to describe NV, including
>>>> the selection between FEAT_E2H0 and FEAT_VHE.
>>>>
>>>> I think this might match userspace expectations a bit more closely where
>>>> the state of the ID registers after init gives the actual feature set
>>>> supported by the VM.
>>>
>>> I'm not sure that's enough. Let me give you an example:
>>>
>>> My host has FEAT_XNX, described in ID_AA64MMFR1_EL1.XNX. For whatever
>>> reason, we don't allow this field to be written to, even out of NV
>>> context. This is odd, because for an EL1 VM, this field means nothing
>>> at all.
>> So the curprit fields for me look like
>>
>> - ID_AA64MMFR1_EL1.XNX
>> - ID_AA64DFR0_EL1.DoubleLock
>> - ID_AA64PFR0_EL1.RAS
>>
>> This is still based on your nv-next branch from Jan 9
>> https://github.com/eauger/linux/tree/nv_next_jan9_2025
>
> I have now pushed out a new nv-next branch with the new and improved
> UAPI. I expect migration to work a bit better, or at least not to
> explode on ID register restore. You will notice that things have
> changed a bit (extra flag and cap for FEAT_E2H0), but nothing really
> major.
>
Tried nv-next branch and it is breaking(kernel Oops) for normal VM boot
itself with qemu. Looks like this is happening since qemu is trying to
write to ID_UNALLOCATED mapped registers as part of save-restore of
registers.
Below diff fixes the issue,
[root@sut08sys-r112 arm-platforms]# git diff arch/arm64/kvm/sys_regs.c
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index e6f4599dca48..9459d25d4902 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2310,6 +2310,7 @@ static bool bad_redir_trap(struct kvm_vcpu *vcpu,
.get_user = get_id_reg, \
.set_user = set_id_reg, \
.visibility = raz_visibility, \
+ .reset = kvm_read_sanitised_id_reg, \
.val = 0, \
}
#define ID_UNALLOCATED(crm, op2) { \
Op0(3), Op1(0), CRn(0), CRm(crm), Op2(op2), \
.access = access_id_reg, \
.get_user = get_id_reg, \
.set_user = set_id_reg, \
.visibility = raz_visibility, \
.reset = kvm_read_sanitised_id_reg, \
.val = 0, \
}
--
Thanks,
Ganapat/GK
On Tue, 18 Feb 2025 07:33:11 +0000,
Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>
>
> Hi Marc,
>
> On 15-02-2025 11:20 pm, Marc Zyngier wrote:
> > On Mon, 10 Feb 2025 18:26:48 +0000,
> > Eric Auger <eauger@redhat.com> wrote:
> >>
> >> Hi Marc,
> >>
> >> On 2/7/25 7:38 PM, Marc Zyngier wrote:
> >>> On Fri, 07 Feb 2025 18:09:58 +0000,
> >>> Oliver Upton <oliver.upton@linux.dev> wrote:
> >>>>
> >>>> Hey,
> >>>>
> >>>> On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote:
> >>>>> I found at least one issue that could fail the migration. Before the
> >>>>> VM starts running, we limit the feature set to the subset we actually
> >>>>> support with NV.
> >>>>>
> >>>>> By doing this, we also change the value of IDreg fields that are not
> >>>>> writable, because they describe features that we don't support.
> >>>>> Obviously, that fails on restore.
> >>>>>
> >>>>> I need to have a think...
> >>>>
> >>>> We spoke about this a while ago (and I forgot til now), but I was
> >>>> wondering if we could use vCPU feature flags to describe NV, including
> >>>> the selection between FEAT_E2H0 and FEAT_VHE.
> >>>>
> >>>> I think this might match userspace expectations a bit more closely where
> >>>> the state of the ID registers after init gives the actual feature set
> >>>> supported by the VM.
> >>>
> >>> I'm not sure that's enough. Let me give you an example:
> >>>
> >>> My host has FEAT_XNX, described in ID_AA64MMFR1_EL1.XNX. For whatever
> >>> reason, we don't allow this field to be written to, even out of NV
> >>> context. This is odd, because for an EL1 VM, this field means nothing
> >>> at all.
> >> So the curprit fields for me look like
> >>
> >> - ID_AA64MMFR1_EL1.XNX
> >> - ID_AA64DFR0_EL1.DoubleLock
> >> - ID_AA64PFR0_EL1.RAS
> >>
> >> This is still based on your nv-next branch from Jan 9
> >> https://github.com/eauger/linux/tree/nv_next_jan9_2025
> >
> > I have now pushed out a new nv-next branch with the new and improved
> > UAPI. I expect migration to work a bit better, or at least not to
> > explode on ID register restore. You will notice that things have
> > changed a bit (extra flag and cap for FEAT_E2H0), but nothing really
> > major.
> >
>
> Tried nv-next branch and it is breaking(kernel Oops) for normal VM
> boot itself with qemu. Looks like this is happening since qemu is
> trying to write to ID_UNALLOCATED mapped registers as part of
> save-restore of registers.
My take on this problem ends up being more consolidation, and make
sure that the individual macros only override the default callbacks
for idregs.
Additionally, ID_UNALLOCATED gets a name matching the architectural
encoding.
M.
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index e6f4599dca48e..2e14562b5841f 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2261,24 +2261,26 @@ static bool bad_redir_trap(struct kvm_vcpu *vcpu,
* from userspace.
*/
-#define ID_DESC(name) \
- SYS_DESC(SYS_##name), \
+#define ID_DESC_DEFAULT_CALLBACKS \
.access = access_id_reg, \
.get_user = get_id_reg, \
+ .set_user = set_id_reg, \
+ .visibility = id_visibility, \
.reset = kvm_read_sanitised_id_reg
+#define ID_DESC(name) \
+ SYS_DESC(SYS_##name), \
+ ID_DESC_DEFAULT_CALLBACKS
+
/* sys_reg_desc initialiser for known cpufeature ID registers */
#define ID_SANITISED(name) { \
ID_DESC(name), \
- .set_user = set_id_reg, \
- .visibility = id_visibility, \
.val = 0, \
}
/* sys_reg_desc initialiser for known cpufeature ID registers */
#define AA32_ID_SANITISED(name) { \
ID_DESC(name), \
- .set_user = set_id_reg, \
.visibility = aa32_id_visibility, \
.val = 0, \
}
@@ -2286,8 +2288,6 @@ static bool bad_redir_trap(struct kvm_vcpu *vcpu,
/* sys_reg_desc initialiser for writable ID registers */
#define ID_WRITABLE(name, mask) { \
ID_DESC(name), \
- .set_user = set_id_reg, \
- .visibility = id_visibility, \
.val = mask, \
}
@@ -2295,7 +2295,6 @@ static bool bad_redir_trap(struct kvm_vcpu *vcpu,
#define ID_FILTERED(sysreg, name, mask) { \
ID_DESC(sysreg), \
.set_user = set_##name, \
- .visibility = id_visibility, \
.val = (mask), \
}
@@ -2305,10 +2304,9 @@ static bool bad_redir_trap(struct kvm_vcpu *vcpu,
* (1 <= crm < 8, 0 <= Op2 < 8).
*/
#define ID_UNALLOCATED(crm, op2) { \
+ .name = "S3_0_0_" #crm "_" #op2, \
Op0(3), Op1(0), CRn(0), CRm(crm), Op2(op2), \
- .access = access_id_reg, \
- .get_user = get_id_reg, \
- .set_user = set_id_reg, \
+ ID_DESC_DEFAULT_CALLBACKS, \
.visibility = raz_visibility, \
.val = 0, \
}
@@ -2320,7 +2318,6 @@ static bool bad_redir_trap(struct kvm_vcpu *vcpu,
*/
#define ID_HIDDEN(name) { \
ID_DESC(name), \
- .set_user = set_id_reg, \
.visibility = raz_visibility, \
.val = 0, \
}
--
Without deviation from the norm, progress is not possible.
On 19-02-2025 02:54 am, Marc Zyngier wrote: > On Tue, 18 Feb 2025 07:33:11 +0000, > Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: >> >> >> Hi Marc, >> >> On 15-02-2025 11:20 pm, Marc Zyngier wrote: >>> On Mon, 10 Feb 2025 18:26:48 +0000, >>> Eric Auger <eauger@redhat.com> wrote: >>>> >>>> Hi Marc, >>>> >>>> On 2/7/25 7:38 PM, Marc Zyngier wrote: >>>>> On Fri, 07 Feb 2025 18:09:58 +0000, >>>>> Oliver Upton <oliver.upton@linux.dev> wrote: >>>>>> >>>>>> Hey, >>>>>> >>>>>> On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote: >>>>>>> I found at least one issue that could fail the migration. Before the >>>>>>> VM starts running, we limit the feature set to the subset we actually >>>>>>> support with NV. >>>>>>> >>>>>>> By doing this, we also change the value of IDreg fields that are not >>>>>>> writable, because they describe features that we don't support. >>>>>>> Obviously, that fails on restore. >>>>>>> >>>>>>> I need to have a think... >>>>>> >>>>>> We spoke about this a while ago (and I forgot til now), but I was >>>>>> wondering if we could use vCPU feature flags to describe NV, including >>>>>> the selection between FEAT_E2H0 and FEAT_VHE. >>>>>> >>>>>> I think this might match userspace expectations a bit more closely where >>>>>> the state of the ID registers after init gives the actual feature set >>>>>> supported by the VM. >>>>> >>>>> I'm not sure that's enough. Let me give you an example: >>>>> >>>>> My host has FEAT_XNX, described in ID_AA64MMFR1_EL1.XNX. For whatever >>>>> reason, we don't allow this field to be written to, even out of NV >>>>> context. This is odd, because for an EL1 VM, this field means nothing >>>>> at all. >>>> So the curprit fields for me look like >>>> >>>> - ID_AA64MMFR1_EL1.XNX >>>> - ID_AA64DFR0_EL1.DoubleLock >>>> - ID_AA64PFR0_EL1.RAS >>>> >>>> This is still based on your nv-next branch from Jan 9 >>>> https://github.com/eauger/linux/tree/nv_next_jan9_2025 >>> >>> I have now pushed out a new nv-next branch with the new and improved >>> UAPI. I expect migration to work a bit better, or at least not to >>> explode on ID register restore. You will notice that things have >>> changed a bit (extra flag and cap for FEAT_E2H0), but nothing really >>> major. >>> >> >> Tried nv-next branch and it is breaking(kernel Oops) for normal VM >> boot itself with qemu. Looks like this is happening since qemu is >> trying to write to ID_UNALLOCATED mapped registers as part of >> save-restore of registers. > > My take on this problem ends up being more consolidation, and make > sure that the individual macros only override the default callbacks > for idregs. > > Additionally, ID_UNALLOCATED gets a name matching the architectural > encoding. > Thanks, having name helps future debugging. While I was debugging, the register name was showing as null and I have to map id to encodings. I have rebased to nv-next and looks all good!. -- Thanks, Ganapat/GK
On Tue, 18 Feb 2025 07:33:11 +0000, Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: > > > Hi Marc, > > On 15-02-2025 11:20 pm, Marc Zyngier wrote: > > On Mon, 10 Feb 2025 18:26:48 +0000, > > Eric Auger <eauger@redhat.com> wrote: > >> > >> Hi Marc, > >> > >> On 2/7/25 7:38 PM, Marc Zyngier wrote: > >>> On Fri, 07 Feb 2025 18:09:58 +0000, > >>> Oliver Upton <oliver.upton@linux.dev> wrote: > >>>> > >>>> Hey, > >>>> > >>>> On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote: > >>>>> I found at least one issue that could fail the migration. Before the > >>>>> VM starts running, we limit the feature set to the subset we actually > >>>>> support with NV. > >>>>> > >>>>> By doing this, we also change the value of IDreg fields that are not > >>>>> writable, because they describe features that we don't support. > >>>>> Obviously, that fails on restore. > >>>>> > >>>>> I need to have a think... > >>>> > >>>> We spoke about this a while ago (and I forgot til now), but I was > >>>> wondering if we could use vCPU feature flags to describe NV, including > >>>> the selection between FEAT_E2H0 and FEAT_VHE. > >>>> > >>>> I think this might match userspace expectations a bit more closely where > >>>> the state of the ID registers after init gives the actual feature set > >>>> supported by the VM. > >>> > >>> I'm not sure that's enough. Let me give you an example: > >>> > >>> My host has FEAT_XNX, described in ID_AA64MMFR1_EL1.XNX. For whatever > >>> reason, we don't allow this field to be written to, even out of NV > >>> context. This is odd, because for an EL1 VM, this field means nothing > >>> at all. > >> So the curprit fields for me look like > >> > >> - ID_AA64MMFR1_EL1.XNX > >> - ID_AA64DFR0_EL1.DoubleLock > >> - ID_AA64PFR0_EL1.RAS > >> > >> This is still based on your nv-next branch from Jan 9 > >> https://github.com/eauger/linux/tree/nv_next_jan9_2025 > > > > I have now pushed out a new nv-next branch with the new and improved > > UAPI. I expect migration to work a bit better, or at least not to > > explode on ID register restore. You will notice that things have > > changed a bit (extra flag and cap for FEAT_E2H0), but nothing really > > major. > > > > Tried nv-next branch and it is breaking(kernel Oops) for normal VM > boot itself with qemu. Looks like this is happening since qemu is > trying to write to ID_UNALLOCATED mapped registers as part of > save-restore of registers. Yeah, ID_UNALLOCATED is in pretty bad shape overall. For start, it doesn't even have a name associated to it, and then everything else falls apart. I'll rework it shortly, thanks for the heads up. M. -- Without deviation from the norm, progress is not possible.
On Mon, 10 Feb 2025 18:26:48 +0000, Eric Auger <eauger@redhat.com> wrote: > > Hi Marc, > > On 2/7/25 7:38 PM, Marc Zyngier wrote: > > On Fri, 07 Feb 2025 18:09:58 +0000, > > Oliver Upton <oliver.upton@linux.dev> wrote: > >> > >> Hey, > >> > >> On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote: > >>> I found at least one issue that could fail the migration. Before the > >>> VM starts running, we limit the feature set to the subset we actually > >>> support with NV. > >>> > >>> By doing this, we also change the value of IDreg fields that are not > >>> writable, because they describe features that we don't support. > >>> Obviously, that fails on restore. > >>> > >>> I need to have a think... > >> > >> We spoke about this a while ago (and I forgot til now), but I was > >> wondering if we could use vCPU feature flags to describe NV, including > >> the selection between FEAT_E2H0 and FEAT_VHE. > >> > >> I think this might match userspace expectations a bit more closely where > >> the state of the ID registers after init gives the actual feature set > >> supported by the VM. > > > > I'm not sure that's enough. Let me give you an example: > > > > My host has FEAT_XNX, described in ID_AA64MMFR1_EL1.XNX. For whatever > > reason, we don't allow this field to be written to, even out of NV > > context. This is odd, because for an EL1 VM, this field means nothing > > at all. > So the curprit fields for me look like > > - ID_AA64MMFR1_EL1.XNX > - ID_AA64DFR0_EL1.DoubleLock > - ID_AA64PFR0_EL1.RAS Right, that more or less matches what I see locally. I adopted the following scheme: - On top of the existing KVM_ARM_VCPU_HAS_EL2, we have a new KVM_ARM_VCPU_HAS_EL2_E2H0, which is only valid when the former is also set, and force the whole VM in E2H==0 mode (that's Oliver's idea above) - The NV view of the ID registers is enforced at the point where we compute the limit values via the ID reg reset helper, instead of the post-init repainting - Neither of ID_AA64MMFR1_EL1.VH or ID_AA64MMFR4_EL1.E2H0 are writable at all - Only ID_AA64MMFR4_EL1.NV_frac is writable The result is fairly small, but is of course quite an ABI breakage, and I'm more than happy not to have merged it sooner. Maybe I should wait another couple of years! ;-) > This is still based on your nv-next branch from Jan 9 > https://github.com/eauger/linux/tree/nv_next_jan9_2025 Yeah, that's getting old now, specially given the nature of the bug fixes that have been added. I'll try to rebase everything on -rc2 with the latest fixes and the new ABI by the end of the week, and we will take it from there. Thanks, M. -- Without deviation from the norm, progress is not possible.
On Fri, Feb 07, 2025 at 06:38:22PM +0000, Marc Zyngier wrote: > On Fri, 07 Feb 2025 18:09:58 +0000, > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > Hey, > > > > On Fri, Feb 07, 2025 at 05:45:33PM +0000, Marc Zyngier wrote: > > > I found at least one issue that could fail the migration. Before the > > > VM starts running, we limit the feature set to the subset we actually > > > support with NV. > > > > > > By doing this, we also change the value of IDreg fields that are not > > > writable, because they describe features that we don't support. > > > Obviously, that fails on restore. > > > > > > I need to have a think... > > > > We spoke about this a while ago (and I forgot til now), but I was > > wondering if we could use vCPU feature flags to describe NV, including > > the selection between FEAT_E2H0 and FEAT_VHE. > > > > I think this might match userspace expectations a bit more closely where > > the state of the ID registers after init gives the actual feature set > > supported by the VM. > > I'm not sure that's enough. Let me give you an example: > > My host has FEAT_XNX, described in ID_AA64MMFR1_EL1.XNX. For whatever > reason, we don't allow this field to be written to, even out of NV > context. This is odd, because for an EL1 VM, this field means nothing > at all. > > However, we don't yet support FEAT_XNX with NV (it requires some extra > surgery in the S2 walker, in the S2 shadowing code and in AT). > > How would you manage this field if you had a vcpu flag saying E2H0 or > not? I don't think it helps, at least not in that particular case. It > may help for some things, but not all. > > It feels that we need to define the field limit (and what is writable > or not) based on the NV/!NV support, and maybe E2H0/VHE as well. I think we're both aiming in the right direction, I just didn't make my point clear. The ID registers we make visible to userspace after KVM_ARM_VCPU_INIT *must* describe a feature set we will actually honor for the VM. No late masking of features (i.e. first KVM_RUN), since it pretty much guarantees userspace won't be able to save/restore correctly. While we describe NV up front, we've been talking about allowing userspace to make the E2H0 v. VHE selection through the ID registers as well. My suggestion is that we have userspace select the flavor of NV it wants (E2H0 v. VHE) up front and completely nuke from orbit the idea of late feature fixups. This would solve your example of XNX, as we'd apply NV enforcement early and not expose the feature. Same would also hold for VHE-only features, such as recursive NV. Hopefully I made more sense this time around :) -- Thanks, Oliver
Hi Eric, On Thu, 16 Jan 2025 17:52:10 +0000, Eric Auger <eauger@redhat.com> wrote: > > >>>> I tested on AmpereOne and Grace-Hopper systems > >>> > >>> Thanks for the confirmation. I haven't had a chance to try QEMU yet, > >>> but I expect that save/restore will not work. Something to look into, > >> I have not tested this yet.I will give it a try. > > > > Thanks, that'd be super helpful. > > I confirm the migration fails when putting some registers on the dest > side. I will further investigate. Thanks for confirming. I suspect that the kernel visibility bits are not exactly correct (or depend on some odd ordering). If you can point me to a register that fails to restore correctly, that'd be useful. Hopefully, most EL2-specific registers suffer from the same defect. Cheers, M. -- Without deviation from the norm, progress is not possible.
On Thu, 19 Dec 2024 09:30:11 +0000, Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: > > > > On 09-12-2024 11:04 pm, Eric Auger wrote: > > Hi Marc, > > > > On 12/9/24 18:30, Marc Zyngier wrote: > >> On Mon, 09 Dec 2024 16:46:30 +0000, > >> Eric Auger <eauger@redhat.com> wrote: > >>> > >>> Hi, > >>> > >>> On 12/9/24 16:39, Ganapatrao Kulkarni wrote: > >>>> > >>>> > >>>> On 09-12-2024 06:50 pm, Marc Zyngier wrote: > >>>>> On Mon, 09 Dec 2024 12:25:34 +0000, > >>>>> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: > >>>>>>>> > >>>>>>>> During automated testing of Nested Virtualization using avocado-vt, > >>>>>>> > >>>>>>> Which is not merged upstream. So what branch are you using? Based on > >>>>>>> what kernel version? On what HW? With which virtualisation features? > >>>>>>> > >>>>>> > >>>>>> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel > >>>>>> with NV patches from your repo. > >>>>> > >>>>> Grmbl... *Which* patches? At least give me the SHA1 of the branch, > >>>>> because I have no idea what you are running. And 6.10 is definitely > >>>>> not something I care about. If you're using the NV patches, the > >>>>> *minimum* you should run is 6.13-rc1, because that's what the current > >>>>> code is based on. > >>>>> > >>>> > >>>> I tried 6.13-rc1 based nv-next branch today, which failed to boot > >>>> UEFI as L1. Yet to debug this. > >>> > >>> I confirm am stuck with the same issue with nv-next on AmpereOne. > >> > >> All I can say is that it Works For Me (TM) on M2 and Snapdragon, using > >> kvmtool, a fresh EDK2 built by Ard, and with the guest running purely > >> VHE (FEAT_E2H0 not implemented). > >> > >> I haven't tried QEMU, I don't have an AmpereOne. I have also spent way > >> too much time on this recently. So I'm afraid people interested in > >> making this particular combination work will have to debug it. > > > > yes I do agree. I am currently setting up kvmtool to double check and I > > will debug the qemu issue. > > I could boot Guest-hypervisor using QEMU with -bios and nv-next as L0 > with edk2 (QEMU_EFI.fd) built from upstream tree and fedora-41 as > Guest-Hypervisor. Cool. This shows that Ard's hard work on EDK2 is paying off. > > The issue was disappeared when upgraded from fc39 to fc41. Same was > experienced with the kvmtool boot as well(KVMTOOL_EFI.fd). I guess the fc39 kernel is too old to understand the VHE-only ID registers, which was introduced in 6.9. Once you teach QEMU how to disable VHE, you should be able to boot older kernels that predate this change. M. -- Without deviation from the norm, progress is not possible.
On Mon, 09 Dec 2024 15:39:28 +0000, Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: > > > > On 09-12-2024 06:50 pm, Marc Zyngier wrote: > > On Mon, 09 Dec 2024 12:25:34 +0000, > > Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote: > >>>> > >>>> During automated testing of Nested Virtualization using avocado-vt, > >>> > >>> Which is not merged upstream. So what branch are you using? Based on > >>> what kernel version? On what HW? With which virtualisation features? > >>> > >> > >> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel > >> with NV patches from your repo. > > > > Grmbl... *Which* patches? At least give me the SHA1 of the branch, > > because I have no idea what you are running. And 6.10 is definitely > > not something I care about. If you're using the NV patches, the > > *minimum* you should run is 6.13-rc1, because that's what the current > > code is based on. > > > > I tried 6.13-rc1 based nv-next branch today, which failed to boot > UEFI as L1. Yet to debug this. Works nicely here with kvmtool as the VMM. From what I understand, EDK2 needs some surgery to correctly boot at EL2 without FEAT_E2H0. > We do have the FEAT_ECV on AmpereOne, I was the one reported/fixed bug > with FEAT_ECV(CNTPOFF offset issue) in the past. Sorry, I don't keep track of the feature set for machines I don't have access to. M. -- Without deviation from the norm, progress is not possible.
On Mon, 09 Dec 2024 13:20:48 +0000,
Marc Zyngier <maz@kernel.org> wrote:
>
> So here's my current guess, since you don't give me the needed
> information. For what you describe to happen, I can only see two
> possibilities:
>
> - either your HW doesn't have FEAT_ECV, in which case the guest
> directly reads from memory
>
> - or you are running with something like this patch [1], and we serve
> the guest by reading from memory very early, without returning to
> the bulk of the emulation code
>
> In either case, we only publish the updated status if the current IRQ
> state is different from the computed output of the timer while
> performing the emulation.
>
> So if you were writing back a status bit set to 0 while the interrupt
> was already pending, we'd deliver an interrupt, but not recompute the
> status. The guest would consider the interrupt as spurious, not touch
> the timer, and we'd never make forward progress. Rinse, repeat.
>
> Assuming I got the analysis right, it would only be a matter of
> hoisting the publication of the status into timer_emulate(), so that
> it is made up to date on load.
>
> Please give the fixup below a go.
Plus this on top for a good measure:
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 91bda986c344b..c71193a7bb9c5 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -968,9 +968,6 @@ void kvm_timer_sync_nested(struct kvm_vcpu *vcpu)
* which allows trapping of the timer registers even with NV2.
* Still, this is still worse than FEAT_NV on its own. Meh.
*/
- if (cpus_have_final_cap(ARM64_HAS_ECV) || !is_hyp_ctxt(vcpu))
- return;
-
if (!vcpu_el2_e2h_is_set(vcpu)) {
/*
* A non-VHE guest hypervisor doesn't have any direct access
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ff62b8b55b46e..1b8bb30dbb2ff 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1257,7 +1257,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
kvm_timer_sync_user(vcpu);
- if (vcpu_has_nv(vcpu))
+ if (is_hyp_ctxt(vcpu))
kvm_timer_sync_nested(vcpu);
kvm_arch_vcpu_ctxsync_fp(vcpu);
M.
--
Without deviation from the norm, progress is not possible.
© 2016 - 2025 Red Hat, Inc.