During Windows Server 2025 hibernation, I have seen Windows' calculation
of interrupt target time get skewed over the hypervisor view of the same.
This can cause Windows to emit timer events in the past for events that
do not fire yet according to the real time source. This then leads to
interrupt storms in the guest which slow down execution to a point where
watchdogs trigger. Those manifest as bugchecks 0x9f and 0xa0 during
hibernation, typically in the resume path.
To work around this problem, we can delay timers that get created with a
target time in the past by a tiny bit (10µs) to give the guest CPU time
to process real work and make forward progress, hopefully recovering its
interrupt logic in the process. While this small delay can marginally
reduce accuracy of guest timers, 10µs are within the noise of VM
entry/exit overhead (~1-2 µs) so I do not expect to see real world impact.
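(For reference: the Hyper-V time reference counter ticks in 100ns units, so
the 10µs delay corresponds to padding the count by 100 ticks in the code
below, i.e. 100 * 100ns = 10,000ns = 10µs.)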
To still provide some level of visibility when this happens, add a trace
point that clearly shows the discrepancy between the target time and the
current time.
Signed-off-by: Alexander Graf <graf@amazon.com>
---
arch/x86/kvm/hyperv.c | 22 ++++++++++++++++++----
arch/x86/kvm/trace.h | 26 ++++++++++++++++++++++++++
2 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 72b19a88a776..c41061acbcbc 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -666,13 +666,27 @@ static int stimer_start(struct kvm_vcpu_hv_stimer *stimer)
stimer->exp_time = stimer->count;
if (time_now >= stimer->count) {
/*
- * Expire timer according to Hypervisor Top-Level Functional
- * specification v4(15.3.1):
+ * Hypervisor Top-Level Functional specification v4(15.3.1):
* "If a one shot is enabled and the specified count is in
* the past, it will expire immediately."
+ *
+ * However, there are cases during hibernation when Windows's
+ * interrupt count calculation can go out of sync with KVM's
+ * view of it, causing Windows to emit timer events in the past
+ * for events that do not fire yet according to the real time
+ * source. This then leads to interrupt storms in the guest
+ * which slow down execution to a point where watchdogs trigger.
+ *
+ * Instead of taking TLFS literally on what "immediately" means,
+ * give the guest at least 10µs to process work. While this can
+ * marginally reduce accuracy of guest timers, 10µs are within
+ * the noise of VM entry/exit overhead (~1-2 µs).
*/
- stimer_mark_pending(stimer, false);
- return 0;
+ trace_kvm_hv_stimer_start_expired(
+ hv_stimer_to_vcpu(stimer)->vcpu_id,
+ stimer->index,
+ time_now, stimer->count);
+ stimer->count = time_now + 100;
}
trace_kvm_hv_stimer_start_one_shot(hv_stimer_to_vcpu(stimer)->vcpu_id,
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 57d79fd31df0..f9e69c4d9e9b 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1401,6 +1401,32 @@ TRACE_EVENT(kvm_hv_stimer_start_one_shot,
__entry->count)
);
+/*
+ * Tracepoint for stimer_start(one-shot timer already expired).
+ */
+TRACE_EVENT(kvm_hv_stimer_start_expired,
+ TP_PROTO(int vcpu_id, int timer_index, u64 time_now, u64 count),
+ TP_ARGS(vcpu_id, timer_index, time_now, count),
+
+ TP_STRUCT__entry(
+ __field(int, vcpu_id)
+ __field(int, timer_index)
+ __field(u64, time_now)
+ __field(u64, count)
+ ),
+
+ TP_fast_assign(
+ __entry->vcpu_id = vcpu_id;
+ __entry->timer_index = timer_index;
+ __entry->time_now = time_now;
+ __entry->count = count;
+ ),
+
+ TP_printk("vcpu_id %d timer %d time_now %llu count %llu (expired)",
+ __entry->vcpu_id, __entry->timer_index, __entry->time_now,
+ __entry->count)
+);
+
/*
* Tracepoint for stimer_timer_callback.
*/
--
2.47.1
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
On Thu, 2026-01-15 at 14:15 +0000, Alexander Graf wrote:
>
> +	 *
> +	 * However, there are cases during hibernation when Windows's
> +	 * interrupt count calculation can go out of sync with KVM's
> +	 * view of it, causing Windows to emit timer events in the past
> +	 * for events that do not fire yet according to the real time
> +	 * source. This then leads to interrupt storms in the guest
> +	 * which slow down execution to a point where watchdogs trigger.

Do these 'cases during hibernation' occur when the TSC page hasn't been
set up yet in the new environment? I note get_time_ref_counter() falls
back to get_kvmclock_ns() in that case, but get_kvmclock_ns() is known
to be hosed. We stopped using it for Xen timers for precisely that
reason; see commit 451a707813aee.

The detail in this case might be different, but we really should
*understand* why stimer_start() has a different idea of the time than
the guest does, and address that properly.

FWIW it's probably in the noise and not the actual cause of this error,
but it looks like stimer_start() might benefit from using
kvm_get_monotonic_and_clockread(), to get an actual paired reading from
the same TSC read instead of calling get_time_ref_counter() and then
*later* calling ktime_get() to read CLOCK_MONOTONIC at a slightly
different time.

You can pass a TSC reading into get_time_ref_counter(), can't you? Let
callers who don't care pass rdtsc() in.
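As a minimal sketch of the pairing David describes above, purely for
illustration: get_time_ref_counter_tsc() is a hypothetical variant that
takes a TSC value, and this assumes kvm_get_monotonic_and_clockread()
(the helper the Xen timer code uses) is reachable from hyperv.c. It is
not part of the posted patch.

	s64 kernel_ns;
	u64 host_tsc, time_now;
	ktime_t ktime_now;

	/*
	 * Sketch: derive the Hyper-V reference time and CLOCK_MONOTONIC
	 * from a single TSC read instead of two separate readings.
	 * get_time_ref_counter_tsc() is hypothetical; it would convert
	 * the given TSC value the same way get_time_ref_counter() does
	 * after its own rdtsc().
	 */
	if (kvm_get_monotonic_and_clockread(&kernel_ns, &host_tsc)) {
		time_now = get_time_ref_counter_tsc(hv_stimer_to_vcpu(stimer)->kvm,
						    host_tsc);
		ktime_now = ns_to_ktime(kernel_ns);
	} else {
		/* Host clocksource is not TSC-based; keep the current behaviour. */
		time_now = get_time_ref_counter(hv_stimer_to_vcpu(stimer)->kvm);
		ktime_now = ktime_get();
	}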
On Thu, Jan 15, 2026, Alexander Graf wrote:
> During Windows Server 2025 hibernation, I have seen Windows' calculation
> of interrupt target time get skewed over the hypervisor view of the same.
> This can cause Windows to emit timer events in the past for events that
> do not fire yet according to the real time source. This then leads to
> interrupt storms in the guest which slow down execution to a point where
> watchdogs trigger. Those manifest as bugchecks 0x9f and 0xa0 during
> hibernation, typically in the resume path.
>
> To work around this problem, we can delay timers that get created with a
> target time in the past by a tiny bit (10µs) to give the guest CPU time
> to process real work and make forward progress, hopefully recovering its
> interrupt logic in the process. While this small delay can marginally
> reduce accuracy of guest timers, 10µs are within the noise of VM
> entry/exit overhead (~1-2 µs) so I do not expect to see real world impact.

There is a lot of hope piled into this. And *always* padding the count makes me
more than a bit uncomfortable. If the skew is really due to a guest bug and not
something on the host's side, i.e. if this isn't just a symptom of a real bug that
can be fixed and the _only_ option is to chuck in a workaround, then I would
strongly prefer to be as conservative as possible. E.g. is it possible to
precisely detect this scenario and only add the delay when the guest appears to
be stuck?

> To still provide some level of visibility when this happens, add a trace
> point that clearly shows the discrepancy between the target time and the
> current time.

This honestly doesn't seem all that useful. As a debug tool, sure, but once the
workaround is in place, it doesn't seem like it'll add a lot of value since it
would require the end user to be aware of the workaround in the first place.

If we really want something, a stat or a pr_xxx_once() (even though I generally
dislike those) seems like it'd be more helpful.
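For illustration only, one shape the "pr_xxx_once()" alternative Sean
mentions could take in the padding path; the message text and placement
are assumptions, not part of the posted patch:

	if (time_now >= stimer->count) {
		/*
		 * Illustrative sketch: make the workaround visible without
		 * requiring tracing to be enabled.  A per-VM stat would be
		 * another option.
		 */
		pr_warn_once("KVM: Hyper-V stimer %d on vCPU %d armed in the past (now %llu, count %llu), delaying expiry\n",
			     stimer->index, hv_stimer_to_vcpu(stimer)->vcpu_id,
			     time_now, stimer->count);
		stimer->count = time_now + 100;
	}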
On 23.01.26 19:21, Sean Christopherson wrote:
> On Thu, Jan 15, 2026, Alexander Graf wrote:
>> During Windows Server 2025 hibernation, I have seen Windows' calculation
>> of interrupt target time get skewed over the hypervisor view of the same.
>> This can cause Windows to emit timer events in the past for events that
>> do not fire yet according to the real time source. This then leads to
>> interrupt storms in the guest which slow down execution to a point where
>> watchdogs trigger. Those manifest as bugchecks 0x9f and 0xa0 during
>> hibernation, typically in the resume path.
>>
>> To work around this problem, we can delay timers that get created with a
>> target time in the past by a tiny bit (10µs) to give the guest CPU time
>> to process real work and make forward progress, hopefully recovering its
>> interrupt logic in the process. While this small delay can marginally
>> reduce accuracy of guest timers, 10µs are within the noise of VM
>> entry/exit overhead (~1-2 µs) so I do not expect to see real world impact.
> There is a lot of hope piled into this. And *always* padding the count makes me
> more than a bit uncomfortable. If the skew is really due to a guest bug and not
> something on the host's side, i.e. if this isn't just a symptom of a real bug that
> can be fixed and the _only_ option is to chuck in a workaround, then I would
> strongly prefer to be as conservative as possible. E.g. is it possible to
> precisely detect this scenario and only add the delay when the guest appears to
> be stuck?

This patch only pads when a timer is in the past, which I have not seen
happen much on real systems. Usually you're trying to configure a timer
for the future :).

That said, I have continued digging deeper since I posted this patch and
I'm still trying to wrap my head around under which exact conditions any
of this really does happen. Let's put this patch on hold until I have a
more reliable reproducer.


Alex
Alexander Graf <graf@amazon.com> writes:
> On 23.01.26 19:21, Sean Christopherson wrote:
>> On Thu, Jan 15, 2026, Alexander Graf wrote:
>>> During Windows Server 2025 hibernation, I have seen Windows' calculation
>>> of interrupt target time get skewed over the hypervisor view of the same.
>>> This can cause Windows to emit timer events in the past for events that
>>> do not fire yet according to the real time source. This then leads to
>>> interrupt storms in the guest which slow down execution to a point where
>>> watchdogs trigger. Those manifest as bugchecks 0x9f and 0xa0 during
>>> hibernation, typically in the resume path.
>>>
>>> To work around this problem, we can delay timers that get created with a
>>> target time in the past by a tiny bit (10µs) to give the guest CPU time
>>> to process real work and make forward progress, hopefully recovering its
>>> interrupt logic in the process. While this small delay can marginally
>>> reduce accuracy of guest timers, 10µs are within the noise of VM
>>> entry/exit overhead (~1-2 µs) so I do not expect to see real world impact.
>> There is a lot of hope piled into this. And *always* padding the count makes me
>> more than a bit uncomfortable. If the skew is really due to a guest bug and not
>> something on the host's side, i.e. if this isn't just a symptom of a real bug that
>> can be fixed and the _only_ option is to chuck in a workaround, then I would
>> strongly prefer to be as conservative as possible. E.g. is it possible to
>> precisely detect this scenario and only add the delay when the guest appears to
>> be stuck?
>
>
> This patch only pads when a timer is in the past, which I have not seen
> happen much on real systems. Usually you're trying to configure a timer
> for the future :).
>
> That said, I have continued digging deeper since I posted this patch and
> I'm still trying to wrap my head around under which exact conditions any
> of this really does happen. Let's put this patch on hold until I have a
> more reliable reproducer.
My bet goes to the clocksource switch, e.g. the guest disables (or just
stops using, good luck detecting that :-) ) TSC page and uses raw TSC
for some period or something.
I remember we had to add some fairly ugly hacks where we also "piled a
lot of hope", e.g.:
commit 0469f2f7ab4c6a6cae4b74c4f981c4da6d909411
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Tue Mar 16 15:37:36 2021 +0100
KVM: x86: hyper-v: Don't touch TSC page values when guest opted for re-enlightenment
Also, AFAIR we don't currently implement "Synthetic Time-Unhalted Timer"
from TLFS and who knows, maybe Windows' behavior is going to change when
we do...
--
Vitaly