During Windows Server 2025 hibernation, I have seen Windows' calculation
of interrupt target time get skewed over the hypervisor view of the same.
This can cause Windows to emit timer events in the past for events that
do not fire yet according to the real time source. This then leads to
interrupt storms in the guest which slow down execution to a point where
watchdogs trigger. Those manifest as bugchecks 0x9f and 0xa0 during
hibernation, typically in the resume path.
To work around this problem, we can delay timers that get created with a
target time in the past by a tiny bit (10µs) to give the guest CPU time
to process real work and make forward progress, hopefully recovering its
interrupt logic in the process. While this small delay can marginally
reduce accuracy of guest timers, 10µs are within the noise of VM
entry/exit overhead (~1-2 µs) so I do not expect to see real world impact.
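(For reference: the Hyper-V time reference counter ticks in 100ns units, so
the 10µs delay corresponds to padding the count by 100 ticks in the code
below, i.e. 100 * 100ns = 10,000ns = 10µs.)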
To still provide some level of visibility when this happens, add a trace
point that clearly shows the discrepancy between the target time and the
current time.
Signed-off-by: Alexander Graf <graf@amazon.com>
---
arch/x86/kvm/hyperv.c | 22 ++++++++++++++++++----
arch/x86/kvm/trace.h | 26 ++++++++++++++++++++++++++
2 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 72b19a88a776..c41061acbcbc 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -666,13 +666,27 @@ static int stimer_start(struct kvm_vcpu_hv_stimer *stimer)
stimer->exp_time = stimer->count;
if (time_now >= stimer->count) {
/*
- * Expire timer according to Hypervisor Top-Level Functional
- * specification v4(15.3.1):
+ * Hypervisor Top-Level Functional specification v4(15.3.1):
* "If a one shot is enabled and the specified count is in
* the past, it will expire immediately."
+ *
+ * However, there are cases during hibernation when Windows's
+ * interrupt count calculation can go out of sync with KVM's
+ * view of it, causing Windows to emit timer events in the past
+ * for events that do not fire yet according to the real time
+ * source. This then leads to interrupt storms in the guest
+ * which slow down execution to a point where watchdogs trigger.
+ *
+ * Instead of taking TLFS literally on what "immediately" means,
+ * give the guest at least 10µs to process work. While this can
+ * marginally reduce accuracy of guest timers, 10µs are within
+ * the noise of VM entry/exit overhead (~1-2 µs).
*/
- stimer_mark_pending(stimer, false);
- return 0;
+ trace_kvm_hv_stimer_start_expired(
+ hv_stimer_to_vcpu(stimer)->vcpu_id,
+ stimer->index,
+ time_now, stimer->count);
+ stimer->count = time_now + 100;
}
trace_kvm_hv_stimer_start_one_shot(hv_stimer_to_vcpu(stimer)->vcpu_id,
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 57d79fd31df0..f9e69c4d9e9b 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1401,6 +1401,32 @@ TRACE_EVENT(kvm_hv_stimer_start_one_shot,
__entry->count)
);
+/*
+ * Tracepoint for stimer_start(one-shot timer already expired).
+ */
+TRACE_EVENT(kvm_hv_stimer_start_expired,
+ TP_PROTO(int vcpu_id, int timer_index, u64 time_now, u64 count),
+ TP_ARGS(vcpu_id, timer_index, time_now, count),
+
+ TP_STRUCT__entry(
+ __field(int, vcpu_id)
+ __field(int, timer_index)
+ __field(u64, time_now)
+ __field(u64, count)
+ ),
+
+ TP_fast_assign(
+ __entry->vcpu_id = vcpu_id;
+ __entry->timer_index = timer_index;
+ __entry->time_now = time_now;
+ __entry->count = count;
+ ),
+
+ TP_printk("vcpu_id %d timer %d time_now %llu count %llu (expired)",
+ __entry->vcpu_id, __entry->timer_index, __entry->time_now,
+ __entry->count)
+);
+
/*
* Tracepoint for stimer_timer_callback.
*/
--
2.47.1
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
On Thu, 2026-01-15 at 14:15 +0000, Alexander Graf wrote:
>
> +	 *
> +	 * However, there are cases during hibernation when Windows's
> +	 * interrupt count calculation can go out of sync with KVM's
> +	 * view of it, causing Windows to emit timer events in the past
> +	 * for events that do not fire yet according to the real time
> +	 * source. This then leads to interrupt storms in the guest
> +	 * which slow down execution to a point where watchdogs trigger.

Do these 'cases during hibernation' occur when the TSC page hasn't been
set up yet in the new environment? I note get_time_ref_counter() falls
back to get_kvmclock_ns() in that case, but get_kvmclock_ns() is known
to be hosed. We stopped using it for Xen timers for precisely that
reason; see commit 451a707813aee.

The detail in this case might be different, but we really should
*understand* why stimer_start() has a different idea of the time than
the guest does, and address that properly.

FWIW it's probably in the noise and not the actual cause of this error,
but it looks like stimer_start() might benefit from using
kvm_get_monotonic_and_clockread(), to get an actual paired reading from
the same TSC read instead of calling get_time_ref_counter() and then
*later* calling ktime_get() to read CLOCK_MONOTONIC at a slightly
different time.

You can pass a TSC reading into get_time_ref_counter(), can't you? Let
callers who don't care pass rdtsc() in.
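As a minimal sketch of the pairing David describes above, purely for
illustration: get_time_ref_counter_tsc() is a hypothetical variant that
takes a TSC value, and this assumes kvm_get_monotonic_and_clockread()
(the helper the Xen timer code uses) is reachable from hyperv.c. It is
not part of the posted patch.

	s64 kernel_ns;
	u64 host_tsc, time_now;
	ktime_t ktime_now;

	/*
	 * Sketch: derive the Hyper-V reference time and CLOCK_MONOTONIC
	 * from a single TSC read instead of two separate readings.
	 * get_time_ref_counter_tsc() is hypothetical; it would convert
	 * the given TSC value the same way get_time_ref_counter() does
	 * after its own rdtsc().
	 */
	if (kvm_get_monotonic_and_clockread(&kernel_ns, &host_tsc)) {
		time_now = get_time_ref_counter_tsc(hv_stimer_to_vcpu(stimer)->kvm,
						    host_tsc);
		ktime_now = ns_to_ktime(kernel_ns);
	} else {
		/* Host clocksource is not TSC-based; keep the current behaviour. */
		time_now = get_time_ref_counter(hv_stimer_to_vcpu(stimer)->kvm);
		ktime_now = ktime_get();
	}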
On Thu, Jan 15, 2026, Alexander Graf wrote:
> During Windows Server 2025 hibernation, I have seen Windows' calculation
> of interrupt target time get skewed over the hypervisor view of the same.
> This can cause Windows to emit timer events in the past for events that
> do not fire yet according to the real time source. This then leads to
> interrupt storms in the guest which slow down execution to a point where
> watchdogs trigger. Those manifest as bugchecks 0x9f and 0xa0 during
> hibernation, typically in the resume path.
>
> To work around this problem, we can delay timers that get created with a
> target time in the past by a tiny bit (10µs) to give the guest CPU time
> to process real work and make forward progress, hopefully recovering its
> interrupt logic in the process. While this small delay can marginally
> reduce accuracy of guest timers, 10µs are within the noise of VM
> entry/exit overhead (~1-2 µs) so I do not expect to see real world impact.

There is a lot of hope piled into this. And *always* padding the count makes me
more than a bit uncomfortable. If the skew is really due to a guest bug and not
something on the host's side, i.e. if this isn't just a symptom of a real bug that
can be fixed and the _only_ option is to chuck in a workaround, then I would
strongly prefer to be as conservative as possible. E.g. is it possible to
precisely detect this scenario and only add the delay when the guest appears to
be stuck?

> To still provide some level of visibility when this happens, add a trace
> point that clearly shows the discrepancy between the target time and the
> current time.

This honestly doesn't seem all that useful. As a debug tool, sure, but once the
workaround is in place, it doesn't seem like it'll add a lot of value since it
would require the end user to be aware of the workaround in the first place.

If we really want something, a stat or a pr_xxx_once() (even though I generally
dislike those) seems like it'd be more helpful.
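For illustration only, one shape the "pr_xxx_once()" alternative Sean
mentions could take in the padding path; the message text and placement
are assumptions, not part of the posted patch:

	if (time_now >= stimer->count) {
		/*
		 * Illustrative sketch: make the workaround visible without
		 * requiring tracing to be enabled.  A per-VM stat would be
		 * another option.
		 */
		pr_warn_once("KVM: Hyper-V stimer %d on vCPU %d armed in the past (now %llu, count %llu), delaying expiry\n",
			     stimer->index, hv_stimer_to_vcpu(stimer)->vcpu_id,
			     time_now, stimer->count);
		stimer->count = time_now + 100;
	}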
On 23.01.26 19:21, Sean Christopherson wrote:
> On Thu, Jan 15, 2026, Alexander Graf wrote:
>> During Windows Server 2025 hibernation, I have seen Windows' calculation
>> of interrupt target time get skewed over the hypervisor view of the same.
>> This can cause Windows to emit timer events in the past for events that
>> do not fire yet according to the real time source. This then leads to
>> interrupt storms in the guest which slow down execution to a point where
>> watchdogs trigger. Those manifest as bugchecks 0x9f and 0xa0 during
>> hibernation, typically in the resume path.
>>
>> To work around this problem, we can delay timers that get created with a
>> target time in the past by a tiny bit (10µs) to give the guest CPU time
>> to process real work and make forward progress, hopefully recovering its
>> interrupt logic in the process. While this small delay can marginally
>> reduce accuracy of guest timers, 10µs are within the noise of VM
>> entry/exit overhead (~1-2 µs) so I do not expect to see real world impact.
> There is a lot of hope piled into this. And *always* padding the count makes me
> more than a bit uncomfortable. If the skew is really due to a guest bug and not
> something on the host's side, i.e. if this isn't just a symptom of a real bug that
> can be fixed and the _only_ option is to chuck in a workaround, then I would
> strongly prefer to be as conservative as possible. E.g. is it possible to
> precisely detect this scenario and only add the delay when the guest appears to
> be stuck?

This patch only pads when a timer is in the past, which I have not seen
happen much on real systems. Usually you're trying to configure a timer
for the future :).

That said, I have continued digging deeper since I posted this patch and
I'm still trying to wrap my head around under which exact conditions any
of this really does happen. Let's put this patch on hold until I have a
more reliable reproducer.


Alex
Alexander Graf <graf@amazon.com> writes:
> On 23.01.26 19:21, Sean Christopherson wrote:
>> On Thu, Jan 15, 2026, Alexander Graf wrote:
>>> During Windows Server 2025 hibernation, I have seen Windows' calculation
>>> of interrupt target time get skewed over the hypervisor view of the same.
>>> This can cause Windows to emit timer events in the past for events that
>>> do not fire yet according to the real time source. This then leads to
>>> interrupt storms in the guest which slow down execution to a point where
>>> watchdogs trigger. Those manifest as bugchecks 0x9f and 0xa0 during
>>> hibernation, typically in the resume path.
>>>
>>> To work around this problem, we can delay timers that get created with a
>>> target time in the past by a tiny bit (10µs) to give the guest CPU time
>>> to process real work and make forward progress, hopefully recovering its
>>> interrupt logic in the process. While this small delay can marginally
>>> reduce accuracy of guest timers, 10µs are within the noise of VM
>>> entry/exit overhead (~1-2 µs) so I do not expect to see real world impact.
>> There is a lot of hope piled into this. And *always* padding the count makes me
>> more than a bit uncomfortable. If the skew is really due to a guest bug and not
>> something on the host's side, i.e. if this isn't just a symptom of a real bug that
>> can be fixed and the _only_ option is to chuck in a workaround, then I would
>> strongly prefer to be as conservative as possible. E.g. is it possible to
>> precisely detect this scenario and only add the delay when the guest appears to
>> be stuck?
>
>
> This patch only pads when a timer is in the past, which I have not seen
> happen much on real systems. Usually you're trying to configure a timer
> for the future :).
>
> That said, I have continued digging deeper since I posted this patch and
> I'm still trying to wrap my head around under which exact conditions any
> of this really does happen. Let's put this patch on hold until I have a
> more reliable reproducer.
My bet goes to the clocksource switch, e.g. the guest disables (or just
stops using, good luck detecting that :-) ) TSC page and uses raw TSC
for some period or something.
I remember we had to add some fairly ugly hacks where we also "piled a
lot of hope", e.g.:
commit 0469f2f7ab4c6a6cae4b74c4f981c4da6d909411
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Tue Mar 16 15:37:36 2021 +0100
KVM: x86: hyper-v: Don't touch TSC page values when guest opted for re-enlightenment
Also, AFAIR we don't currently implement "Synthetic Time-Unhalted Timer"
from TLFS and who knows, maybe Windows' behavior is going to change when
we do...
--
Vitaly