[PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop

Sean Christopherson posted 16 patches 1 month, 1 week ago
There is a newer version of this series
[PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
Posted by Sean Christopherson 1 month, 1 week ago
Prefer the TSC over kvmclock for sched_clock if the TSC is constant,
nonstop, and not marked unstable via command line.  I.e. use the same
criteria as tweaking the clocksource rating so that TSC is preferred over
kvmclock.  Per the below comment from native_sched_clock(), sched_clock
is more tolerant of slop than clocksource; using TSC for clocksource but
not sched_clock makes little to no sense, especially now that KVM CoCo
guests with a trusted TSC use TSC, not kvmclock.

        /*
         * Fall back to jiffies if there's no TSC available:
         * ( But note that we still use it if the TSC is marked
         *   unstable. We do this because unlike Time Of Day,
         *   the scheduler clock tolerates small errors and it's
         *   very important for it to be as fast as the platform
         *   can achieve it. )
         */

The only advantage of using kvmclock is that doing so allows for early
and common detection of PVCLOCK_GUEST_STOPPED, but that code has been
broken for nearly two years with nary a complaint, i.e. it can't be
_that_ valuable.  And as above, certain types of KVM guests are losing
the functionality regardless, i.e. acknowledging PVCLOCK_GUEST_STOPPED
needs to be decoupled from sched_clock() no matter what.

Link: https://lore.kernel.org/all/Z4hDK27OV7wK572A@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/kvmclock.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 9d05d070fe25..fb8cd8313d18 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -344,23 +344,23 @@ void __init kvmclock_init(void)
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
 	/*
-	 * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
-	 * with P/T states and does not stop in deep C-states.
-	 *
-	 * Invariant TSC exposed by host means kvmclock is not necessary:
-	 * can use TSC as clocksource.
-	 *
+	 * If the TSC counts at a constant frequency across P/T states, counts
+	 * in deep C-states, and the TSC hasn't been marked unstable, prefer
+	 * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
+	 * that TSC is chosen as the clocksource.  Note, the TSC unstable check
+	 * exists purely to honor the TSC being marked unstable via command
+	 * line, any runtime detection of an unstable will happen after this.
 	 */
 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
 	    !check_tsc_unstable()) {
 		kvm_clock.rating = 299;
 		tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
+	} else {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
 	}
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
-
 	tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_cpu_khz,
 					  tsc_properties);
 
@@ -369,6 +369,11 @@ void __init kvmclock_init(void)
 #ifdef CONFIG_X86_LOCAL_APIC
 	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
 #endif
+	/*
+	 * Save/restore "sched" clock state even if kvmclock isn't being used
+	 * for sched_clock, as kvmclock is still used for wallclock and relies
+	 * on these hooks to re-enable kvmclock after suspend+resume.
+	 */
 	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
 	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
 	kvm_get_preset_lpj();
-- 
2.48.1.362.g079036d154-goog
Re: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
Posted by Sean Christopherson 1 month ago
Dropping a few people/lists whose emails are bouncing.

On Fri, Jan 31, 2025, Sean Christopherson wrote:
> @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
>  #ifdef CONFIG_X86_LOCAL_APIC
>  	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
>  #endif
> +	/*
> +	 * Save/restore "sched" clock state even if kvmclock isn't being used
> +	 * for sched_clock, as kvmclock is still used for wallclock and relies
> +	 * on these hooks to re-enable kvmclock after suspend+resume.

This is wrong, wallclock is a different MSR entirely.

> +	 */
>  	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
>  	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;

And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being
used as sched_clock, because when TSC is reset on suspend/hiberation, not doing
tsc_{save,restore}_sched_clock_state() results in time going haywire.

Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV
sched_clock in CoCo guests with trusted TSC" because pulling the rug out from
under kvmclock leads to the same problem.

The whole PV sched_clock scheme is a disaster.

Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks,
because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for
sched_clock.  And the code is all kinds of funky, because it tries to keep the
x86 code isolated from the generic HV clock code, but (a) there's already x86 PV
specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code
means that Hyper-V overides the sched_clock save/restore hooks even when PARAVIRT=n,
i.e. when HV clock can't possibly be used as sched_clock.

VMware appears to be buggy and doesn't do have offset adjustments, and also lets
the TSC callbacks run.

I can't tell if Xen is broken, or if it's the sanest of the bunch.  Xen does
save/restore things a la kvmclock, but only in the Xen PV suspend path.  So if
the "normal" suspend/hibernate paths are unreachable, Xen is sane.  If not, Xen
is quite broken.

To make matters worse, kvmclock is a mess and has existing bugs.  The BSP's clock
is disabled during syscore_suspend() (via kvm_suspend()), but only re-enabled in
the sched_clock callback.  So if suspend is aborted due to a late wakeup, the BSP
will run without its clock enabled, which "works" only because KVM-the-hypervisor
is kind enough to not clobber the shared memory when the clock is disabled.  But
over time, I would expect time on the BSP to drift from APs.

And then there's this crud:

  #ifdef CONFIG_X86_LOCAL_APIC
	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
  #endif

which (a) should be guarded by CONFIG_SMP, not X86_LOCAL_APIC, and (b) is only
actually needed when kvmclock is sched_clock, because timekeeping doesn't actually
need to start that early.  But of course kvmclock craptastic handling of suspend
and resume makes untangling that more difficult than it needs to be.

The icing on the cake is that after cleaning up all the hacks, and having
kvmclock hook clocksource.suspend/resume like it should, suspend/resume under
kvmclock corrupts wall clock time because timekeeping_resume() reads the persistent
clock before resuming clocksource clocks, and the stupid kvmclock wall clock subtly
consumes the clocksource/system clock.  *sigh*

I have yet more patches to clean all of this up.  The series is rather unwieldly,
as it's now sitting at 38 patches (ugh), but I don't see a way to chunk it up in
a meaningful way, because everything is so intertwined.  :-/
RE: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
Posted by Michael Kelley 1 month ago
From: Sean Christopherson <seanjc@google.com> Sent: Friday, February 7, 2025 9:23 AM
> 
> Dropping a few people/lists whose emails are bouncing.
> 
> On Fri, Jan 31, 2025, Sean Christopherson wrote:
> > @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
> >  #ifdef CONFIG_X86_LOCAL_APIC
> >  	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
> >  #endif
> > +	/*
> > +	 * Save/restore "sched" clock state even if kvmclock isn't being used
> > +	 * for sched_clock, as kvmclock is still used for wallclock and relies
> > +	 * on these hooks to re-enable kvmclock after suspend+resume.
> 
> This is wrong, wallclock is a different MSR entirely.
> 
> > +	 */
> >  	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> >  	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
> 
> And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being
> used as sched_clock, because when TSC is reset on suspend/hiberation, not doing
> tsc_{save,restore}_sched_clock_state() results in time going haywire.
> 
> Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV
> sched_clock in CoCo guests with trusted TSC" because pulling the rug out from
> under kvmclock leads to the same problem.
> 
> The whole PV sched_clock scheme is a disaster.
> 
> Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks,
> because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for
> sched_clock.  And the code is all kinds of funky, because it tries to keep the
> x86 code isolated from the generic HV clock code, but (a) there's already x86 PV
> specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code
> means that Hyper-V overides the sched_clock save/restore hooks even when
> PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock.

Regarding (a), the one occurrence of x86 PV-specific code hyperv_timer.c is
the call to paravirt_set_sched_clock(), and it's under an #ifdef sequence so that
it's not built if targeting some other architecture. Or do you see something else
that is x86-specific?

Regarding (b), in drivers/hv/Kconfig, CONFIG_HYPERV always selects PARAVIRT.
So the #else clause (where PARAVIRT=n) in that #ifdef sequence could arguably
have a BUILD_BUG() added. If I recall correctly, other Hyper-V stuff breaks if
PARAVIRT is forced to "n". So I don't think there's a current problem with the
sched_clock save/restore hooks. But I would be good with some restructuring
so that setting the sched clock save/restore hooks is more closely tied to the
sched clock choice, as long as the architecture independence of hyperv_timer.c
is preserved. And maybe there's a better way to handle hv_setup_sched_clock()
that is less messy with the #ifdef's. I'll think about that too.

Michael

> 
> VMware appears to be buggy and doesn't do have offset adjustments, and also lets
> the TSC callbacks run.
> 
> I can't tell if Xen is broken, or if it's the sanest of the bunch.  Xen does
> save/restore things a la kvmclock, but only in the Xen PV suspend path.  So if
> the "normal" suspend/hibernate paths are unreachable, Xen is sane.  If not, Xen
> is quite broken.
> 
> To make matters worse, kvmclock is a mess and has existing bugs.  The BSP's clock
> is disabled during syscore_suspend() (via kvm_suspend()), but only re-enabled in
> the sched_clock callback.  So if suspend is aborted due to a late wakeup, the BSP
> will run without its clock enabled, which "works" only because KVM-the-hypervisor
> is kind enough to not clobber the shared memory when the clock is disabled.  But
> over time, I would expect time on the BSP to drift from APs.
> 
> And then there's this crud:
> 
>   #ifdef CONFIG_X86_LOCAL_APIC
> 	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
>   #endif
> 
> which (a) should be guarded by CONFIG_SMP, not X86_LOCAL_APIC, and (b) is only
> actually needed when kvmclock is sched_clock, because timekeeping doesn't actually
> need to start that early.  But of course kvmclock craptastic handling of suspend
> and resume makes untangling that more difficult than it needs to be.
> 
> The icing on the cake is that after cleaning up all the hacks, and having
> kvmclock hook clocksource.suspend/resume like it should, suspend/resume under
> kvmclock corrupts wall clock time because timekeeping_resume() reads the persistent
> clock before resuming clocksource clocks, and the stupid kvmclock wall clock subtly
> consumes the clocksource/system clock.  *sigh*
> 
> I have yet more patches to clean all of this up.  The series is rather unwieldly,
> as it's now sitting at 38 patches (ugh), but I don't see a way to chunk it up in
> a meaningful way, because everything is so intertwined.  :-/
Re: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
Posted by Sean Christopherson 1 month ago
On Sat, Feb 08, 2025, Michael Kelley wrote:
> From: Sean Christopherson <seanjc@google.com> Sent: Friday, February 7, 2025 9:23 AM
> > 
> > Dropping a few people/lists whose emails are bouncing.
> > 
> > On Fri, Jan 31, 2025, Sean Christopherson wrote:
> > > @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
> > >  #ifdef CONFIG_X86_LOCAL_APIC
> > >  	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
> > >  #endif
> > > +	/*
> > > +	 * Save/restore "sched" clock state even if kvmclock isn't being used
> > > +	 * for sched_clock, as kvmclock is still used for wallclock and relies
> > > +	 * on these hooks to re-enable kvmclock after suspend+resume.
> > 
> > This is wrong, wallclock is a different MSR entirely.
> > 
> > > +	 */
> > >  	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> > >  	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
> > 
> > And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being
> > used as sched_clock, because when TSC is reset on suspend/hiberation, not doing
> > tsc_{save,restore}_sched_clock_state() results in time going haywire.
> > 
> > Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV
> > sched_clock in CoCo guests with trusted TSC" because pulling the rug out from
> > under kvmclock leads to the same problem.
> > 
> > The whole PV sched_clock scheme is a disaster.
> > 
> > Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks,
> > because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for
> > sched_clock.  And the code is all kinds of funky, because it tries to keep the
> > x86 code isolated from the generic HV clock code, but (a) there's already x86 PV
> > specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code
> > means that Hyper-V overides the sched_clock save/restore hooks even when
> > PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock.
> 
> Regarding (a), the one occurrence of x86 PV-specific code hyperv_timer.c is
> the call to paravirt_set_sched_clock(), and it's under an #ifdef sequence so that
> it's not built if targeting some other architecture. Or do you see something else
> that is x86-specific?
> 
> Regarding (b), in drivers/hv/Kconfig, CONFIG_HYPERV always selects PARAVIRT.
> So the #else clause (where PARAVIRT=n) in that #ifdef sequence could arguably
> have a BUILD_BUG() added. If I recall correctly, other Hyper-V stuff breaks if
> PARAVIRT is forced to "n". So I don't think there's a current problem with the
> sched_clock save/restore hooks. i

Oh, there are no build issues, and all of the x86 bits are nicely cordoned off.
My complaint is essentially that they're _too_ isolated; putting the sched_clock
save/restore setup in arch/x86/kernel/cpu/mshyperv.c is well-intentioned, but IMO
it does more harm than good because the split makes it difficult to connect the
dots to hv_setup_sched_clock() in drivers/clocksource/hyperv_timer.c.

> But I would be good with some restructuring so that setting the sched clock
> save/restore hooks is more closely tied to the sched clock choice,

Yeah, this is the intent of my ranting.  After the dust settles, the code can
look like this.

---
#ifdef CONFIG_GENERIC_SCHED_CLOCK
static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
	/*
	 * We're on an architecture with generic sched clock (not x86/x64).
	 * The Hyper-V sched clock read function returns nanoseconds, not
	 * the normal 100ns units of the Hyper-V synthetic clock.
	 */
	sched_clock_register(sched_clock, 64, NSEC_PER_SEC);
}
#elif defined CONFIG_PARAVIRT
static u64 hv_ref_counter_at_suspend;
/*
 * Hyper-V clock counter resets during hibernation. Save and restore clock
 * offset during suspend/resume, while also considering the time passed
 * before suspend. This is to make sure that sched_clock using hv tsc page
 * based clocksource, proceeds from where it left off during suspend and
 * it shows correct time for the timestamps of kernel messages after resume.
 */
static void hv_save_sched_clock_state(void)
{
	hv_ref_counter_at_suspend = hv_read_reference_counter();
}

static void hv_restore_sched_clock_state(void)
{
	/*
	 * Adjust the offsets used by hv tsc clocksource to
	 * account for the time spent before hibernation.
	 * adjusted value = reference counter (time) at suspend
	 *                - reference counter (time) now.
	 */
	hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
}

static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
	/* We're on x86/x64 *and* using PV ops */
	paravirt_set_sched_clock(sched_clock, hv_save_sched_clock_state,
				 hv_restore_sched_clock_state);
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
#endif /* CONFIG_GENERIC_SCHED_CLOCK */
---

> as long as the architecture independence of hyperv_timer.c is preserved.

LOL, ah yes, the architecture independence of MSRs and TSC :-D

Teasing aside, the code is firmly x86-only at the moment.  It's selectable only
by x86:

  config HYPERV_TIMER
	def_bool HYPERV && X86
 
and since at least commit e39acc37db34 ("clocksource: hyper-v: Provide noinstr
sched_clock()") there are references to symbols/functions that are provided only
by x86.

I assume arm64 support is a WIP, but keeping the upstream code arch independent
isn't very realistic if the code can't be at least compile-tested.  To help
drive-by contributors like myself, maybe select HYPER_TIMER on arm64 for
COMPILE_TEST=y builds?

  config HYPERV_TIMER
	def_bool HYPERV && (X86 || (COMPILE_TEST && ARM64))

I have no plans to touch code outside of CONFIG_PARAVIRT, i.e. outside of code
that is explicitly x86-only, but something along those lines would help people
like me understand the goal/intent, and in theory would also help y'all maintain
the code by detecting breakage.
RE: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
Posted by Michael Kelley 4 weeks, 1 day ago
From: Sean Christopherson <seanjc@google.com> Sent: Monday, February 10, 2025 8:22 AM
> 
> On Sat, Feb 08, 2025, Michael Kelley wrote:
> > From: Sean Christopherson <seanjc@google.com> Sent: Friday, February 7, 2025 9:23 AM
> > >
> > > Dropping a few people/lists whose emails are bouncing.
> > >
> > > On Fri, Jan 31, 2025, Sean Christopherson wrote:
> > > > @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
> > > >  #ifdef CONFIG_X86_LOCAL_APIC
> > > >  	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
> > > >  #endif
> > > > +	/*
> > > > +	 * Save/restore "sched" clock state even if kvmclock isn't being used
> > > > +	 * for sched_clock, as kvmclock is still used for wallclock and relies
> > > > +	 * on these hooks to re-enable kvmclock after suspend+resume.
> > >
> > > This is wrong, wallclock is a different MSR entirely.
> > >
> > > > +	 */
> > > >  	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> > > >  	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
> > >
> > > And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being
> > > used as sched_clock, because when TSC is reset on suspend/hiberation, not doing
> > > tsc_{save,restore}_sched_clock_state() results in time going haywire.
> > >
> > > Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV
> > > sched_clock in CoCo guests with trusted TSC" because pulling the rug out from
> > > under kvmclock leads to the same problem.
> > >
> > > The whole PV sched_clock scheme is a disaster.
> > >
> > > Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks,
> > > because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for
> > > sched_clock.  And the code is all kinds of funky, because it tries to keep the
> > > x86 code isolated from the generic HV clock code, but (a) there's already x86 PV
> > > specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code
> > > means that Hyper-V overides the sched_clock save/restore hooks even when
> > > PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock.
> >
> > Regarding (a), the one occurrence of x86 PV-specific code hyperv_timer.c is
> > the call to paravirt_set_sched_clock(), and it's under an #ifdef sequence so that
> > it's not built if targeting some other architecture. Or do you see something else
> > that is x86-specific?
> >
> > Regarding (b), in drivers/hv/Kconfig, CONFIG_HYPERV always selects PARAVIRT.
> > So the #else clause (where PARAVIRT=n) in that #ifdef sequence could arguably
> > have a BUILD_BUG() added. If I recall correctly, other Hyper-V stuff breaks if
> > PARAVIRT is forced to "n". So I don't think there's a current problem with the
> > sched_clock save/restore hooks. i
> 
> Oh, there are no build issues, and all of the x86 bits are nicely cordoned off.
> My complaint is essentially that they're _too_ isolated; putting the sched_clock
> save/restore setup in arch/x86/kernel/cpu/mshyperv.c is well-intentioned, but IMO
> it does more harm than good because the split makes it difficult to connect the
> dots to hv_setup_sched_clock() in drivers/clocksource/hyperv_timer.c.
> 
> > But I would be good with some restructuring so that setting the sched clock
> > save/restore hooks is more closely tied to the sched clock choice,
> 
> Yeah, this is the intent of my ranting.  After the dust settles, the code can
> look like this.

I'm good with what you are proposing. And if you want, there's no real need
for hv_ref_counter_at_suspend and hv_save/restore_sched_clock_state()
to be in the #ifdef sequence since the code has no architecture dependencies.
Sure, the only current caller is x86-specific, but the functionality is generic
and might useful on some other architecture in the future. That was the
essence of my comment on the original patch that added this code [1].
The patch author took it a step further and moved all the code into an
x86-specific module, which I was OK with at the time. But your moving
it back is probably better.

[1] https://lore.kernel.org/all/SN6PR02MB4157141DD58FD6EAE96C7CABD4992@SN6PR02MB4157.namprd02.prod.outlook.com/

> 
> ---
> #ifdef CONFIG_GENERIC_SCHED_CLOCK
> static __always_inline void hv_setup_sched_clock(void *sched_clock)
> {
> 	/*
> 	 * We're on an architecture with generic sched clock (not x86/x64).
> 	 * The Hyper-V sched clock read function returns nanoseconds, not
> 	 * the normal 100ns units of the Hyper-V synthetic clock.
> 	 */
> 	sched_clock_register(sched_clock, 64, NSEC_PER_SEC);
> }
> #elif defined CONFIG_PARAVIRT
> static u64 hv_ref_counter_at_suspend;
> /*
>  * Hyper-V clock counter resets during hibernation. Save and restore clock
>  * offset during suspend/resume, while also considering the time passed
>  * before suspend. This is to make sure that sched_clock using hv tsc page
>  * based clocksource, proceeds from where it left off during suspend and
>  * it shows correct time for the timestamps of kernel messages after resume.
>  */
> static void hv_save_sched_clock_state(void)
> {
> 	hv_ref_counter_at_suspend = hv_read_reference_counter();
> }
> 
> static void hv_restore_sched_clock_state(void)
> {
> 	/*
> 	 * Adjust the offsets used by hv tsc clocksource to
> 	 * account for the time spent before hibernation.
> 	 * adjusted value = reference counter (time) at suspend
> 	 *                - reference counter (time) now.
> 	 */
> 	hv_sched_clock_offset -= (hv_ref_counter_at_suspend -
> hv_read_reference_counter());
> }
> 
> static __always_inline void hv_setup_sched_clock(void *sched_clock)
> {
> 	/* We're on x86/x64 *and* using PV ops */
> 	paravirt_set_sched_clock(sched_clock, hv_save_sched_clock_state,
> 				 hv_restore_sched_clock_state);
> }
> #else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
> static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
> #endif /* CONFIG_GENERIC_SCHED_CLOCK */
> ---
> 
> > as long as the architecture independence of hyperv_timer.c is preserved.
> 
> LOL, ah yes, the architecture independence of MSRs and TSC :-)

This is a digression from what you are trying to accomplish, but the function
and symbols names reflect history and are misleading. The terms "TSC" and
"MSR" are used, but arch-specific wrapper functions map "TSC" to the arm64
architectural system counter, for example. And the MSR references are all to
Hyper-V synthetic MSRs, which are mapped to the arm64 equivalent
registers and are accessed via explicit hypercalls instead of rdmsr()/wrmsr(). I
wish we had better terminology to use in the generic code.

> 
> Teasing aside, the code is firmly x86-only at the moment.  It's selectable only
> by x86:
> 
>   config HYPERV_TIMER
> 	def_bool HYPERV && X86
> 
> and since at least commit e39acc37db34 ("clocksource: hyper-v: Provide noinstr
> sched_clock()") there are references to symbols/functions that are provided only
> by x86.
> 
> I assume arm64 support is a WIP, but keeping the upstream code arch independent
> isn't very realistic if the code can't be at least compile-tested.  To help
> drive-by contributors like myself, maybe select HYPER_TIMER on arm64 for
> COMPILE_TEST=y builds?

Ah yes. I think I missed that commit e39acc37db34 added hv_raw_get_register(),
which doesn't have an arm64 equivalent. I had not previously known about
COMPILE_TEST=y. Thanks for pointing that out, and I'll check into using it.

Michael

> 
>   config HYPERV_TIMER
> 	def_bool HYPERV && (X86 || (COMPILE_TEST && ARM64))
> 
> I have no plans to touch code outside of CONFIG_PARAVIRT, i.e. outside of code
> that is explicitly x86-only, but something along those lines would help people
> like me understand the goal/intent, and in theory would also help y'all maintain
> the code by detecting breakage.
Re: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
Posted by Sean Christopherson 4 weeks ago
On Wed, Feb 12, 2025, Michael Kelley wrote:
> From: Sean Christopherson <seanjc@google.com> Sent: Monday, February 10, 2025 8:22 AM
> > On Sat, Feb 08, 2025, Michael Kelley wrote:
> > > But I would be good with some restructuring so that setting the sched clock
> > > save/restore hooks is more closely tied to the sched clock choice,
> > 
> > Yeah, this is the intent of my ranting.  After the dust settles, the code can
> > look like this.
> 
> I'm good with what you are proposing. And if you want, there's no real need
> for hv_ref_counter_at_suspend and hv_save/restore_sched_clock_state()
> to be in the #ifdef sequence since the code has no architecture dependencies.

Right, but because they will be local/static and there are no users outside of
x86, the compiler will complain about unused variables/functions on other
architectures.