[PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 1 month ago
Callers may pass in TSC values from before the local TSC stamp was last
updated (this would in particular be the case when the TSC was latched, a
time rendezvous occurs, and the latched value is used only afterwards).
scale_delta(), otoh, deals with unsigned values, and hence would treat
negative incoming deltas as huge positive values, yielding entirely bogus
return values.

Fixes: 88e64cb785c1 ("x86/HVM: use fixed TSC value when saving or restoring domain")
Reported-by: Антон Марков <akmarkov45@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
An alternative might be to have scale_delta() deal with signed deltas, yet
that seemed more risky to me.

There could likely be more Fixes: tags; the one used is the oldest
applicable one, from what I can tell.

A similar issue looks to exist in read_xen_timer() and its read_cycle()
helper, if we're scheduled out (and back in) between reading of the TSC
and calculation of the delta (involving ->tsc_timestamp). Am I
overlooking anything there?

stime2tsc() guards against negative deltas by using 0 instead; I'm not
quite sure that's correct either.

amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
comment towards the TSC being "sane", but is that correct? Due to
TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
calling tsc_ticks2ns()?

A similar issue looks to exist in tsc_get_info(), again when rdtsc()
possibly returns a huge value due to TSC_ADJUST. Once again I wonder
whether we shouldn't subtract boot_tsc_stamp.

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc)
         tsc = at_tsc;
     else
         tsc = rdtsc_ordered();
-    delta = tsc - t->stamp.local_tsc;
-    return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
+
+    if ( tsc >= t->stamp.local_tsc )
+        delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
+    else
+        delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale);
+
+    return t->stamp.local_stime + delta;
 }
 
 s_time_t get_s_time(void)
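
To illustrate the failure mode the patch addresses, here is a minimal
userspace sketch (illustrative only, not part of the patch; the values are
made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t stamp_tsc = 1000100; /* local TSC stamp, just moved forward */
    uint64_t at_tsc    = 1000000; /* latched before the rendezvous */

    /* Unsigned subtraction wraps: the "delta" becomes huge, not -100. */
    uint64_t delta = at_tsc - stamp_tsc;
    printf("delta = %llu\n", (unsigned long long)delta);

    /* The patched form scales the magnitude and re-applies the sign. */
    int64_t signed_delta = at_tsc >= stamp_tsc
                           ? (int64_t)(at_tsc - stamp_tsc)
                           : -(int64_t)(stamp_tsc - at_tsc);
    printf("signed delta = %lld\n", (long long)signed_delta);
    return 0;
}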


Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Roger Pau Monné 3 weeks, 4 days ago
On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
> Callers may pass in TSC values from before the local TSC stamp was last
> updated (this would in particular be the case when the TSC was latched, a
> time rendezvous occurs, and the latched value is used only afterwards).
> scale_delta(), otoh, deals with unsigned values, and hence would treat
> negative incoming deltas as huge positive values, yielding entirely bogus
> return values.
> 
> Fixes: 88e64cb785c1 ("x86/HVM: use fixed TSC value when saving or restoring domain")
> Reported-by: Антон Марков <akmarkov45@gmail.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> An alternative might be to have scale_delta() deal with signed deltas, yet
> that seemed more risky to me.

I wouldn't go that route; the caller can figure out the signedness of the
result as needed.

> There could likely be more Fixes: tags; the one used is the oldest
> applicable one, from what I can tell.
> 
> A similar issue looks to exist in read_xen_timer() and its read_cycle()
> helper, if we're scheduled out (and back in) between reading of the TSC
> and calculation of the delta (involving ->tsc_timestamp). Am I
> overlooking anything there?

If we are scheduled out in the middle, and the ->tsc_timestamp is
updated, ->version would also be bumped, and hence the loop will be
restarted.  I don't think there's an issue there.
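
A simplified sketch of the retry pattern being described (field names follow
the public vcpu_time_info layout; this is illustrative, not the actual
read_cycle() code, scale_delta() scaling is omitted for brevity, and
rdtsc_ordered() is Xen's serialising TSC read):

struct vcpu_time_info_like {
    uint32_t version;       /* odd while the producer is mid-update */
    uint64_t tsc_timestamp;
    uint64_t system_time;
};

static uint64_t read_cycles(const volatile struct vcpu_time_info_like *src)
{
    uint32_t version;
    uint64_t cycles;

    do {
        version = src->version;
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        /* If we are scheduled out here and ->tsc_timestamp moves, ... */
        cycles = src->system_time + (rdtsc_ordered() - src->tsc_timestamp);
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
    } while ( (version & 1) || version != src->version );
    /* ... ->version will have changed too, so the bogus delta is
     * discarded and the read is retried. */

    return cycles;
}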

> stime2tsc() guards against negative deltas by using 0 instead; I'm not
> quite sure that's correct either.

Hm, we should likely do the same for stime2tsc() that you do for
get_s_time_fixed().  Given the current callers I think we might be
safe, but it's a risk.

> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
> comment towards the TSC being "sane", but is that correct? Due to
> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
> calling tsc_ticks2ns()?

amd_check_erratum_1474() runs after early_time_init(), which would
have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
initcall to that effect:

/*
 * Must be executed after early_time_init() for tsc_ticks2ns() to have been
 * calibrated.  That prevents us doing the check in init_amd().
 */
presmp_initcall(amd_check_erratum_1474);

> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
> whether we shouldn't subtract boot_tsc_stamp.

I would expect tsc_get_info() to also get called exclusively after
early_time_init()?

> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc)
>          tsc = at_tsc;
>      else
>          tsc = rdtsc_ordered();
> -    delta = tsc - t->stamp.local_tsc;
> -    return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
> +
> +    if ( tsc >= t->stamp.local_tsc )

Should we hint the compiler this is the likely path?

> +        delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
> +    else
> +        delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale);
> +
> +    return t->stamp.local_stime + delta;

LGTM:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

However I see Anton still has concerns that this patch doesn't fully
fix the issue he reported, and I'm afraid I don't really understand
what's going on.  I will have to take a more detailed look at the
thread, or possibly attempt to reproduce myself.

Thanks, Roger.

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 3 weeks, 4 days ago
On 14.01.2026 18:49, Roger Pau Monné wrote:
> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>> A similar issue looks to exist in read_xen_timer() and its read_cycle()
>> helper, if we're scheduled out (and back in) between reading of the TSC
>> and calculation of the delta (involving ->tsc_timestamp). Am I
>> overlooking anything there?
> 
> If we are scheduled out in the middle, and the ->tsc_timestamp is
> updated, ->version would also be bumped, and hence the loop will be
> restarted.  I don't think there's an issue there.

Oh, indeed - I was too focused on the read_cycle() alone. That may go
wrong, but as you say the result then simply won't be used.

>> stime2tsc() guards against negative deltas by using 0 instead; I'm not
>> quite sure that's correct either.
> 
> Hm, we should likely do the same for stime2tsc() that you do for
> get_s_time_fixed().  Given the current callers I think we might be
> safe, but it's a risk.

Will do then.

>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
>> comment towards the TSC being "sane", but is that correct? Due to
>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
>> calling tsc_ticks2ns()?
> 
> amd_check_erratum_1474() runs after early_time_init(), which would
> have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
> initcall to that effect:
> 
> /*
>  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
>  * calibrated.  That prevents us doing the check in init_amd().
>  */
> presmp_initcall(amd_check_erratum_1474);

Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
have played other games with MSR_TSC.

>> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
>> whether we shouldn't subtract boot_tsc_stamp.
> 
> I would expect tsc_get_info() to also get called exclusively after
> early_time_init()?

Same here then (obviously).

>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc)
>>          tsc = at_tsc;
>>      else
>>          tsc = rdtsc_ordered();
>> -    delta = tsc - t->stamp.local_tsc;
>> -    return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
>> +
>> +    if ( tsc >= t->stamp.local_tsc )
> 
> Should we hint the compiler this is the likely path?

I was wondering too, but was erring on the side of "no" primarily because
of Andrew's general preference towards no likely(). I'd be happy to add it
here.
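
For reference, a sketch of the hinted form, using Xen's likely():

    if ( likely(tsc >= t->stamp.local_tsc) )
        delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
    else
        delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale);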

>> +        delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
>> +    else
>> +        delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale);
>> +
>> +    return t->stamp.local_stime + delta;
> 
> LGTM:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> However I see Anton still has concerns that this patch doesn't fully
> fix the issue he reported, and I'm afraid I don't really understand
> what's going on.  I will have to take a more detailed look at the
> thread, or possibly attempt to reproduce myself.

Largely the same here.

Jan

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 2 weeks, 6 days ago
On 15.01.2026 09:00, Jan Beulich wrote:
> On 14.01.2026 18:49, Roger Pau Monné wrote:
>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>>> stime2tsc() guards against negative deltas by using 0 instead; I'm not
>>> quite sure that's correct either.
>>
>> Hm, we should likely do the same for stime2tsc() that you do for
>> get_s_time_fixed().  Given the current callers I think we might be
>> safe, but it's a risk.
> 
> Will do then.

While doing so, I came to wonder if there isn't a reason for this "capping".
In local_time_calibration() we also have

    /* Local time warps forward if it lags behind master time. */
    if ( curr.local_stime < curr.master_stime )
        curr.local_stime = curr.master_stime;

Which for the use of stime2tsc() in cstate_restore_tsc() might mean that
indeed there is a worry of the delta being negative, and the desire to
"warp forward" in that case.

Whereas for the other use in reprogram_timer() capping at 0 isn't overly
useful. By the time the value is used, it is likely in the past anyway.

Jan

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 2 weeks, 6 days ago
On 20.01.2026 09:50, Jan Beulich wrote:
> On 15.01.2026 09:00, Jan Beulich wrote:
>> On 14.01.2026 18:49, Roger Pau Monné wrote:
>>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>>>> stime2tsc() guards against negative deltas by using 0 instead; I'm not
>>>> quite sure that's correct either.
>>>
>>> Hm, we should likely do the same for stime2tsc() that you do for
>>> get_s_time_fixed().  Given the current callers I think we might be
>>> safe, but it's a risk.
>>
>> Will do then.
> 
> While doing so, I came to wonder if there isn't a reason for this "capping".
> In local_time_calibration() we also have
> 
>     /* Local time warps forward if it lags behind master time. */
>     if ( curr.local_stime < curr.master_stime )
>         curr.local_stime = curr.master_stime;
> 
> Which for the use of stime2tsc() in cstate_restore_tsc() might mean that
> indeed there is a worry of the delta being negative, and the desire to
> "warp forward" in that case.

Proposed new function implementation (easier to look at than the diff):

uint64_t stime2tsc(s_time_t stime)
{
    const struct cpu_time *t = &this_cpu(cpu_time);
    s_time_t stime_delta = stime - t->stamp.local_stime;
    int64_t delta = 0;

    /*
     * While for reprogram_timer() the capping at 0 isn't relevant (the returned
     * value is likely in the past anyway then, by the time it is used), for
     * cstate_restore_tsc() it is meaningful: We need to avoid moving the TSC
     * backwards (relative to when it may last have been read).
     */
    if ( stime_delta > 0 )
    {
        struct time_scale sys_to_tsc = scale_reciprocal(t->tsc_scale);

        delta = scale_delta(stime_delta, &sys_to_tsc);
    }

    return t->stamp.local_tsc + delta;
}

Jan

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Roger Pau Monné 3 weeks, 4 days ago
On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote:
> On 14.01.2026 18:49, Roger Pau Monné wrote:
> > On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
> >> A similar issue looks to exist in read_xen_timer() and its read_cycle()
> >> helper, if we're scheduled out (and beck in) between reading of the TSC
> >> and calculation of the delta (involving ->tsc_timestamp). Am I
> >> overlooking anything there?
> > 
> > If we are scheduled out in the middle, and the ->tsc_timestamp is
> > updated, ->version would also be bumped, and hence the loop will be
> > restarted.  I don't think there's an issue there.
> 
> Oh, indeed - I was too focused on the read_cycle() alone. That may go
> wrong, but as you say the result then simply won't be used.
> 
> >> stime2tsc() guards against negative deltas by using 0 instead; I'm not
> >> quite sure that's correct either.
> > 
> > Hm, we should likely do the same for stime2tsc() that you do for
> > get_s_time_fixed().  Given the current callers I think we might be
> > safe, but it's a risk.
> 
> Will do then.
> 
> >> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
> >> comment towards the TSC being "sane", but is that correct? Due to
> >> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
> >> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
> >> calling tsc_ticks2ns()?
> > 
> > amd_check_erratum_1474() runs after early_time_init(), which would
> > have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
> > initcall to that effect:
> > 
> > /*
> >  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
> >  * calibrated.  That prevents us doing the check in init_amd().
> >  */
> > presmp_initcall(amd_check_erratum_1474);
> 
> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
> have played other games with MSR_TSC.

For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp,
otherwise when kexec'ed we won't be accounting properly for the time
since host startup, as subtracting boot_tsc_stamp would remove any
time consumed by a previously run OS.

> >> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
> >> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
> >> whether we shouldn't subtract boot_tsc_stamp.
> > 
> > I would expect tsc_get_info() to also get called exclusively after
> > early_time_init()?
> 
> Same here then (obviously).

For tsc_get_info() I think you are worried that the TSC might
overflow, and hence the calculation in scale_delta() would then be
skewed.  We must have other instances of this pattern, however; what
about get_s_time_fixed()? I think it would also be affected.

Or maybe I'm not understanding the concern.  Given the proposed
scale_delta() logic, it won't be possible to distinguish rdtsc
overflowing from a value in the past.

Thanks, Roger.

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 3 weeks, 4 days ago
On 15.01.2026 09:24, Roger Pau Monné wrote:
> On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote:
>> On 14.01.2026 18:49, Roger Pau Monné wrote:
>>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>>>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
>>>> comment towards the TSC being "sane", but is that correct? Due to
>>>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
>>>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
>>>> calling tsc_ticks2ns()?
>>>
>>> amd_check_erratum_1474() runs after early_time_init(), which would
>>> have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
>>> initcall to that effect:
>>>
>>> /*
>>>  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
>>>  * calibrated.  That prevents us doing the check in init_amd().
>>>  */
>>> presmp_initcall(amd_check_erratum_1474);
>>
>> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
>> have played other games with MSR_TSC.
> 
> For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp,
> otherwise when kexec'ed we won't be accounting properly for the time
> since host startup, as subtracting boot_tsc_stamp would remove any
> time consumed by a previously run OS.

For both this and ...

>>>> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
>>>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
>>>> whether we shouldn't subtract boot_tsc_stamp.
>>>
>>> I would expect tsc_get_info() to also get called exclusively after
>>> early_time_init()?
>>
>> Same here then (obviously).
> 
> For tsc_get_info() I think you are worried that the TSC might
> overflow, and hence the calculation in scale_delta() would then be
> skewed.  We must have other instances of this pattern however, what
> about get_s_time_fixed(), I think it would also be affected?
> 
> Or maybe I'm not understanding the concern.  Given the proposed
> scale_delta() logic, it won't be possible to distinguish rdtsc
> overflowing from a value in the past.

... this, my main point really is that scale_delta() (as its name says),
and hence also tsc_ticks2ns(), shouldn't be used on absolute counts, but
only deltas. (Yes, an absolute count can be viewed as delta from 0, but
that's correct only if we know the TSC started counting from 0 and was
never adjusted by some bias.)
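
A sketch of the delta-only usage being argued for (illustrative;
boot_tsc_stamp holds the TSC value recorded at boot):

    /* Scale elapsed ticks rather than the absolute counter, so that any
     * firmware-applied bias (TSC_ADJUST or a plain MSR_TSC write) cancels. */
    uint64_t since_boot = rdtsc() - boot_tsc_stamp;
    uint64_t ns = tsc_ticks2ns(since_boot);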

Jan

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Roger Pau Monné 3 weeks, 4 days ago
On Thu, Jan 15, 2026 at 11:38:10AM +0100, Jan Beulich wrote:
> On 15.01.2026 09:24, Roger Pau Monné wrote:
> > On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote:
> >> On 14.01.2026 18:49, Roger Pau Monné wrote:
> >>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
> >>>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
> >>>> comment towards the TSC being "sane", but is that correct? Due to
> >>>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
> >>>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
> >>>> calling tsc_ticks2ns()?
> >>>
> >>> amd_check_erratum_1474() runs after early_time_init(), which would
> >>> have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
> >>> initcall to that effect:
> >>>
> >>> /*
> >>>  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
> >>>  * calibrated.  That prevents us doing the check in init_amd().
> >>>  */
> >>> presmp_initcall(amd_check_erratum_1474);
> >>
> >> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
> >> have played other games with MSR_TSC.
> > 
> > For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp,
> > otherwise when kexec'ed we won't be accounting properly for the time
> > since host startup, as subtracting boot_tsc_stamp would remove any
> > time consumed by a previously run OS.
> 
> For both this and ...
> 
> >>>> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
> >>>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
> >>>> whether we shouldn't subtract boot_tsc_stamp.
> >>>
> >>> I would expect tsc_get_info() to also get called exclusively after
> >>> early_time_init()?
> >>
> >> Same here then (obviously).
> > 
> > For tsc_get_info() I think you are worried that the TSC might
> > overflow, and hence the calculation in scale_delta() would then be
> > skewed.  We must have other instances of this pattern however, what
> > about get_s_time_fixed(), I think it would also be affected?
> > 
> > Or maybe I'm not understanding the concern.  Given the proposed
> > scale_delta() logic, it won't be possible to distinguish rdtsc
> > overflowing from a value in the past.
> 
> ... this, my main point really is that scale_delta() (as its name says),
> and hence also tsc_ticks2ns(), shouldn't be used on absolute counts, but
> only deltas. (Yes, an absolute count can be viewed as delta from 0, but
> that's correct only if we know the TSC started counting from 0 and was
> never adjusted by some bias.)

Well amd_check_erratum_1474() does want the delta from 0 to the
current TSC, because that's the best? way to see when C6 needs to be
disabled.  Otherwise we just straight disable C6 on boot on affected
systems.

IMO the problem is not about using tsc_ticks2ns() or scale_delta()
with absolute or delta values; the issue is how the "delta" passed to
those functions is calculated.

Thanks, Roger.

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 3 weeks, 4 days ago
On 15.01.2026 12:48, Roger Pau Monné wrote:
> On Thu, Jan 15, 2026 at 11:38:10AM +0100, Jan Beulich wrote:
>> On 15.01.2026 09:24, Roger Pau Monné wrote:
>>> On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote:
>>>> On 14.01.2026 18:49, Roger Pau Monné wrote:
>>>>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>>>>>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
>>>>>> comment towards the TSC being "sane", but is that correct? Due to
>>>>>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
>>>>>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
>>>>>> calling tsc_ticks2ns()?
>>>>>
>>>>> amd_check_erratum_1474() runs after early_time_init(), which would
>>>>> have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
>>>>> initcall to that effect:
>>>>>
>>>>> /*
>>>>>  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
>>>>>  * calibrated.  That prevents us doing the check in init_amd().
>>>>>  */
>>>>> presmp_initcall(amd_check_erratum_1474);
>>>>
>>>> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
>>>> have played other games with MSR_TSC.
>>>
>>> For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp,
>>> otherwise when kexec'ed we won't be accounting properly for the time
>>> since host startup, as subtracting boot_tsc_stamp would remove any
>>> time consumed by a previously run OS.
>>
>> For both this and ...
>>
>>>>>> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
>>>>>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
>>>>>> whether we shouldn't subtract boot_tsc_stamp.
>>>>>
>>>>> I would expect tsc_get_info() to also get called exclusively after
>>>>> early_time_init()?
>>>>
>>>> Same here then (obviously).
>>>
>>> For tsc_get_info() I think you are worried that the TSC might
>>> overflow, and hence the calculation in scale_delta() would then be
>>> skewed.  We must have other instances of this pattern however, what
>>> about get_s_time_fixed(), I think it would also be affected?
>>>
>>> Or maybe I'm not understanding the concern.  Given the proposed
>>> scale_delta() logic, it won't be possible to distinguish rdtsc
>>> overflowing from a value in the past.
>>
>> ... this, my main point really is that scale_delta() (as its name says),
>> and hence also tsc_ticks2ns(), shouldn't be used on absolute counts, but
>> only deltas. (Yes, an absolute count can be viewed as delta from 0, but
>> that's correct only if we know the TSC started counting from 0 and was
>> never adjusted by some bias.)
> 
> Well amd_check_erratum_1474() does want the delta from 0 to the
> current TSC, because that's the best? way to see when C6 needs to be
> disabled.  Otherwise we just straight disable C6 on boot on affected
> systems.

I think that may be necessary when we don't know what was done to the TSC
before we took control of the system.

Jan

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Anton Markov 3 weeks, 4 days ago
Sorry. I mixed several threads into one. It wasn't immediately clear that
the second bug isn't related to handling negative values in scale_delta. It's
related to gradual time desynchronization between cores, so they should
probably be addressed separately.

Thanks.
Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Антон Марков 1 month ago
Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
(xen/arch/x86/hvm/hvm.c), it was easy to catch because
process_pending_softirqs is frequently called there, which in turn
processes softirqs from the timer (where the timestamp is updated).
After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
reproducing under no load. However, when the number of vCPUs is 4 times
greater than the number of CPUs (under heavy load), the problem rarely
recurs (mostly during snapshot restores, during
process_pending_softirqs calls), and this is no longer a simple case. If
get_s_time_fixed can indeed be interrupted during execution after
rdtsc_ordered, then the current fix is insufficient. It's necessary to
atomically copy "t->stamp" to the stack using local_irq_disable and
local_irq_enable (as in local_time_calibration), and then work with the
copy, confident in its lifetime and immutability. But until
get_s_time_fixed is proven to be interruptible, this is premature, so
your fix is sufficient. I think I need more information and testing to
say more.
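
A sketch of the snapshot pattern described here, mirroring
local_time_calibration() (as the reply below notes, cpu_time isn't updated
from IRQ context, so this turns out to be unnecessary):

    struct cpu_time_stamp stamp;

    /* Take a private copy of the stamp with IRQs off ... */
    local_irq_disable();
    stamp = t->stamp;
    local_irq_enable();

    /* ... then compute solely from the copy. */
    return stamp.local_stime + scale_delta(tsc - stamp.local_tsc, &t->tsc_scale);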

Regarding the other scale_delta calls, if they include values
calculated from externally saved tsc values that could have become
stale during the process_pending_softirqs call, this definitely needs to
be fixed.

Here's the problematic code from dwm.exe, if that helps:

void __fastcall CPartitionVerticalBlankScheduler::UpdateCurrentTime(CPartitionVerticalBlankScheduler *this) {
    uint64_t *xor_time_ptr = (uint64_t *)((char *)this + 0x3E40);
    CPartitionVerticalBlankScheduler *scheduler = this;
    LARGE_INTEGER *current_time_ptr = (LARGE_INTEGER *)((char *)this + 0x3E30);
    uint64_t previous_time = *((_QWORD *)this + 0x7C6); // 0x3e30
    uint64_t address_high = ((_QWORD)this + 0x3E40) << 32;
    uint64_t xor_key = ((uint64_t)this + 0x3E40) | address_high;

    // validate internal state
    if ((previous_time ^ xor_key) != *((_QWORD *)this + 0x7C8)) { // 0x3e40
        char exception_record[0x98];
        memset(exception_record, 0, sizeof(exception_record));
        *(int *)exception_record = 0x88980080; // MILERR_INTERNALERROR

        uint64_t computed_xor = *xor_time_ptr ^ ((uint64_t)xor_time_ptr | address_high);
        int param_count = 4;
        int previous_time_high = (int)(previous_time >> 32);
        uint32_t previous_time_low = (uint32_t)previous_time;
        int computed_xor_high = (int)(computed_xor >> 32);
        uint32_t computed_xor_low = (uint32_t)computed_xor;

        RaiseFailFastException((PEXCEPTION_RECORD)exception_record, nullptr, 0);
        previous_time = *((_QWORD *)scheduler + 1990);
    }

    // get current timestamp
    *((_QWORD *)scheduler + 0x7C7) = previous_time; // 0x3e38
    QueryPerformanceCounter(current_time_ptr);

    LARGE_INTEGER new_time = *current_time_ptr;
    uint64_t last_previous_time = *((_QWORD *)scheduler + 0x7C7); // 0x3e38

    // check if time goes backward
    if (new_time.QuadPart < last_previous_time) {
        char exception_record[0x98];
        memset(exception_record, 0, sizeof(exception_record));
        *(int *)exception_record = 0x8898009B; // MILERR_QPC_TIME_WENT_BACKWARD

        int new_time_high = new_time.HighPart;
        uint32_t new_time_low = new_time.LowPart;
        int last_previous_high = (int)(last_previous_time >> 32);
        uint32_t last_previous_low = (uint32_t)last_previous_time;
        int frequency_high = g_qpcFrequency.HighPart;
        uint32_t frequency_low = g_qpcFrequency.LowPart;

        int64_t time_delta = last_previous_time - new_time.QuadPart;
        int64_t millis_delta = (1000 * time_delta) / g_qpcFrequency.QuadPart;
        int millis_high = (int)(millis_delta >> 32);
        uint32_t millis_low = (uint32_t)millis_delta;

        int param_count = 8;

        RaiseFailFastException((PEXCEPTION_RECORD)exception_record, nullptr, 0);
        new_time = *(LARGE_INTEGER *)((char *)scheduler + 15920);
    }

    *xor_time_ptr = new_time.QuadPart ^ xor_key;
}

Thanks.

06.01.2026 16:58, Jan Beulich wrote:
> Callers may pass in TSC values from before the local TSC stamp was last
> updated (this would in particular be the case when the TSC was latched, a
> time rendezvous occurs, and the latched value is used only afterwards).
> scale_delta(), otoh, deals with unsigned values, and hence would treat
> negative incoming deltas as huge positive values, yielding entirely bogus
> return values.
>
> Fixes: 88e64cb785c1 ("x86/HVM: use fixed TSC value when saving or restoring domain")
> Reported-by: Антон Марков <akmarkov45@gmail.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> An alternative might be to have scale_delta() deal with signed deltas, yet
> that seemed more risky to me.
>
> There could likely be more Fixes: tags; the one used is the oldest
> applicable one, from what I can tell.
>
> A similar issue looks to exist in read_xen_timer() and its read_cycle()
> helper, if we're scheduled out (and back in) between reading of the TSC
> and calculation of the delta (involving ->tsc_timestamp). Am I
> overlooking anything there?
>
> stime2tsc() guards against negative deltas by using 0 instead; I'm not
> quite sure that's correct either.
>
> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
> comment towards the TSC being "sane", but is that correct? Due to
> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
> calling tsc_ticks2ns()?
>
> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
> whether we shouldn't subtract boot_tsc_stamp.
>
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc)
>           tsc = at_tsc;
>       else
>           tsc = rdtsc_ordered();
> -    delta = tsc - t->stamp.local_tsc;
> -    return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
> +
> +    if ( tsc >= t->stamp.local_tsc )
> +        delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
> +    else
> +        delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale);
> +
> +    return t->stamp.local_stime + delta;
>   }
>   
>   s_time_t get_s_time(void)
>
Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 1 month ago
On 06.01.2026 21:10, Антон Марков wrote:
> Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
> (xen/arch/x86/hvm/hvm.c), it was easy to catch because
> process_pending_softirqs is frequently called there, which in turn
> processes softirqs from the timer (where the timestamp is updated).
> After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
> reproducing under no load. However, when the number of vCPUs is 4 times
> greater than the number of CPUs (under heavy load), the problem rarely
> recurs (mostly during snapshot restores, during
> process_pending_softirqs calls), and this is no longer a simple case. If
> get_s_time_fixed can indeed be interrupted during execution after
> rdtsc_ordered, then the current fix is insufficient. It's necessary to
> atomically copy "t->stamp" to the stack using local_irq_disable and
> local_irq_enable (as in local_time_calibration), and then work with the
> copy, confident in its lifetime and immutability. But until
> get_s_time_fixed is proven to be interruptible, this is premature, so
> your fix is sufficient. I think I need more information and testing to
> say more.

While the cpu_calibration per-CPU variable is updated from IRQ context,
the cpu_time one isn't. Hence t->stamp's contents cannot change behind
the back of get_s_time_fixed(). I wonder whether ...

> Regarding the other scale_delta calls, if they include values
> calculated from externally saved tsc values that could have become
> stale during the process_pending_softirqs call, this definitely needs to
> be fixed.

... another similar issue (possibly one not included in the set of
remarks I have in the patch, as none of those look related to what you
describe) might be causing the remaining, more rare problems you say you
see. That set of remarks is actually a result of me going over all other
scale_delta() calls, but of course I may have got the analysis wrong.

As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
load) - while I don't think we have a support statement for such upstream
(when probably we should), iirc for our (SUSE's) products we would
consider that unsupported. Just fyi.

Also, btw, please don't top-post.

Jan

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Anton Markov 1 month ago
You're right. This isn't interruption inside get_s_time_fixed, but rather
a cumulative error in the sequence due to integer rounding. I added
logging of the current local_stime to local_time_calibration and noticed
that the timestamp difference between cores is gradually increasing. If
the server has been running for weeks, this could be a very large value.


```
@@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)
     if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
     {
         /* Atomically read cpu_calibration struct and write cpu_time struct. */
+        printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n", smp_processor_id(), t->stamp.local_stime, c->local_stime,
+               t->last_seen_ns, t->last_seen_tsc);
         local_irq_disable();
         t->stamp = *c;
         local_irq_enable();
```


2 hours of work:


```
(XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 (8565145862216, 0)
(XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 (8565145863957, 0)
(XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 (8565145864800, 0)
(XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 (8565145865372, 0)

8565145861609 - 8565145861491 = 118 ns * 3 (3.00 GHz) = 354 ticks lag
```


6 hours of work:

```
(XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 (22915730870665, 0)
(XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 (22915730870693, 0)
(XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 (22915730872231, 0)
(XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 (22915730872096, 0)

22915730869993 - 22915730869696 = 297 ns * 3 (3.00 GHz) = 891 ticks lag
```


For clarification, my Xen command line settings:

```
ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0
force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1
hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512
cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0
sched_credit_tslice_ms=5 sched_ratelimit_us=500
```


Processors I tested on:


```
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz

Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16
sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault
fsgsbase erms xsaveopt arch_capabilities
```

```
Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA)

Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16
sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f
avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw
avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes
vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear
arch_capabilities
```


Next I moved the code to ring 3 (userspace) to speed up reproduction:


```
#include <stdint.h>
#include <stdio.h>

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

#define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL))

struct cpu_time_stamp {
    uint64_t local_tsc;
    int64_t local_stime;
    int64_t master_stime;
};

struct time_scale {
    int shift;
    uint32_t mul_frac;
};

static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor)
{
    uint32_t quotient, remainder;

    asm ( "divl %4"
          : "=a" (quotient), "=d" (remainder)
          : "0" (0), "1" (dividend), "r" (divisor) );

    return quotient;
}

void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec)
{
    uint64_t tps64 = ticks_per_sec;
    uint32_t tps32;
    int shift = 0;

    while ( tps64 > (MILLISECS(1000)*2) )
    {
        tps64 >>= 1;
        shift--;
    }

    tps32 = (uint32_t)tps64;
    while ( tps32 <= (uint32_t)MILLISECS(1000) )
    {
        tps32 <<= 1;
        shift++;
    }

    ts->mul_frac = div_frac(MILLISECS(1000), tps32);
    ts->shift = shift;
}

uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)
{
    uint64_t product;

    if ( scale->shift < 0 )
        delta >>= -scale->shift;
    else
        delta <<= scale->shift;

    asm ( "mulq %2 ; shrd $32,%1,%0"
          : "=a" (product), "=d" (delta)
          : "rm" (delta), "0" ((uint64_t)scale->mul_frac) );

    return product;
}

#define _TS_MUL_FRAC_IDENTITY 0x80000000UL

static inline struct time_scale scale_reciprocal(struct time_scale scale)
{
    struct time_scale reciprocal;
    uint32_t dividend;

    dividend = _TS_MUL_FRAC_IDENTITY;
    reciprocal.shift = 1 - scale.shift;
    while ( dividend >= scale.mul_frac )
    {
        dividend >>= 1;
        reciprocal.shift++;
    }

    asm ( "divl %4"
          : "=a" (reciprocal.mul_frac), "=d" (dividend)
          : "0" (0), "1" (dividend), "r" (scale.mul_frac) );

    return reciprocal;
}

int64_t get_s_time_fixed(const struct cpu_time_stamp *t,
                         const struct time_scale *ts, uint64_t at_tsc)
{
    uint64_t tsc, delta;

    if ( at_tsc )
        tsc = at_tsc;
    else
        tsc = rdtsc();
    delta = tsc - t->local_tsc;
    return t->local_stime + scale_delta(delta, ts);
}

int main() {
    struct cpu_time_stamp ct;
    struct time_scale ts;
    struct time_scale back;

    uint64_t start = rdtsc();

    set_time_scale(&ts, 3000000000);
    back = scale_reciprocal(ts);

    ct.local_tsc = start;
    ct.local_stime = scale_delta(start, &ts);

    while (1) {
        uint64_t new_tsc = rdtsc();
        ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc);
        ct.local_tsc = new_tsc;

        uint64_t tmp_tsc = rdtsc();
        printf("%lu %lu\n", tmp_tsc,
               scale_delta(get_s_time_fixed(&ct, &ts, tmp_tsc), &back));
    }

    return 0;
}
```


It's enough to run the loop for 10-20 seconds to see the problem:
scale_delta rounds away the least significant bits during the calculation,
which causes the error to accumulate at different rates on different cores,
depending on the least significant bits at the time of calibration.
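
A condensed illustration of that accumulation (it compiles together with the
scale_delta()/set_time_scale() definitions from the reproducer above; the
step value is made up):

```
int drift_demo(void)
{
    struct time_scale ts;
    set_time_scale(&ts, 3000000000); /* 3.00 GHz, as in the logs above */

    uint64_t tsc = 0, stepped_ns = 0;
    const uint64_t step = 1001; /* arbitrary small tick increment */

    for ( int i = 0; i < 1000000; i++ )
    {
        stepped_ns += scale_delta(step, &ts); /* truncates every step */
        tsc += step;
    }

    /* One conversion of the whole delta truncates only once, so the two
     * results diverge, just as per-CPU stamps diverge across cores. */
    printf("stepped %llu ns vs direct %llu ns\n",
           (unsigned long long)stepped_ns,
           (unsigned long long)scale_delta(tsc, &ts));
    return 0;
}
```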


Now let's talk about why dwm reacts this way. When a snapshot is restored,
last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't
prevent time from flowing backwards. This means that cache_tsc_offset in
hvm_get_guest_tsc_fixed may be configured correctly on one physical core,
but due to scheduling onto a physical core with a lagging TSC, the guest may
occasionally see a TSC value lower than it saw before the snapshot was
restored. If this happens to the dwm code, it terminates with an error.
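
The monotonicity guard in question works roughly like this (a simplified
sketch of the clamp in hvm_get_guest_time_fixed(); the point above is that
last_guest_time being reset to 0 across a snapshot restore defeats it):

```
    /* Keep guest time monotonic w.r.t. previous reads. */
    if ( (int64_t)(now - pl->last_guest_time) > 0 )
        pl->last_guest_time = now;    /* time moved forward: record it */
    else
        now = ++pl->last_guest_time;  /* would go backwards: nudge forward */
```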


A quick solution to this problem might be to save the last_seen_tsc
parameter in a snapshot for each core, followed by validation.


The correct solution is to remove the rounding of the least significant
bits from the sequence.

On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote:

> On 06.01.2026 21:10, Антон Марков wrote:
> > Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
> > (xen/arch/x86/hvm/hvm.c), it was easy to catch because
> > process_pending_softirqs is frequently called there, which in turn
> > processes softirqs from the timer (where the timestamp is updated).
> > After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
> > reproducing under no load. However, when the number of vCPUs is 4 times
> > greater than the number of CPUs (under heavy load), the problem rarely
> > recurs (mostly during snapshot restores, during
> > process_pending_softirqs calls), and this is no longer a simple case. If
> > get_s_time_fixed can indeed be interrupted during execution after
> > rdtsc_ordered, then the current fix is insufficient. It's necessary to
> > atomically copy "t->stamp" to the stack using local_irq_disable and
> > local_irq_enable (as in local_time_calibration), and then work with the
> > copy, confident in its lifetime and immutability. But until
> > get_s_time_fixed is proven to be interruptible, this is premature, so
> > your fix is sufficient. I think I need more information and testing to
> > say more.
>
> While the cpu_calibration per-CPU variable is updated from IRQ context,
> the cpu_time one isn't. Hence t->stamp's contents cannot change behind
> the back of get_s_time_fixed(). I wonder whether ...
>
> > Regarding the other scale_delta calls, if they include values
> > calculated from externally saved tsc values that could have become
> > stale during the process_pending_softirqs call, this definitely needs to
> > be fixed.
>
> ... another similar issue (possibly one not included in the set of
> remarks I have in the patch, as none of those look related to what you
> describe) might be causing the remaining, more rare problems you say you
> see. That set of remarks is actually a result of me going over all other
> scale_delta() calls, but of course I may have got the analysis wrong.
>
> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
> load) - while I don't think we have a support statement for such upstream
> (when probably we should), iirc for our (SUSE's) products we would
> consider that unsupported. Just fyi.
>
> Also, btw, please don't top-post.
>
> Jan
>
Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Anton Markov 4 weeks ago
Bit rounding isn't the main issue; the difference in IPI delivery to the
cores accumulates due to the ordering. Replacing get_s_time_fixed with
scale_delta in time_calibration_rendezvous_tail should be sufficient. That
configuration won't accumulate errors; bit rounding can still cause a
noticeable difference from calibration to calibration, but it's not as
significant.

On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <akmarkov45@gmail.com> wrote:

> You're right. This isn't interruption inside get_s_time_fixed, but rather
> a cumulative error in the sequence due to integer rounding. I added
> logging of the current local_stime to local_time_calibration and noticed
> that the timestamp difference between cores is gradually increasing. If
> the server has been running for weeks, this could be a very large value.
>
> [...]
Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 4 weeks ago
On 12.01.2026 11:31, Anton Markov wrote:
> Bit rounding isn't the main issue; the difference in IPI delivery to the
> cores accumulates due to the ordering. Replacing get_s_time_fixed with
> scale_delta in time_calibration_rendezvous_tail should be sufficient. That
> configuration won't accumulate errors; bit rounding can still cause a
> noticeable difference from calibration to calibration, but it's not as
> significant.

That invocation of get_s_time_fixed() reduces to scale_delta() (without
further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
it's not quite clear to me what change you are suggesting (that would
actually make a functional difference).

Btw, your prior response was too hard to properly read, due to excess blank
lines combined with squashed leading indentation. Together with your
apparent inability to avoid top-posting, I think you really want to adjust
your mail program's configuration.

Jan

> On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <akmarkov45@gmail.com> wrote:
> 
>> You're right. These aren't interrupts in get_s_time_fixed, but rather a
>> cumulative error in the sequence due to integer rounding. I added logging
>> of the current local_stime to local_time_calibration and noticed that the
>> timestamp between cores is gradually increasing. If the server has been
>> running for weeks, this could be a very large value.
>>
>>
>> ```
>>
>> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)
>>
>> if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>>
>> {
>>
>> /* Atomically read cpu_calibration struct and write cpu_time struct. */
>>
>> + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n",
>> smp_processor_id(), t->stamp.local_stime, c->local_stime,
>>
>> + t->last_seen_ns, t->last_seen_tsc);
>>
>> local_irq_disable();
>>
>> t->stamp = *c;
>>
>> local_irq_enable();
>>
>> ```
>>
>>
>> 2 hours of work:
>>
>>
>> ```
>>
>> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597
>> (8565145862216, 0)
>>
>> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609
>> (8565145863957, 0)
>>
>> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491
>> (8565145864800, 0)
>>
>> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609
>> (8565145865372, 0)
>>
>>
>> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag
>>
>> ```
>>
>>
>> 6 hours of work:
>>
>> ```
>>
>> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993
>> (22915730870665, 0)
>>
>> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889
>> (22915730870693, 0)
>>
>> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841
>> (22915730872231, 0)
>>
>> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696
>> (22915730872096, 0)
>>
>>
>> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
>>
>> ```
>>
>>
>> Clarification, according to my xen settings:
>>
>> ```
>>
>> ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0
>> force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1
>> hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512
>> cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0
>> sched_credit_tslice_ms=5 sched_ratelimit_us=500
>>
>> ```
>>
>> Processors I tested on:
>>
>> ```
>> Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
>>
>> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
>> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
>> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16
>> sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault
>> fsgsbase erms xsaveopt arch_capabilities
>> ```
>>
>> ```
>> Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA)
>>
>> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
>> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
>> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16
>> sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
>> 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f
>> avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw
>> avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes
>> vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear
>> arch_capabilities
>> ```
>>
>> Next, I moved the code to ring 3 to speed up reproduction:
>>
>> ```
>> #include <stdint.h>
>> #include <stdio.h>
>>
>> static __inline__ unsigned long long rdtsc(void)
>> {
>>     unsigned hi, lo;
>>     __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
>>     return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
>> }
>>
>> #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL))
>>
>> struct cpu_time_stamp {
>>     uint64_t local_tsc;
>>     int64_t local_stime;
>>     int64_t master_stime;
>> };
>>
>> struct time_scale {
>>     int shift;
>>     uint32_t mul_frac;
>> };
>>
>> static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor)
>> {
>>     uint32_t quotient, remainder;
>>
>>     asm (
>>         "divl %4"
>>         : "=a" (quotient), "=d" (remainder)
>>         : "0" (0), "1" (dividend), "r" (divisor) );
>>
>>     return quotient;
>> }
>>
>> void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec)
>> {
>>     uint64_t tps64 = ticks_per_sec;
>>     uint32_t tps32;
>>     int shift = 0;
>>
>>     while ( tps64 > (MILLISECS(1000)*2) )
>>     {
>>         tps64 >>= 1;
>>         shift--;
>>     }
>>
>>     tps32 = (uint32_t)tps64;
>>     while ( tps32 <= (uint32_t)MILLISECS(1000) )
>>     {
>>         tps32 <<= 1;
>>         shift++;
>>     }
>>
>>     ts->mul_frac = div_frac(MILLISECS(1000), tps32);
>>     ts->shift = shift;
>> }
>>
>> uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)
>> {
>>     uint64_t product;
>>
>>     if ( scale->shift < 0 )
>>         delta >>= -scale->shift;
>>     else
>>         delta <<= scale->shift;
>>
>>     asm (
>>         "mulq %2 ; shrd $32,%1,%0"
>>         : "=a" (product), "=d" (delta)
>>         : "rm" (delta), "0" ((uint64_t)scale->mul_frac) );
>>
>>     return product;
>> }
>>
>> #define _TS_MUL_FRAC_IDENTITY 0x80000000UL
>>
>> static inline struct time_scale scale_reciprocal(struct time_scale scale)
>> {
>>     struct time_scale reciprocal;
>>     uint32_t dividend;
>>
>>     dividend = _TS_MUL_FRAC_IDENTITY;
>>     reciprocal.shift = 1 - scale.shift;
>>     while ( dividend >= scale.mul_frac )
>>     {
>>         dividend >>= 1;
>>         reciprocal.shift++;
>>     }
>>
>>     asm (
>>         "divl %4"
>>         : "=a" (reciprocal.mul_frac), "=d" (dividend)
>>         : "0" (0), "1" (dividend), "r" (scale.mul_frac) );
>>
>>     return reciprocal;
>> }
>>
>> int64_t get_s_time_fixed(const struct cpu_time_stamp *t,
>>                          const struct time_scale *ts, uint64_t at_tsc)
>> {
>>     uint64_t tsc, delta;
>>
>>     if ( at_tsc )
>>         tsc = at_tsc;
>>     else
>>         tsc = rdtsc();
>>     delta = tsc - t->local_tsc;
>>     return t->local_stime + scale_delta(delta, ts);
>> }
>>
>> int main() {
>>     struct cpu_time_stamp ct;
>>     struct time_scale ts;
>>     struct time_scale back;
>>
>>     uint64_t start = rdtsc();
>>
>>     set_time_scale(&ts, 3000000000);
>>     back = scale_reciprocal(ts);
>>
>>     ct.local_tsc = start;
>>     ct.local_stime = scale_delta(start, &ts);
>>
>>     while (1) {
>>         uint64_t new_tsc = rdtsc();
>>
>>         ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc);
>>         ct.local_tsc = new_tsc;
>>
>>         uint64_t tmp_tsc = rdtsc();
>>
>>         printf("%lu %lu\n", tmp_tsc,
>>                scale_delta(get_s_time_fixed(&ct, &ts, tmp_tsc), &back));
>>     }
>>
>>     return 0;
>> }
>> ```
>>
>> It's enough to run the loop for 10-20 seconds to see the problem.
>> scale_delta rounds the least significant bits during calculation, which
>> causes the error to accumulate at different rates on different cores,
>> depending on the least significant bits at the time of calibration.
>>
>> Now let's talk about why dwm reacts this way. When a snapshot is reverted,
>> last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't
>> prevent time from flowing backwards. This means that cache_tsc_offset in
>> hvm_get_guest_tsc_fixed may be configured correctly on one physical core,
>> but after being scheduled onto a physical core with a lagging TSC, the
>> guest may occasionally see a TSC value lower than it saw before the
>> snapshot was reverted. If this happens to the dwm code, it terminates with
>> an error.
>>
>> A quick solution to this problem might be to save the last_seen_tsc
>> parameter in a snapshot for each core, followed by validation.
>>
>> The correct solution is to remove the rounding of the least significant
>> bits from the sequence.
>>
>> On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>>> On 06.01.2026 21:10, Антон Марков wrote:
>>>> Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
>>>> (xen/arch/x86/hvm/hvm.c ), it was easy to catch because
>>>> process_pending_softirqs is frequently called there, which in turn
>>>> processes softirqs from the timer (where the timestamp is updated).
>>>> After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
>>>> reproducing under no load. However, when the number of vCPUs is 4 times
>>>> greater than the number of CPUs (under heavy load), the problem rarely
>>>> reoccurs (mostly during snapshot restores during
>>>> process_pending_softirqs calls), and this is no longer a simple case.
>>>> If
>>>> get_s_time_fixed can indeed be interrupted during execution after
>>>> rdtsc_ordered, then the current fix is insufficient. It's necessary to
>>>> atomically copy "t->stamp" to the stack using local_irq_disable and
>>>> local_irq_enable (as in local_time_calibration), and then work with the
>>>> copy, confident in its lifetime and immutability. But until
>>>> get_s_time_fixed is proven to be interruptible, this is premature, so
>>>> your fix is sufficient. I think I need more information and testing to
>>>> say more.
>>>
>>> While the cpu_calibration per-CPU variable is updated from IRQ context,
>>> the cpu_time one isn't. Hence t->stamp's contents cannot change behind
>>> the back of get_s_time_fixed(). I wonder whether ...
>>>
>>>> Regarding the other scale_delta calls, if they include values
>>>> calculated from externally saved tsc values that could have become
>>>> stale during the process_pending_softirqs call, this definitely needs
>>>> to
>>>> be fixed.
>>>
>>> ... another similar issue (possibly one not included in the set of
>>> remarks I have in the patch, as none of those look related to what you
>>> describe) might be causing the remaining, more rare problems you say you
>>> see. That set of remarks is actually a result of me going over all other
>>> scale_delta() calls, but of course I may have got the analysis wrong.
>>>
>>> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
>>> load) - while I don't think we have a support statement for such upstream
>>> (when probably we should), iirc for our (SUSE's) products we would
>>> consider that unsupported. Just fyi.
>>>
>>> Also, btw, please don't top-post.
>>>
>>> Jan
>>>
>>
> 


Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Anton Markov 3 weeks, 6 days ago
>
> Btw, your prior response was too hard to properly read, due to excess blank
> lines with at the same time squashed leading blanks. Together with your
> apparent inability to avoid top-posting, I think you really want to adjust
> your mail program's configuration.

I suggest we skip the discussion of formatting and the number of spaces in
messages and instead focus on the topic of the thread. I have a very
difficult time troubleshooting difficult-to-reproduce bugs, and the fact
that their descriptions are difficult to read due to the number of spaces
is probably the least of the difficulties.

> That invocation of get_s_time_fixed() reduces to scale_delta() (without
> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
> it's not quite clear to me what change you are suggesting (that would
> actually make a functional difference).

Replacing get_s_time_fixed with scale_delta will remove the calculation
dependency on the previous local_stime value, which accumulates lag between
cores. This is because rdtsc_ordered is not executed at the same instant on
all cores; each core's read is offset by the IPI delivery latency. Therefore,
we get:

core0: current_rdtsc;
core1: current_rdtsc + ipi speed;
coreN: current_rdtsc + ipi speed * N;

Since the IPIs are sent one after another in a loop from core0, in the version
with get_s_time_fixed, we get the following local_stime calculation format.

coreN: local_stime = local_stime + scale_delta((current_rdtsc + (ipi_speed
* N)) - local_rdtsc);

This means the time on each core will differ by ipi_speed * N. And since
we're using the values of the previous local_stime, the difference will
accumulate because the previous local_stime was also offset. In the version
with scale_delta, we get:

coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));

This means there will still be a difference, but it won't accumulate, and
the offsets will remain within normal limits.

If it's still unclear: if your local_stime in get_s_time_fixed is offset
relative to other cores, then the fact that rdtsc_ordered and local_tsc are
not offset doesn't change anything, since you're adding the delta on top of
local_stime.

core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime +
(rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc are
equal across cores.
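
To make the accumulation concrete, here is a toy model (my own sketch, not
Xen code; hypothetical numbers: a 3 GHz TSC, a one-tick read jitter on
core1, and one calibration per second):

```
#include <stdint.h>
#include <stdio.h>

#define HZ 3000000000ULL                     /* 3 GHz: 1 tick = 1/3 ns */

/* Truncating ticks->ns conversion, standing in for scale_delta(). */
static uint64_t ticks_to_ns(uint64_t ticks)
{
    return (uint64_t)((unsigned __int128)ticks * 1000000000ULL / HZ);
}

int main(void)
{
    uint64_t prev0 = 0, prev1 = 0;           /* TSC at previous calibration */
    uint64_t st0 = 0, st1 = 0;               /* chained local_stime, in ns */

    for ( unsigned int sec = 1; sec <= 604800; ++sec ) /* one week, 1s steps */
    {
        uint64_t now = (uint64_t)sec * HZ;
        uint64_t r0 = now;                   /* core0 reads exactly on time */
        uint64_t r1 = now + (sec & 1);       /* core1's read jitters 1 tick */

        st0 += ticks_to_ns(r0 - prev0);      /* chained: truncates each step */
        st1 += ticks_to_ns(r1 - prev1);
        prev0 = r0;
        prev1 = r1;
    }

    printf("chained gap: %lld ns\n", (long long)st0 - (long long)st1);
    printf("direct gap:  %lld ns\n",
           (long long)ticks_to_ns(prev0) - (long long)ticks_to_ns(prev1));
    return 0;
}
```

After a simulated week the chained gap comes out around 0.3 ms, while
directly scaling the same final TSC values stays within a tick - the same
order of magnitude as what I see on real hardware.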

On 96-core configurations, up to a millisecond of latency can accumulate in
local_stime over a week of operation, and this is a significant difference.
This is due to the fact that I use cpufreq=xen:performance max_cstate=1,
meaning that in my configuration, local_stime is never overwritten by
master_stime.

Thanks.

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 3 weeks, 6 days ago
On 12.01.2026 13:49, Anton Markov wrote:
>> Btw, your prior response was too hard to properly read, due to excess blank
>> lines with at the same time squashed leading blanks. Together with your
>> apparent inability to avoid top-posting, I think you really want to adjust
>> your mail program's configuration.
> 
> I suggest we skip the discussion of formatting and the number of spaces in
> messages and instead focus on the topic of the thread. I have a very
> difficult time troubleshooting difficult-to-reproduce bugs, and the fact
> that their descriptions are difficult to read due to the number of spaces
> is probably the least of the difficulties.

Perhaps, yet it still makes dealing with things more difficult.

>> That invocation of get_s_time_fixed() reduces to scale_delta() (without
>> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
>> it's not quite clear to me what change you are suggesting (that would
>> actually make a functional difference).
> 
> Replacing get_s_time_fixed with scale_delta will remove the calculation
> dependency on the previous local_stime value, which accumulates lag between
> cores. This is because rdtsc_ordered is not executed at the same instant on
> all cores; each core's read is offset by the IPI delivery latency.
> Therefore, we get:
> 
> core0: current_rdtsc;
> core1: current_rdtsc + ipi speed;
> coreN: current_rdtsc + ipi speed * N;

That's if IPIs are sent sequentially. In the most common case, they aren't,
though - we use the all-but-self shorthand.

Actually, even if IPIs are sent sequentially, I can't see where you spot
this effect: Both callers of time_calibration_rendezvous_tail() signal all
secondary CPUs to continue at the same time. Hence they'll all execute
time_calibration_rendezvous_tail() in parallel.

> Since the IPIs are sent one after another in a loop from core0,

Are they? I fear I don't know which part of the code you're talking about.

> in the version
> with get_s_time_fixed, we get the following local_stime calculation format.
> 
> coreN: local_stime = local_stime + scale_delta((current_rdtsc + (ipi_speed
> * N)) - local_rdtsc);

One of the reasons we (iirc) don't do that is that, since the scaling factor
is also slightly imprecise, we'd prefer to avoid scaling very big values.
IOW by changing as you suggest we'd trade one accumulating error for
another.
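
Rough numbers to illustrate, taking the shrd-by-32 form of scale_delta()
from your test program (i.e. a multiplier quantized to roughly 2^-31
relative precision): a week of uptime is about 6e14 ns, so a single-LSB
difference in mul_frac between two CPUs could skew their results against
one another by up to 6e14 * 2^-31, i.e. a few hundred microseconds - and
growing with uptime, if the full TSC value is scaled every time.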

Jan

> This means the time on each core will differ by ipi_speed * N. And since
> we're using the values of the previous local_stime, the difference will
> accumulate because the previous local_stime was also offset. In the version
> with scale_delta, we get:
> 
> coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));
> 
> This means there will still be a difference, but it won't accumulate, and
> the offsets will remain within normal limits.
> 
> If it's still unclear: If your local_stime in get_s_time_fixed is offset
> relative to other cores, then the fact that rdtsc_ordered and local_tsc are
> not offset doesn't change anything, since you're using the delta relative
> to local_stime.
> 
> core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime +
> (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc are
> equal across cores.
> 
> On 96-core configurations, up to a millisecond of latency can accumulate in
> local_stime over a week of operation, and this is a significant
> difference. This
> is due to the fact that I use cpufreq=xen:performance max_cstate=1 ,
> meaning that in my configuration, local_stime is never overwritten by
> master_stime.
> 
> Thanks.

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Anton Markov 3 weeks, 6 days ago
>
> Perhaps, yet it still makes dealing with things more difficult.

Sorry. I've just spent too much time on this bug to keep a level head.

> That's if IPIs are sent sequentially. In the most common case, they aren't,
> though - we use the all-but-self shorthand.
>
> Actually, even if IPIs are sent sequentially, I can't see where you spot
> this effect: Both callers of time_calibration_rendezvous_tail() signal all
> secondary CPUs to continue at the same time. Hence they'll all execute
> time_calibration_rendezvous_tail() in parallel.

In parallel, but with a slight delay.

> Are they? I fear I don't know which part of the code you're talking about.

In the function time_calibration() (xen/arch/x86/time.c). Sorry, I didn't
take into account that you don't stay in this context, being distracted by
other threads.

> One of the reasons we (iirc) don't do that is that, since the scaling
> factor is also slightly imprecise, we'd prefer to avoid scaling very big
> values. IOW by changing as you suggest we'd trade one accumulating error
> for another.

As I wrote above, there will be an error when using scale_delta, but it
won't accumulate between calls to time_calibration_rendezvous_tail. In the
current version, the old error (ipi lag + rounding error) persists due to
the use of the old local_stime in the get_s_time_fixed function, and it's
added to the new error and accumulates with each call.
If

c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);

is replaced with

c->local_stime = scale_delta(old_tsc ?: new_tsc);

then we'll only be dealing with the error due to the current ipi lag and
rough rounding, within a single call.
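
As a diff-style sketch (the scale argument is just a placeholder here; I
haven't spelled out the exact expression):

```
-    c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);
+    /* Recompute from the TSC alone, so per-call error can't compound. */
+    c->local_stime = scale_delta(old_tsc ?: new_tsc, <this CPU's scale>);
```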

If I understand you correctly, you avoided the rough rounding of scale_delta
by keeping the scaled values small via get_s_time_fixed(). But the problem
is, that didn't help. The error now accumulates gradually because we're
relying on old calculations. Furthermore, while the old rounding error was
bounded, now the error accumulates without bound, albeit very slowly. If
this is so, then the solution to the problem becomes less obvious.

Roughly speaking, my servers start to go crazy after a week of continuous
operation, as the time lag between cores reaches 1 millisecond or more.
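
For scale, a rough upper bound (assuming the ~1 s calibration period the
logs above suggest): each chained update can lose just under 1 ns to
truncation, and a week is 604800 s, so truncation alone can contribute
drift on the order of 0.6 ms/week per core - consistent with what I see.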

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 3 weeks, 6 days ago
On 12.01.2026 15:51, Anton Markov wrote:
>> That's if IPIs are sent sequentially. In the most common case, they aren't,
>> though - we use the all-but-self shorthand.
> 
>> Actually, even if IPIs are sent sequentially, I can't see where you spot
>> this effect: Both callers of time_calibration_rendezvous_tail() signal all
>> secondary CPUs to continue at the same time. Hence they'll all execute
>> time_calibration_rendezvous_tail() in parallel.
> 
> In parallel, but with a slight delay.
> 
>> Are they? I fear I don't know which part of the code you're talking about.
> 
> In the function time_calibration() (xen/arch/x86/time.c). Sorry, I didn't
> take into account that you don't stay in this context, being distracted by
> other threads.

That calls on_selected_cpus(), but send_IPI_mask() may then still resort to
all-but-self. In that case all IPIs are sent in one go.

Plus as said, how IPIs are sent doesn't matter for the invocation of
time_calibration_rendezvous_tail(). They'll all run at the same time, not
one after the other.

Since further down you build upon that "IPI lag", I fear we first need to
settle on this aspect of your theory.

Jan

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Anton Markov 3 weeks, 6 days ago
>
> That calls on_selected_cpus(), but send_IPI_mask() may then still resort to
> all-but-self. In that case all IPIs are sent in one go.

> Plus as said, how IPIs are sent doesn't matter for the invocation of
> time_calibration_rendezvous_tail(). They'll all run at the same time, not
> one after the other.

At the hardware level, no one can guarantee that the processors will
respond to the signal simultaneously and execute your code a nanosecond
after you send the IPI, especially when we're talking about NUMA
configurations. I'm afraid what is and isn't possible under the laws of
physics is also beyond the scope of this thread.

> Since further down you build upon that "IPI lag", I fear we first need to
> settle on this aspect of your theory.

I've already provided the delay logs. It's not hard for me to repeat them.

The patch:

> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)
>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>      {
>          /* Atomically read cpu_calibration struct and write cpu_time struct. */
> +        printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n", smp_processor_id(), t->stamp.local_stime, c->local_stime,
> +               t->last_seen_ns, t->last_seen_tsc);
>          local_irq_disable();
>          t->stamp = *c;
>          local_irq_enable();

2 hours of work:

> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 (8565145862216, 0)
> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 (8565145863957, 0)
> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 (8565145864800, 0)
> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 (8565145865372, 0)
>
> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag
>

3 hours of work:

> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 (22915730870665, 0)
> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 (22915730870693, 0)
> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 (22915730872231, 0)
> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 (22915730872096, 0)
>
> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
>

2-3 days of work:

> (XEN) update stime on time calibrate 0, 254477161980127 -> 254478162020920 (254478162021549, 0)
> (XEN) update stime on time calibrate 2, 254477161977638 -> 254478162018429 (254478162022187, 0)
> (XEN) update stime on time calibrate 1, 254477161978192 -> 254478162018972 (254478162022776, 0)
> (XEN) update stime on time calibrate 3, 254477161976832 -> 254478162017636 (254478162021394, 0)
>
> 254478162020920 - 254478162017636 = 3284 * 3 (3.00 GHz) = 9852 lag


As you can see, each core's lag is strictly determined by its sequence
number. I won't argue about what percentage of this delay is due to rounding
error and what percentage is due to IPI lag. To reproduce this, simply apply
the patch (excluding t->last_seen_ns and t->last_seen_tsc, which were only
necessary for my own understanding), then boot the hypervisor with the
settings cpufreq=xen:performance max_cstate=1. Clocksource is left at the
default (i.e., hpet).

Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Jan Beulich 3 weeks, 6 days ago
On 12.01.2026 17:41, Anton Markov wrote:
>>
>> That calls on_selected_cpus(), but send_IPI_mask() may then still resort to
>> all-but-self. In that case all IPIs are sent in one go.
> 
>> Plus as said, how IPIs are sent doesn't matter for the invocation of
>> time_calibration_rendezvous_tail(). They'll all run at the same time, not
>> one after the other.
> 
> At the hardware level, no one can guarantee that the processors will
> respond to the signal simultaneously and execute your code a nanosecond
> after you send the IPI, especially when we're talking about NUMA
> configurations. I'm afraid what is and isn't possible under the laws of
> physics is also beyond the scope of this thread.

You did read my recurring explanation beyond the IPI sending, didn't you?
Of course IPI arrival may vary across cores / threads. Yet the term
"rendezvous" is used because CPUs having received the IPI are then held
in a waiting loop, until _all_ CPUs have made it there. Then CPU0
indicates to all of them simultaneously to move to the next step. There's
going to again be some variance (especially on NUMA, where the memory
write needs to propagate to all nodes), but at least within a single node
that should be pretty low. The main source of variance I would expect
there would be hyperthreads competing with one another in a single core.
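
In shape the rendezvous is roughly this (a sketch of the mechanism as just
described, not the actual implementation):

```
#include <stdatomic.h>

static atomic_uint nr_arrived;  /* CPUs that have reached the rendezvous */
static atomic_uint generation;  /* bumped by CPU0 to release everybody */

void rendezvous_step(unsigned int cpu, unsigned int nr_cpus)
{
    unsigned int gen = atomic_load(&generation);

    atomic_fetch_add(&nr_arrived, 1);         /* announce arrival */

    if ( cpu == 0 )
    {
        while ( atomic_load(&nr_arrived) != nr_cpus )
            ;                                 /* wait until _all_ arrived */
        atomic_store(&nr_arrived, 0);
        atomic_fetch_add(&generation, 1);     /* release all CPUs at once */
    }
    else
    {
        while ( atomic_load(&generation) == gen )
            ;                                 /* held in the waiting loop */
    }

    /* All CPUs proceed from here near-simultaneously; the residual skew
     * is just how fast the generation write propagates to each of them. */
}
```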

>> Since further down you build upon that "IPI lag", I fear we first need to
>> settle on this aspect of your theory.
> 
> I've already provided the delay logs. It's not hard for me to repeat them.

Sure, I don't doubt you make those observations. But we're still trying to
converge on a theory on what these may be caused by.

> 2 hours of work:
> 
>> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 (8565145862216, 0)
>> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 (8565145863957, 0)
>> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 (8565145864800, 0)
>> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 (8565145865372, 0)
>>
>> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag

The log entries aren't in CPU order, and CPUs 1 and 2 actually have
identical values on the rhs. That doesn't quite fit what you have said so
far. CPU3's value is also lower than CPU0's.

> 3 hours of work:
> 
>> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 (22915730870665, 0)
>> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 (22915730870693, 0)
>> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 (22915730872231, 0)
>> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 (22915730872096, 0)
>>
>> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag

While CPU numbers happen to be in sequence here, the rhs values aren't equally
ordered.

Also really here it is

22915730869696 - 22915730869993 = -297 * 3 (3.00 GHz) = 891 ahead

> 2-3 days of work:
> 
>> (XEN) update stime on time calibrate 0, 254477161980127 -> 254478162020920 (254478162021549, 0)
>> (XEN) update stime on time calibrate 2, 254477161977638 -> 254478162018429 (254478162022187, 0)
>> (XEN) update stime on time calibrate 1, 254477161978192 -> 254478162018972 (254478162022776, 0)
>> (XEN) update stime on time calibrate 3, 254477161976832 -> 254478162017636 (254478162021394, 0)
>>
>> 254478162020920 - 254478162017636 = 3284 * 3 (3.00 GHz) = 9852 lag

Similarly here. Yes, the gap increases, yet that's not a lag of CPU3 past
CPU0, but exactly the other way around.

> As you can see, each core's lag is strictly determined by its sequence
> number.

As per above - no, I don't think I can see that. Or maybe I'm misreading the
numbers as well as what you have been saying.

Jan
Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
Posted by Anton Markov 3 weeks, 5 days ago
> You did read my recurring explanation beyond the IPI sending, didn't you?
> Of course IPI arrival may vary across cores / threads. Yet the term
> "rendezvous" is used because CPUs having received the IPI are then held in
> a waiting loop, until _all_ CPUs have made it there. Then CPU0 indicates to
> all of them simultaneously to move to the next step. There's going to again
> be some variance (especially on NUMA, where the memory write needs to
> propagate to all nodes), but at least within a single node that should be
> pretty low. The main source of variance I would expect there would be
> hyperthreads competing with one another in a single core.

Yes, I saw it. I'm not trying to dispute your understanding of the
situation. The difference may be small, but it adds up.

> Sure, I don't doubt you make those observations. But we're still trying to
> converge on a theory on what these may be caused by.
>
I can't tell you exactly what the main cause of the delay is. I can only
list possible factors:
1. Rounding error, which varies for each core;
2. Delay in IPI delivery (even at the hardware level, signal delivery won't
happen simultaneously, and some cores may have interrupts disabled);
3. CPU frequency boost, which allows some cores to execute code faster. On
modern CPUs this doesn't affect the TSC frequency, but the problem is that
the counter will be read at different times;
4. The initial difference in TSC counter values, which for cores within a
single CPU should be no more than 100 ns, and in the case of NUMA no more
than 1000 ns.
I can't speak about percentages; I wasn't involved in CPU development, but
there are many reasons why cores can't be expected to run through this
sequence at exactly the same pace.

> The log entries aren't in CPU order, and CPUs 1 and 2 actually have
> identical values on the rhs. That doesn't quite fit what you have said so
> far. CPU3's value is also lower than CPU0's.
> While CPU numbers happen to be in sequence here, the rhs values aren't
> equally ordered.
> Also really here it is
> 22915730869696 - 22915730869993 = -297 * 3 (3.00 GHz) = 891 ahead
> Similarly here. Yes, the gap increases, yet that's not a lag of CPU3 past
> CPU0, but exactly the other way around.
> As per above - no, I don't think I can see that. Or maybe I'm misreading
> the numbers as well as what you have been saying.
>
During the first few hours, the situation can be blurred due to possible
race conditions. After two days, it becomes more or less clear:

254478162020920 (core 0) > 254478162018972 (core 1) > 254478162018429 (core 2) > 254478162017636 (core 3)

The lower the core number, the more it pulls ahead. It's possible that this
is indeed related to which core is most heavily loaded (which one activates
CPU boost more often). In my configuration, the first three cores are
dedicated to DOM0, and the fourth is reserved for virtual machines. The
first core ends up being the most heavily loaded due to interrupt handling,
etc.
I can also add that after replacing get_s_time_fixed with scale_delta, the
difference stops accumulating. At this point, it's clear to me that the
problem is the use of previous local_stime values. The nature of this CPU
behavior is unlikely to be understood at the software level. Of course, all
the processors I tested on have the constant_tsc, nonstop_tsc, and
tsc_known_freq flags.


On Tue, Jan 13, 2026 at 2:21 PM Jan Beulich <jbeulich@suse.com> wrote:

> On 12.01.2026 17:41, Anton Markov wrote:
> >> That calls on_selected_cpus(), but send_IPI_mask() may then still
> >> resort to all-but-self. In that case all IPIs are sent in one go.
> >
> >> Plus as said, how IPIs are sent doesn't matter for the invocation of
> >> time_calibration_rendezvous_tail(). They'll all run at the same time,
> >> not one after the other.
> >
> > At the hardware level, no one can guarantee that the processors will
> > simultaneously respond to the signal and execute your code a nanosecond
> > after you send the IPI. Especially when we're talking about NUMA
> > configurations. I'm afraid what's possible and impossible under the laws
> > of physics is also beyond the scope of this thread.
>
> You did read my recurring explanation beyond the IPI sending, didn't you?
> Of course IPI arrival may vary across cores / threads. Yet the term
> "rendezvous" is used because CPUs having received the IPI are then held
> in a waiting loop, until _all_ CPUs have made it there. Then CPU0
> indicates to all of them simultaneously to move to the next step. There's
> going to again be some variance (especially on NUMA, where the memory
> write needs to propagate to all nodes), but at least within a single node
> that should be pretty low. The main source of variance I would expect
> there would be hyperthreads competing with one another in a single core.
>
> >> Since further down you build upon that "IPI lag", I fear we first need
> >> to settle on this aspect of your theory.
> >
> > I've already provided the delay logs. It's not hard for me to repeat.
>
> Sure, I don't doubt you make those observations. But we're still trying to
> converge on a theory on what these may be caused by.
>
> > 2 hours of uptime:
> >
> >> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597
> >> (8565145862216, 0)
> >> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609
> >> (8565145863957, 0)
> >> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491
> >> (8565145864800, 0)
> >> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609
> >> (8565145865372, 0)
> >>
> >> 8565145861609 - 8565145861491 = 118 * 3 (3.00 GHz) = 354 lag
>
> The log entries aren't in CPU order, and CPUs 1 and 2 actually have
> identical values on the rhs. That doesn't quite fit what you have said so
> far. CPU3's value is also lower than CPU0's.
>
> > 3 hours of uptime:
> >
> >> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993
> >> (22915730870665, 0)
> >> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889
> >> (22915730870693, 0)
> >> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841
> >> (22915730872231, 0)
> >> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696
> >> (22915730872096, 0)
> >>
> >> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
>
> While CPU numbers happen to be in sequence here, the rhs values aren't
> equally ordered.
>
> Also really here it is
>
> 22915730869696 - 22915730869993 = -297 * 3 (3.00 GHz) = 891 ahead
>
> > 2-3 days of uptime:
> >
> >> (XEN) update stime on time calibrate 0, 254477161980127 -> 254478162020920
> >> (254478162021549, 0)
> >> (XEN) update stime on time calibrate 2, 254477161977638 -> 254478162018429
> >> (254478162022187, 0)
> >> (XEN) update stime on time calibrate 1, 254477161978192 -> 254478162018972
> >> (254478162022776, 0)
> >> (XEN) update stime on time calibrate 3, 254477161976832 -> 254478162017636
> >> (254478162021394, 0)
> >>
> >> 254478162020920 - 254478162017636 = 3284 * 3 (3.00 GHz) = 9852 lag
>
> Similarly here. Yes, the gap increases, yet that's not a lag of CPU3 past
> CPU0, but exactly the other way around.
>
> > As you can see, each core's lag is strictly determined by its sequence
> > number.
>
> As per above - no, I don't think I can see that. Or maybe I'm misreading
> the numbers as well as what you have been saying.
>
> Jan
>