locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()

Posted by Paul E. McKenney 1 month, 2 weeks ago
Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock()
to check for excessive CSD-lock wait times.  This works, but does not
guarantee monotonic timestamps.  Therefore, switch from sched_clock()
to ktime_get_mono_fast_ns(), which does guarantee monotonic timestamps,
at least in the absence of calls from NMI handlers, which are not involved
in this code path.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>

diff --git a/kernel/smp.c b/kernel/smp.c
index f25e20617b7eb..27dc31a146a35 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -246,7 +246,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
 		return true;
 	}
 
-	ts2 = sched_clock();
+	ts2 = ktime_get_mono_fast_ns();
 	/* How long since we last checked for a stuck CSD lock.*/
 	ts_delta = ts2 - *ts1;
 	if (likely(ts_delta <= csd_lock_timeout_ns * (*nmessages + 1) *
@@ -321,7 +321,7 @@ static void __csd_lock_wait(call_single_data_t *csd)
 	int bug_id = 0;
 	u64 ts0, ts1;
 
-	ts1 = ts0 = sched_clock();
+	ts1 = ts0 = ktime_get_mono_fast_ns();
 	for (;;) {
 		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
 			break;
Re: locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
Posted by Rik van Riel 1 month, 2 weeks ago
On Wed, 2024-10-09 at 10:57 -0700, Paul E. McKenney wrote:
> Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock()
> to check for excessive CSD-lock wait times.  This works, but does not
> guarantee monotonic timestamps.  Therefore, switch from sched_clock()
> to ktime_get_mono_fast_ns(), which does guarantee monotonic
> timestamps,
> at least in the absence of calls from NMI handlers, which are not
> involved
> in this code path.
> 
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Leonardo Bras <leobras@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>

The functional change here is that sched_clock() uses rdtsc(),
while ktime_get_mono_fast_ns() uses rdtsc_ordered().
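
For reference, the distinction is roughly the following (a hand-written
sketch of the two read paths, not the actual arch/x86 code; the my_*()
names are made up for illustration):

	/* Plain rdtsc: the CPU may execute it before earlier instructions
	 * retire, so a "later" call can return an "earlier" count. */
	static __always_inline u64 my_rdtsc(void)
	{
		u32 lo, hi;

		asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return lo | ((u64)hi << 32);
	}

	/* Ordered variant: the lfence keeps rdtsc from being executed
	 * ahead of preceding instructions, so back-to-back readings on
	 * one CPU cannot appear to go backwards. */
	static __always_inline u64 my_rdtsc_ordered(void)
	{
		u32 lo, hi;

		asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi) :: "memory");
		return lo | ((u64)hi << 32);
	}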

I don't know if sched_clock() should also be using rdtsc_ordered().

Reviewed-by: Rik van Riel <riel@surriel.com>


-- 
All Rights Reversed.
Re: locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
Posted by Paul E. McKenney 1 month, 2 weeks ago
On Thu, Oct 10, 2024 at 11:13:35AM -0400, Rik van Riel wrote:
> On Wed, 2024-10-09 at 10:57 -0700, Paul E. McKenney wrote:
> > Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock()
> > to check for excessive CSD-lock wait times.  This works, but does not
> > guarantee monotonic timestamps.  Therefore, switch from sched_clock()
> > to ktime_get_mono_fast_ns(), which does guarantee monotonic
> > timestamps,
> > at least in the absence of calls from NMI handlers, which are not
> > involved
> > in this code path.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: Leonardo Bras <leobras@redhat.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> 
> The functional change here is that sched_clock() uses rdtsc(),
> while ktime_get_mono_fast_ns() uses rdtsc_ordered().
> 
> I don't know if sched_clock() should also be using rdtsc_ordered().
> 
> Reviewed-by: Rik van Riel <riel@surriel.com>

Thank you!  I will apply this on my next rebase.

							Thanx, Paul
Re: locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
Posted by Paul E. McKenney 1 month, 2 weeks ago
On Thu, Oct 10, 2024 at 11:31:08AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 10, 2024 at 11:13:35AM -0400, Rik van Riel wrote:
> > On Wed, 2024-10-09 at 10:57 -0700, Paul E. McKenney wrote:
> > > Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock()
> > > to check for excessive CSD-lock wait times.  This works, but does not
> > > guarantee monotonic timestamps.  Therefore, switch from sched_clock()
> > > to ktime_get_mono_fast_ns(), which does guarantee monotonic
> > > timestamps,
> > > at least in the absence of calls from NMI handlers, which are not
> > > involved
> > > in this code path.
> > > 
> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
> > > Cc: Rik van Riel <riel@surriel.com>
> > > Cc: Leonardo Bras <leobras@redhat.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> > 
> > The functional change here is that sched_clock() uses rdtsc(),
> > while ktime_get_mono_fast_ns() uses rdtsc_ordered().
> > 
> > I don't know if sched_clock() should also be using rdtsc_ordered().
> > 
> > Reviewed-by: Rik van Riel <riel@surriel.com>
> 
> Thank you!  I will apply this on my next rebase.

And please see below for the updated version.

							Thanx, Paul

------------------------------------------------------------------------

locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()

Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock() to check
for excessive CSD-lock wait times.  This works, but does not guarantee
monotonic timestamps on x86 due to the sched_clock() function's use of
the rdtsc instruction, which does not guarantee ordering.  This means
that, given two successive calls to sched_clock(), the second might return
an earlier time than the first, that is, time might seem to go backwards.
This can (and does!) result in false-positive CSD-lock wait complaints
claiming almost 2^64 nanoseconds of delay.
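
As an illustration of where the "almost 2^64" figure comes from (not
code from this patch, just the unsigned wraparound spelled out with
made-up numbers):

	u64 ts1 = 1000000100;		/* hypothetical first reading */
	u64 ts2 = 1000000000;		/* later reading that appears 100ns earlier */
	u64 ts_delta = ts2 - ts1;	/* wraps to 2^64 - 100, not -100 */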

Therefore, switch from sched_clock() to ktime_get_mono_fast_ns(), which
does guarantee monotonic timestamps via the rdtsc_ordered() function,
which, as the name implies, does guarantee ordered timestamps, at least
in the absence of calls from NMI handlers, which are not involved in
this code path.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>

diff --git a/kernel/smp.c b/kernel/smp.c
index f25e20617b7eb..27dc31a146a35 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -246,7 +246,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
 		return true;
 	}
 
-	ts2 = sched_clock();
+	ts2 = ktime_get_mono_fast_ns();
 	/* How long since we last checked for a stuck CSD lock.*/
 	ts_delta = ts2 - *ts1;
 	if (likely(ts_delta <= csd_lock_timeout_ns * (*nmessages + 1) *
@@ -321,7 +321,7 @@ static void __csd_lock_wait(call_single_data_t *csd)
 	int bug_id = 0;
 	u64 ts0, ts1;
 
-	ts1 = ts0 = sched_clock();
+	ts1 = ts0 = ktime_get_mono_fast_ns();
 	for (;;) {
 		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
 			break;
Re: locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Wed, Oct 09, 2024 at 10:57:24AM -0700, Paul E. McKenney wrote:
> Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock()
> to check for excessive CSD-lock wait times.  This works, but does not
> guarantee monotonic timestamps. 

It does if you provide a sane TSC

> Therefore, switch from sched_clock()
> to ktime_get_mono_fast_ns(), which does guarantee monotonic timestamps,
> at least in the absence of calls from NMI handlers, which are not involved
> in this code path.

That can end up using HPET in the case of non-sane TSC.

In the good case they're equal, in the bad case you're switching from
slightly dodgy time to super expensive time. Is that what you want?

> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Leonardo Bras <leobras@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f25e20617b7eb..27dc31a146a35 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -246,7 +246,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
>  		return true;
>  	}
>  
> -	ts2 = sched_clock();
> +	ts2 = ktime_get_mono_fast_ns();
>  	/* How long since we last checked for a stuck CSD lock.*/
>  	ts_delta = ts2 - *ts1;
>  	if (likely(ts_delta <= csd_lock_timeout_ns * (*nmessages + 1) *
> @@ -321,7 +321,7 @@ static void __csd_lock_wait(call_single_data_t *csd)
>  	int bug_id = 0;
>  	u64 ts0, ts1;
>  
> -	ts1 = ts0 = sched_clock();
> +	ts1 = ts0 = ktime_get_mono_fast_ns();
>  	for (;;) {
>  		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
>  			break;
Re: locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
Posted by Paul E. McKenney 1 month, 2 weeks ago
On Wed, Oct 09, 2024 at 08:07:08PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 09, 2024 at 10:57:24AM -0700, Paul E. McKenney wrote:
> > Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock()
> > to check for excessive CSD-lock wait times.  This works, but does not
> > guarantee monotonic timestamps. 
> 
> It does if you provide a sane TSC

What is this "sane TSC" of which you speak?  ;-)

More seriously, the raw reads from the TSC that are carried out by
sched_clock() are not guaranteed to be monotonic due to potential
instruction reordering and the like.  This is *not* a theoretical
statement -- we really do see this on the fleet.  Very rarely for any
given system, to be sure, but not at all rare across the full set of them.
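
One way to picture the failure mode (purely illustrative, not anything
in the tree): a back-to-back pair of readings on a single CPU that
should never decrease occasionally does, for example:

	/* Hypothetical detector for sched_clock() going backwards. */
	u64 prev = sched_clock();

	for (int i = 0; i < 1000000; i++) {
		u64 now = sched_clock();

		if ((s64)(now - prev) < 0)
			pr_warn("sched_clock() went backwards by %lld ns\n",
				(s64)(prev - now));
		prev = now;
	}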

This results in false-positive CSD-lock complaints claiming almost 2^64
nanoseconds of delay, which are not good complaints to have.

> > Therefore, switch from sched_clock()
> > to ktime_get_mono_fast_ns(), which does guarantee monotonic timestamps,
> > at least in the absence of calls from NMI handlers, which are not involved
> > in this code path.
> 
> That can end up using HPET in the case of non-sane TSC.

That is true.  However...

> In the good case they're equal, in the bad case you're switching from
> slightly dodgy time to super expensive time. Is that what you want?

If TSC is not sane, we don't want to be using the system at all.
In fact, the super-expensive time will more often than not result in
the system being automatically taken out of service due to the excessive
application-level latencies.

But in the good case where the TSC is sane, we need successive reads
from the TSC to be ordered in order to avoid the false-positive
complaints.  Yes, this does add a bit of overhead, but the CPU isn't
doing anything useful anyway, so not a problem.  This same lack of
concern might also apply to HPET reads.

Should I upgrade the commit log?  Or am I missing your point?

							Thanx, Paul

> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: Leonardo Bras <leobras@redhat.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> > 
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index f25e20617b7eb..27dc31a146a35 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -246,7 +246,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
> >  		return true;
> >  	}
> >  
> > -	ts2 = sched_clock();
> > +	ts2 = ktime_get_mono_fast_ns();
> >  	/* How long since we last checked for a stuck CSD lock.*/
> >  	ts_delta = ts2 - *ts1;
> >  	if (likely(ts_delta <= csd_lock_timeout_ns * (*nmessages + 1) *
> > @@ -321,7 +321,7 @@ static void __csd_lock_wait(call_single_data_t *csd)
> >  	int bug_id = 0;
> >  	u64 ts0, ts1;
> >  
> > -	ts1 = ts0 = sched_clock();
> > +	ts1 = ts0 = ktime_get_mono_fast_ns();
> >  	for (;;) {
> >  		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
> >  			break;
Re: locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Wed, Oct 09, 2024 at 11:18:34AM -0700, Paul E. McKenney wrote:
> On Wed, Oct 09, 2024 at 08:07:08PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 09, 2024 at 10:57:24AM -0700, Paul E. McKenney wrote:
> > > Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock()
> > > to check for excessive CSD-lock wait times.  This works, but does not
> > > guarantee monotonic timestamps. 
> > 
> > It does if you provide a sane TSC
> 
> What is this "sane TSC" of which you speak?  ;-)
> 
> More seriously, the raw reads from the TSC that are carried out by
> sched_clock() are not guaranteed to be monotonic due to potential
> instruction reordering and the like.  This is *not* a theoretical
> statement -- we really do see this on the fleet.  Very rarely for any
> given system, to be sure, but not at all rare across the full set of them.
> 
> This results in false-positive CSD-lock complaints claiming almost 2^64
> nanoseconds of delay, which are not good complaints to have.

Ooh, so the real difference is that clocksource_tsc ends up using
rdtsc_ordered() while sched_clock() ends up using rdtsc(), and you're
actually seeing that reordering happen.

*urgh*.

Yes, please put that in the Changelog.
Re: locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
Posted by Paul E. McKenney 1 month, 2 weeks ago
On Thu, Oct 10, 2024 at 01:21:32PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 09, 2024 at 11:18:34AM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 09, 2024 at 08:07:08PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 09, 2024 at 10:57:24AM -0700, Paul E. McKenney wrote:
> > > > Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock()
> > > > to check for excessive CSD-lock wait times.  This works, but does not
> > > > guarantee monotonic timestamps. 
> > > 
> > > It does if you provide a sane TSC
> > 
> > What is this "sane TSC" of which you speak?  ;-)
> > 
> > More seriously, the raw reads from the TSC that are carried out by
> > sched_clock() are not guaranteed to be monotonic due to potential
> > instruction reordering and the like.  This is *not* a theoretical
> > statement -- we really do see this on the fleet.  Very rarely for any
> > given system, to be sure, but not at all rare across the full set of them.
> > 
> > This results in false-positive CSD-lock complaints claiming almost 2^64
> > nanoseconds of delay, which are not good complaints to have.
> 
> Ooh, so the real difference is that clocksource_tsc ends up using
> rdtsc_ordered() while sched_clock() ends up using rdtsc(), and you're
> actually seeing that reordering happen.

You got it!

> *urgh*.
> 
> Yes, please put that in the Changelog.

I will do so on my next rebase.  And thank you for looking this over!

							Thanx, Paul