[PATCH] clocksource: Fix the CPUs' choice in the watchdog percpu verification

Guilherme G. Piccoli posted 1 patch 10 months, 3 weeks ago
kernel/time/clocksource.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH] clocksource: Fix the CPUs' choice in the watchdog percpu verification
Posted by Guilherme G. Piccoli 10 months, 3 weeks ago
Right now, if the clocksource watchdog detects a clocksource skew, it
might perform a percpu checking, for example in the TSC case on x86.
In other words: supposing TSC is detected as unstable by the
clocksource watchdog running at CPU1, as part of marking TSC as
unstable the kernel will also run a check of TSC readings in some CPUs
other than CPU1 to be sure it is synced between them all.

But that check might happen only in some CPUs, not all of them; this
choice is based on the parameter "verify_n_cpus" and in some random
cpumask calculation. So, the watchdog runs such percpu check in
up to "verify_n_cpus" random CPUs among all online CPUs, with the risk
of repeating CPUs (that aren't double checked) in the cpumask random
calculation.

But if "verify_n_cpus" > num_online_cpus(), we could skip the random
calculation and just go ahead and check the clocksource sync between
all online CPUs, without the risk of skipping some CPUs due to
duplicity in the random cpumask calculation. That approach is exactly
what is proposed here.

Tests in a 4 CPU laptop with TSC skew detected led to some cases of
the percpu verification skipping some CPU even with verify_n_cpus=8,
due to the duplicity on random cpumask generation. With this patch,
the issue is fixed.

Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>

---


Hey folks, thanks in advance for reviews / suggestions!

Special thanks to Cascardo for the suggestion of checking
verify_n_cpus against the number of online CPUs - definitely
improved the idea!

I think this could be backported to stable if makes sense;
I can do it, please let me know your opinion.
Cheers,

Guilherme


 kernel/time/clocksource.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 2a7802ec480c..a32732dab27e 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -310,7 +310,7 @@ static void clocksource_verify_choose_cpus(void)
 {
 	int cpu, i, n = verify_n_cpus;
 
-	if (n < 0) {
+	if ((n < 0) || (n >= num_online_cpus())) {
 		/* Check all of the CPUs. */
 		cpumask_copy(&cpus_chosen, cpu_online_mask);
 		cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
-- 
2.48.1
Re: [PATCH] clocksource: Fix the CPUs' choice in the watchdog percpu verification
Posted by Paul E. McKenney 10 months, 3 weeks ago
On Sun, Mar 23, 2025 at 02:36:24PM -0300, Guilherme G. Piccoli wrote:
> Right now, if the clocksource watchdog detects a clocksource skew, it
> might perform a percpu checking, for example in the TSC case on x86.
> In other words: supposing TSC is detected as unstable by the
> clocksource watchdog running at CPU1, as part of marking TSC as
> unstable the kernel will also run a check of TSC readings in some CPUs
> other than CPU1 to be sure it is synced between them all.
> 
> But that check might happen only in some CPUs, not all of them; this
> choice is based on the parameter "verify_n_cpus" and in some random
> cpumask calculation. So, the watchdog runs such percpu check in
> up to "verify_n_cpus" random CPUs among all online CPUs, with the risk
> of repeating CPUs (that aren't double checked) in the cpumask random
> calculation.
> 
> But if "verify_n_cpus" > num_online_cpus(), we could skip the random
> calculation and just go ahead and check the clocksource sync between
> all online CPUs, without the risk of skipping some CPUs due to
> duplicity in the random cpumask calculation. That approach is exactly
> what is proposed here.
> 
> Tests in a 4 CPU laptop with TSC skew detected led to some cases of
> the percpu verification skipping some CPU even with verify_n_cpus=8,
> due to the duplicity on random cpumask generation. With this patch,
> the issue is fixed.
> 
> Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
> 
> 
> Hey folks, thanks in advance for reviews / suggestions!
> 
> Special thanks to Cascardo for the suggestion of checking
> verify_n_cpus against the number of online CPUs - definitely
> improved the idea!
> 
> I think this could be backported to stable if makes sense;
> I can do it, please let me know your opinion.
> Cheers,
> 
> Guilherme
> 
> 
>  kernel/time/clocksource.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 2a7802ec480c..a32732dab27e 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -310,7 +310,7 @@ static void clocksource_verify_choose_cpus(void)
>  {
>  	int cpu, i, n = verify_n_cpus;
>  
> -	if (n < 0) {
> +	if ((n < 0) || (n >= num_online_cpus())) {
>  		/* Check all of the CPUs. */
>  		cpumask_copy(&cpus_chosen, cpu_online_mask);
>  		cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
> -- 
> 2.48.1
>
[tip: timers/core] clocksource: Fix the CPUs' choice in the watchdog per CPU verification
Posted by tip-bot2 for Guilherme G. Piccoli 9 months ago
The following commit has been merged into the timers/core branch of tip:

Commit-ID:     08d7becc1a6b8c936e25d827becabfe3bff72a36
Gitweb:        https://git.kernel.org/tip/08d7becc1a6b8c936e25d827becabfe3bff72a36
Author:        Guilherme G. Piccoli <gpiccoli@igalia.com>
AuthorDate:    Sun, 23 Mar 2025 14:36:24 -03:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 13 May 2025 15:38:55 +02:00

clocksource: Fix the CPUs' choice in the watchdog per CPU verification

Right now, if the clocksource watchdog detects a clocksource skew, it might
perform a per CPU check, for example in the TSC case on x86.  In other
words: supposing TSC is detected as unstable by the clocksource watchdog
running at CPU1, as part of marking TSC unstable the kernel will also run a
check of TSC readings on some CPUs to be sure it is synced between them
all.

But that check happens only on some CPUs, not all of them; this choice is
based on the parameter "verify_n_cpus" and in some random cpumask
calculation. So, the watchdog runs such per CPU checks on up to
"verify_n_cpus" random CPUs among all online CPUs, with the risk of
repeating CPUs (that aren't double checked) in the cpumask random
calculation.

But if "verify_n_cpus" > num_online_cpus(), it should skip the random
calculation and just go ahead and check the clocksource sync between
all online CPUs, without the risk of skipping some CPUs due to
duplicity in the random cpumask calculation.

Tests in a 4 CPU laptop with TSC skew detected led to some cases of the per
CPU verification skipping some CPU even with verify_n_cpus=8, due to the
duplicity on random cpumask generation. Skipping the randomization when the
number of online CPUs is smaller than verify_n_cpus, solves that.

Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/20250323173857.372390-1-gpiccoli@igalia.com
---
 kernel/time/clocksource.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index bb48498..6a8bc7d 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -310,7 +310,7 @@ static void clocksource_verify_choose_cpus(void)
 {
 	int cpu, i, n = verify_n_cpus;
 
-	if (n < 0) {
+	if (n < 0 || n >= num_online_cpus()) {
 		/* Check all of the CPUs. */
 		cpumask_copy(&cpus_chosen, cpu_online_mask);
 		cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);