From: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Lockdep reports the following possible deadlock:
============================================
WARNING: possible recursive locking detected
--------------------------------------------
sh/31444 is trying to acquire lock:
ffffffff82c51af0 (cpu_hotplug_lock){++++}-{0:0}, at:
stop_per_cpu_kthreads+0x7/0x60
but task is already holding lock:
ffffffff82c51af0 (cpu_hotplug_lock){++++}-{0:0}, at:
start_per_cpu_kthreads+0x28/0x140
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(cpu_hotplug_lock);
lock(cpu_hotplug_lock);
Call Trace:
<TASK>
__lock_acquire+0x1612/0x29b0
lock_acquire+0xd0/0x2e0
cpus_read_lock+0x49/0x120
stop_per_cpu_kthreads+0x7/0x60
start_kthread+0x105/0x120
start_per_cpu_kthreads+0xdd/0x140
osnoise_workload_start+0x261/0x2f0
osnoise_tracer_start+0x18/0x4
In start_kthread(), when kthread_run_on_cpu() fails, stop_per_cpu_kthreads()
is called while cpu_hotplug_lock is still read-held, so cpus_read_unlock()
should be called before stop_per_cpu_kthreads(). In addition, both
start_per_cpu_kthreads() and start_kthread() call the error handling
routine stop_per_cpu_kthreads(), which is redundant; only one call is
necessary.

To fix this, move stop_per_cpu_kthreads() out of start_kthread() and use
the return value of start_kthread() to detect kthread creation errors.

The same issue exists in osnoise_hotplug_workfn().
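For illustration, the recursive acquisition looks roughly like this (a
simplified sketch of the two paths, not the exact code):

	static int start_per_cpu_kthreads(void)
	{
		int cpu, retval;

		cpus_read_lock();			/* first cpus_read_lock() */
		for_each_cpu(cpu, current_mask) {
			retval = start_kthread(cpu);
			if (retval) {
				cpus_read_unlock();
				stop_per_cpu_kthreads();	/* second, redundant cleanup call */
				return retval;
			}
		}
		cpus_read_unlock();
		return 0;
	}

	static int start_kthread(unsigned int cpu)
	{
		...
		if (IS_ERR(kthread)) {
			/* cpu_hotplug_lock is still read-held by the caller here */
			stop_per_cpu_kthreads();	/* takes cpus_read_lock() again: AA deadlock */
			return -ENOMEM;
		}
		...
	}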
Reviewed-by: Yang Guang <yang.guang5@zte.com.cn>
Reviewed-by: Wang Yong <wang.yong12@zte.com.cn>
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
---
kernel/trace/trace_osnoise.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 92e16f03fa4e..38fb0c655f5b 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2029,7 +2029,6 @@ static int start_kthread(unsigned int cpu)
 
 	if (IS_ERR(kthread)) {
 		pr_err(BANNER "could not start sampling thread\n");
-		stop_per_cpu_kthreads();
 		return -ENOMEM;
 	}
 
@@ -2097,7 +2096,7 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
 		return;
 
 	guard(mutex)(&interface_lock);
-	guard(cpus_read_lock)();
+	cpus_read_lock();
 
 	if (!cpu_online(cpu))
 		return;
 
@@ -2105,7 +2104,12 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
 	if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
 		return;
 
-	start_kthread(cpu);
+	if (start_kthread(cpu)) {
+		cpus_read_unlock();
+		stop_per_cpu_kthreads();
+		return;
+	}
+	cpus_read_unlock();
 }
 
 static DECLARE_WORK(osnoise_hotplug_work, osnoise_hotplug_workfn);
--
2.15.2
Hi,
On Tue, Feb 25, 2025 at 12:31:32PM +0000, Ran Xiaokai wrote:
> From: Ran Xiaokai <ran.xiaokai@zte.com.cn>
>
> Lockdep reports the following possible deadlock:
> ============================================
> WARNING: possible recursive locking detected
> --------------------------------------------
> sh/31444 is trying to acquire lock:
> ffffffff82c51af0 (cpu_hotplug_lock){++++}-{0:0}, at:
> stop_per_cpu_kthreads+0x7/0x60
>
> but task is already holding lock:
> ffffffff82c51af0 (cpu_hotplug_lock){++++}-{0:0}, at:
> start_per_cpu_kthreads+0x28/0x140
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(cpu_hotplug_lock);
> lock(cpu_hotplug_lock);
>
> Call Trace:
> <TASK>
> __lock_acquire+0x1612/0x29b0
> lock_acquire+0xd0/0x2e0
> cpus_read_lock+0x49/0x120
> stop_per_cpu_kthreads+0x7/0x60
> start_kthread+0x105/0x120
> start_per_cpu_kthreads+0xdd/0x140
> osnoise_workload_start+0x261/0x2f0
> osnoise_tracer_start+0x18/0x4
>
> In start_kthread(), when kthread_run_on_cpu() fails, stop_per_cpu_kthreads()
> is called while cpu_hotplug_lock is still read-held, so cpus_read_unlock()
> should be called before stop_per_cpu_kthreads(). In addition, both
> start_per_cpu_kthreads() and start_kthread() call the error handling
> routine stop_per_cpu_kthreads(), which is redundant; only one call is
> necessary.
> To fix this, move stop_per_cpu_kthreads() out of start_kthread() and use
> the return value of start_kthread() to detect kthread creation errors.
> The same issue exists in osnoise_hotplug_workfn().
>
> Reviewed-by: Yang Guang <yang.guang5@zte.com.cn>
> Reviewed-by: Wang Yong <wang.yong12@zte.com.cn>
> Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
> ---
> kernel/trace/trace_osnoise.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index 92e16f03fa4e..38fb0c655f5b 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -2029,7 +2029,6 @@ static int start_kthread(unsigned int cpu)
>
> if (IS_ERR(kthread)) {
> pr_err(BANNER "could not start sampling thread\n");
> - stop_per_cpu_kthreads();
> return -ENOMEM;
> }
>
> @@ -2097,7 +2096,7 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
> return;
>
> guard(mutex)(&interface_lock);
> - guard(cpus_read_lock)();
> + cpus_read_lock();
>
> if (!cpu_online(cpu))
> return;
> @@ -2105,7 +2104,12 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
> if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
> return;
>
> - start_kthread(cpu);
> + if (start_kthread(cpu)) {
> + cpus_read_unlock();
> + stop_per_cpu_kthreads();
Is it right to call stop_per_cpu_kthreads(), which stops the osnoise
kthread on every other CPU in the system, if a failure occurs during
hotplug of a single CPU?
On another note, since stop_per_cpu_kthreads() invokes stop_kthread()
for every online CPU, it's better to remove stop_per_cpu_kthreads() from
start_kthread() and handle the error in osnoise_hotplug_workfn().
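For reference, stop_per_cpu_kthreads() is roughly the following (a
simplified sketch, not the exact code):

	static void stop_per_cpu_kthreads(void)
	{
		int cpu;

		cpus_read_lock();	/* the nested read lock that lockdep flags */
		for_each_online_cpu(cpu)
			stop_kthread(cpu);
		cpus_read_unlock();
	}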
Vishal
> + return;
> + }
> + cpus_read_unlock();
> }
>
> static DECLARE_WORK(osnoise_hotplug_work, osnoise_hotplug_workfn);
> --
> 2.15.2
>
>On Tue, Feb 25, 2025 at 12:31:32PM +0000, Ran Xiaokai wrote:
>> From: Ran Xiaokai <ran.xiaokai@zte.com.cn>
>>
>> Lockdep reports the following possible deadlock:
>> ============================================
>> WARNING: possible recursive locking detected
>> --------------------------------------------
>> sh/31444 is trying to acquire lock:
>> ffffffff82c51af0 (cpu_hotplug_lock){++++}-{0:0}, at:
>> stop_per_cpu_kthreads+0x7/0x60
>>
>> but task is already holding lock:
>> ffffffff82c51af0 (cpu_hotplug_lock){++++}-{0:0}, at:
>> start_per_cpu_kthreads+0x28/0x140
>>
>> other info that might help us debug this:
>> Possible unsafe locking scenario:
>>
>> CPU0
>> ----
>> lock(cpu_hotplug_lock);
>> lock(cpu_hotplug_lock);
>>
>> Call Trace:
>> <TASK>
>> __lock_acquire+0x1612/0x29b0
>> lock_acquire+0xd0/0x2e0
>> cpus_read_lock+0x49/0x120
>> stop_per_cpu_kthreads+0x7/0x60
>> start_kthread+0x105/0x120
>> start_per_cpu_kthreads+0xdd/0x140
>> osnoise_workload_start+0x261/0x2f0
>> osnoise_tracer_start+0x18/0x4
>>
>> In start_kthread(), when kthread_run_on_cpu() fails, stop_per_cpu_kthreads()
>> is called while cpu_hotplug_lock is still read-held, so cpus_read_unlock()
>> should be called before stop_per_cpu_kthreads(). In addition, both
>> start_per_cpu_kthreads() and start_kthread() call the error handling
>> routine stop_per_cpu_kthreads(), which is redundant; only one call is
>> necessary.
>> To fix this, move stop_per_cpu_kthreads() out of start_kthread() and use
>> the return value of start_kthread() to detect kthread creation errors.
>> The same issue exists in osnoise_hotplug_workfn().
>>
>> Reviewed-by: Yang Guang <yang.guang5@zte.com.cn>
>> Reviewed-by: Wang Yong <wang.yong12@zte.com.cn>
>> Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
>> ---
>> kernel/trace/trace_osnoise.c | 10 +++++++---
>> 1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
>> index 92e16f03fa4e..38fb0c655f5b 100644
>> --- a/kernel/trace/trace_osnoise.c
>> +++ b/kernel/trace/trace_osnoise.c
>> @@ -2029,7 +2029,6 @@ static int start_kthread(unsigned int cpu)
>>
>> if (IS_ERR(kthread)) {
>> pr_err(BANNER "could not start sampling thread\n");
>> - stop_per_cpu_kthreads();
>> return -ENOMEM;
>> }
>>
>> @@ -2097,7 +2096,7 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
>> return;
>>
>> guard(mutex)(&interface_lock);
>> - guard(cpus_read_lock)();
>> + cpus_read_lock();
>>
>> if (!cpu_online(cpu))
>> return;
>> @@ -2105,7 +2104,12 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
>> if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
>> return;
>>
>> - start_kthread(cpu);
>> + if (start_kthread(cpu)) {
>> + cpus_read_unlock();
>> + stop_per_cpu_kthreads();
>
>Is it right to call stop_per_cpu_kthreads(), which stops the osnoise
>kthread on every other CPU in the system, if a failure occurs during
>hotplug of a single CPU?
I also suspect that this is not reasonable behavior.
>On another note, since stop_per_cpu_kthreads() invokes stop_kthread()
>for every online CPU, it's better to remove stop_per_cpu_kthreads() from
>start_kthread() and handle the error in osnoise_hotplug_workfn().
Hi Vishal,
I did this in my first version, something like this:
@@ -2097,7 +2096,7 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
 		return;
 
 	guard(mutex)(&interface_lock);
-	guard(cpus_read_lock)();
+	cpus_read_lock();
 
-	if (!cpu_online(cpu))
+	if (!cpu_online(cpu)) {
+		cpus_read_unlock();
 		return;
+	}
 
-	if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
+	if (!cpumask_test_cpu(cpu, &osnoise_cpumask)) {
+		cpus_read_unlock();
 		return;
+	}
 
-	start_kthread(cpu);
+	if (start_kthread(cpu)) {
+		cpus_read_unlock();
+		stop_per_cpu_kthreads();
+		return;
+	}
+	cpus_read_unlock();
 }
We have to drop the guard() and call cpus_read_unlock() manually on every
exit path, which makes the code somewhat redundant.
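For comparison, scoped_guard() from <linux/cleanup.h> could keep the
automatic unlock while still ending the critical section before the
cleanup call; an untested sketch:

	int err = 0;

	scoped_guard(cpus_read_lock) {
		if (!cpu_online(cpu))
			return;

		if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
			return;

		err = start_kthread(cpu);
	}
	/* cpus_read_lock is dropped here, before the cleanup */
	if (err)
		stop_per_cpu_kthreads();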
> Vishal
>> + return;
>> + }
>> + cpus_read_unlock();
>> }
>>
>> static DECLARE_WORK(osnoise_hotplug_work, osnoise_hotplug_workfn);
>> --
>> 2.15.2
>>
On Tue, 25 Feb 2025 12:31:32 +0000
Ran Xiaokai <ranxiaokai627@163.com> wrote:
> @@ -2097,7 +2096,7 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
> return;
>
> guard(mutex)(&interface_lock);
> - guard(cpus_read_lock)();
> + cpus_read_lock();
>
> if (!cpu_online(cpu))
> return;
This is buggy. You removed the guard, and right below we have an error exit
that will leave this function without unlocking the cpus_read_lock().
> @@ -2105,7 +2104,12 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
> if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
> return;
>
> - start_kthread(cpu);
> + if (start_kthread(cpu)) {
> + cpus_read_unlock();
> + stop_per_cpu_kthreads();
> + return;
If all you want to do is to unlock before calling stop_per_cpu_kthreads(),
then this should simply be:
	if (start_kthread(cpu)) {
		cpus_read_unlock();
		stop_per_cpu_kthreads();
		cpus_read_lock(); // The guard() above will unlock this
		return;
	}
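This works because guard(cpus_read_lock)() takes the read lock at the
point of declaration and arranges for cpus_read_unlock() to run
automatically on every exit from the function, roughly:

	{
		guard(cpus_read_lock)();	/* cpus_read_lock() runs here */
		...
		return;				/* cpus_read_unlock() runs automatically */
	}

so re-taking the lock after the manual unlock keeps the guard's implicit
unlock balanced.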
But I still have to verify that this is indeed the issue here.
-- Steve
> + }
> + cpus_read_unlock();
> }
>
> static DECLARE_WORK(osnoise_hotplug_work, osnoise_hotplug_workfn);
>On Tue, 25 Feb 2025 12:31:32 +0000
>Ran Xiaokai <ranxiaokai627@163.com> wrote:
>
>> @@ -2097,7 +2096,7 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
>> return;
>>
>> guard(mutex)(&interface_lock);
>> - guard(cpus_read_lock)();
>> + cpus_read_lock();
>>
>> if (!cpu_online(cpu))
>> return;
>
>This is buggy. You removed the guard, and right below we have an error exit
>that will leave this function without unlocking the cpus_read_lock().
Indeed.
I will run the LTP cpu-hotplug testcases before the next version.
>> @@ -2105,7 +2104,12 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
>> if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
>> return;
>>
>> - start_kthread(cpu);
>> + if (start_kthread(cpu)) {
>> + cpus_read_unlock();
>> + stop_per_cpu_kthreads();
>> + return;
>
>If all you want to do is to unlock before calling stop_per_cpu_kthreads(),
>then this should simply be:
>
> 	if (start_kthread(cpu)) {
> 		cpus_read_unlock();
> 		stop_per_cpu_kthreads();
> 		cpus_read_lock(); // The guard() above will unlock this
> 		return;
> 	}
This is the deadlock scenario:

start_per_cpu_kthreads()
	cpus_read_lock();		// first lock call
	start_kthread(cpu)
		... kthread_run_on_cpu() fails:
		if (IS_ERR(kthread)) {
			stop_per_cpu_kthreads() {
				cpus_read_lock();	// second lock call, causing the AA deadlock
			}
		}
	stop_per_cpu_kthreads();	// the redundant cleanup call in the caller's error path

Besides, stop_per_cpu_kthreads() is called both in start_per_cpu_kthreads()
and start_kthread(), which is unnecessary.
So the fix should be inside start_kthread()?
How about this?
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2029,7 +2029,9 @@ static int start_kthread(unsigned int cpu)
 
 	if (IS_ERR(kthread)) {
 		pr_err(BANNER "could not start sampling thread\n");
+		cpus_read_unlock();
 		stop_per_cpu_kthreads();
+		cpus_read_lock();
 		return -ENOMEM;
 	}
 
@@ -2076,7 +2078,6 @@ static int start_per_cpu_kthreads(void)
 		retval = start_kthread(cpu);
 		if (retval) {
 			cpus_read_unlock();
-			stop_per_cpu_kthreads();
 			return retval;
 		}
 	}
>
>But I still have to verify that this is indeed the issue here.
>
>-- Steve
>
>
>> + }
>> + cpus_read_unlock();
>> }
>>
>> static DECLARE_WORK(osnoise_hotplug_work, osnoise_hotplug_workfn);
On Wed, 26 Feb 2025 03:42:53 +0000
Ran Xiaokai <ranxiaokai627@163.com> wrote:
> >> @@ -2105,7 +2104,12 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
> >> if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
> >> return;
> >>
> >> - start_kthread(cpu);
> >> + if (start_kthread(cpu)) {
> >> + cpus_read_unlock();
> >> + stop_per_cpu_kthreads();
> >> + return;
> >
> >If all you want to do is to unlock before calling stop_per_cpu_kthreads(),
> >then this should simply be:
> >
> > 	if (start_kthread(cpu)) {
> > 		cpus_read_unlock();
> > 		stop_per_cpu_kthreads();
> > 		cpus_read_lock(); // The guard() above will unlock this
> > 		return;
> > 	}
>
> This is the deadlock scenario:
>
> start_per_cpu_kthreads()
> 	cpus_read_lock();		// first lock call
> 	start_kthread(cpu)
> 		... kthread_run_on_cpu() fails:
> 		if (IS_ERR(kthread)) {
> 			stop_per_cpu_kthreads() {
> 				cpus_read_lock();	// second lock call, causing the AA deadlock
> 			}
> 		}
> 	stop_per_cpu_kthreads();	// the redundant cleanup call in the caller's error path
>
> Besides, stop_per_cpu_kthreads() is called both in start_per_cpu_kthreads()
> and start_kthread(), which is unnecessary.
>
> So the fix should be inside start_kthread()?
> How about this?
No! You misunderstood what I wrote above.
Instead of removing the guard, keep it!
Do everything the same, but instead of having the error path of:
[..]
	if (start_kthread(cpu)) {
		cpus_read_unlock();
		stop_per_cpu_kthreads();
		return;
	}
	cpus_read_unlock();
}
Which requires removing the guard. Just do:
	if (start_kthread(cpu)) {
		cpus_read_unlock();
		stop_per_cpu_kthreads();
		cpus_read_lock(); // The guard() will unlock this
	}
}
I'm just saying not to replace the guard with open-coded locking of
cpus_read_lock().
-- Steve
>On Wed, 26 Feb 2025 03:42:53 +0000
>Ran Xiaokai <ranxiaokai627@163.com> wrote:
>
>> >> @@ -2105,7 +2104,12 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
>> >> if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
>> >> return;
>> >>
>> >> - start_kthread(cpu);
>> >> + if (start_kthread(cpu)) {
>> >> + cpus_read_unlock();
>> >> + stop_per_cpu_kthreads();
>> >> + return;
>> >
>> >If all you want to do is to unlock before calling stop_per_cpu_kthreads(),
>> >then this should simply be:
>> >
>> > 	if (start_kthread(cpu)) {
>> > 		cpus_read_unlock();
>> > 		stop_per_cpu_kthreads();
>> > 		cpus_read_lock(); // The guard() above will unlock this
>> > 		return;
>> > 	}
>>
>> This is the deadlock scenario:
>>
>> start_per_cpu_kthreads()
>> 	cpus_read_lock();		// first lock call
>> 	start_kthread(cpu)
>> 		... kthread_run_on_cpu() fails:
>> 		if (IS_ERR(kthread)) {
>> 			stop_per_cpu_kthreads() {
>> 				cpus_read_lock();	// second lock call, causing the AA deadlock
>> 			}
>> 		}
>> 	stop_per_cpu_kthreads();	// the redundant cleanup call in the caller's error path
>>
>> Besides, stop_per_cpu_kthreads() is called both in start_per_cpu_kthreads()
>> and start_kthread(), which is unnecessary.
>>
>> So the fix should be inside start_kthread()?
>> How about this?
>
>No! You misunderstood what I wrote above.
>
>Instead of removing the guard, keep it!
>
>Do everything the same, but instead of having the error path of:
>
>[..]
> 	if (start_kthread(cpu)) {
> 		cpus_read_unlock();
> 		stop_per_cpu_kthreads();
> 		return;
> 	}
> 	cpus_read_unlock();
> }
>
>Which requires removing the guard. Just do:
>
> 	if (start_kthread(cpu)) {
> 		cpus_read_unlock();
> 		stop_per_cpu_kthreads();
> 		cpus_read_lock(); // The guard() will unlock this
> 	}
> }
Hi Steve,
Sorry for the late response.
Yes, this will fix the deadlock issue.
What I mentioned before is that there is a redundant call of
stop_per_cpu_kthreads() in start_per_cpu_kthreads():
start_per_cpu_kthreads()
	for_each_cpu(cpu, current_mask) {
		retval = start_kthread(cpu);
		{ /* inside start_kthread() */
			if (IS_ERR(kthread)) {
				stop_per_cpu_kthreads(); // first cleanup call of stop_per_cpu_kthreads()
				return -ENOMEM;
			}
		}
		if (retval) {
			cpus_read_unlock();
			stop_per_cpu_kthreads(); // the redundant call of stop_per_cpu_kthreads()
But the second call will not cause any trouble, so I will send a v2
according to your suggestion.
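With that, the hotplug callback in v2 should look roughly like this (a
sketch; the checks above the guards are elided):

	static void osnoise_hotplug_workfn(struct work_struct *dummy)
	{
		unsigned int cpu = smp_processor_id();

		/* ... earlier checks elided ... */

		guard(mutex)(&interface_lock);
		guard(cpus_read_lock)();

		if (!cpu_online(cpu))
			return;

		if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
			return;

		if (start_kthread(cpu)) {
			cpus_read_unlock();
			stop_per_cpu_kthreads();
			cpus_read_lock();	/* the guard() above does the final unlock */
		}
	}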
>I'm just saying not to replace the guard with open-coded locking of
>cpus_read_lock().
>
>-- Steve