[PATCH] cpufreq: governor: Fix race between sysfs store and dbs work handler

Zhongqiu Han posted 1 patch 2 months, 1 week ago
drivers/cpufreq/cpufreq_governor.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
[PATCH] cpufreq: governor: Fix race between sysfs store and dbs work handler
Posted by Zhongqiu Han 2 months, 1 week ago
gov_update_cpu_data() resets per-CPU prev_cpu_idle and prev_cpu_nice
for every CPU in the governed domain. It is called from sysfs store
callbacks (e.g. ignore_nice_load_store) which run under
attr_set->update_lock, held by the surrounding governor_store().

Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which
calls dbs_update()) under policy_dbs->update_mutex. dbs_update() both
reads and writes the same prev_cpu_idle / prev_cpu_nice fields. The
potential race path is:

Path A (sysfs write, holds attr_set->update_lock only):

  governor_store()
    mutex_lock(&attr_set->update_lock)
    ignore_nice_load_store()
      dbs_data->ignore_nice_load = input
      gov_update_cpu_data(dbs_data)
        list_for_each_entry(policy_dbs, ...)
          for_each_cpu(j, ...)
            j_cdbs->prev_cpu_idle = get_cpu_idle_time(...)  /* write */
            j_cdbs->prev_cpu_nice = kcpustat_field(...)     /* write */
    mutex_unlock(&attr_set->update_lock)

Path B (work queue, holds policy_dbs->update_mutex only):

  dbs_work_handler()
    mutex_lock(&policy_dbs->update_mutex)
    gov->gov_dbs_update(policy)
      dbs_update()
        for_each_cpu(j, policy->cpus)
          idle_time = cur - j_cdbs->prev_cpu_idle           /* read  */
          j_cdbs->prev_cpu_idle = cur_idle_time             /* write */
          idle_time += cur_nice - j_cdbs->prev_cpu_nice     /* read  */
          j_cdbs->prev_cpu_nice = cur_nice                  /* write */
    mutex_unlock(&policy_dbs->update_mutex)

Because attr_set->update_lock and policy_dbs->update_mutex are two
completely independent locks, the two paths are not mutually exclusive.
This results in a data race on cpu_dbs_info.prev_cpu_idle and
cpu_dbs_info.prev_cpu_nice.

Fix this by also acquiring policy_dbs->update_mutex in
gov_update_cpu_data() for each policy, so that path A participates in
the mutual exclusion already established by dbs_work_handler(). Also
update the function comment to accurately reflect the two-level locking
contract.

The root of this race dates back to the original ondemand/conservative
governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
locking in ondemand governor"), all accesses to prev_cpu_idle and
prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
(path Y), and do_dbs_timer() (path Z) were serialised by the same
dbs_mutex, so no race existed. Those two commits switched do_dbs_timer()
from dbs_mutex to a per-policy/per-cpu timer_mutex to reduce lock
contention, but left store_ignore_nice_load() still holding dbs_mutex.
As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
between store_ignore_nice_load() and dbs_check_cpu().

Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
dbs_mutex from store_ignore_nice_load() entirely, introducing an
additional potential race between store_ignore_nice_load() (path Y, now
lockless) and cpufreq_governor_dbs() (path X, still holding dbs_mutex),
while the race between path Y and path Z remained.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
---
 drivers/cpufreq/cpufreq_governor.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 86f35e451914..56ef793362db 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -90,7 +90,8 @@ EXPORT_SYMBOL_GPL(sampling_rate_store);
  * (that may be a single policy or a bunch of them if governor tunables are
  * system-wide).
  *
- * Call under the @dbs_data mutex.
+ * Call under the @dbs_data->attr_set.update_lock. The per-policy
+ * update_mutex is acquired and released internally for each policy.
  */
 void gov_update_cpu_data(struct dbs_data *dbs_data)
 {
@@ -99,6 +100,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
 	list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) {
 		unsigned int j;
 
+		mutex_lock(&policy_dbs->update_mutex);
 		for_each_cpu(j, policy_dbs->policy->cpus) {
 			struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
 
@@ -107,6 +109,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
 			if (dbs_data->ignore_nice_load)
 				j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 		}
+		mutex_unlock(&policy_dbs->update_mutex);
 	}
 }
 EXPORT_SYMBOL_GPL(gov_update_cpu_data);
-- 
2.43.0
Re: [PATCH] cpufreq: governor: Fix race between sysfs store and dbs work handler
Posted by Rafael J. Wysocki 2 months, 1 week ago
On Mon, Apr 6, 2026 at 1:01 PM Zhongqiu Han
<zhongqiu.han@oss.qualcomm.com> wrote:
>
> gov_update_cpu_data() resets per-CPU prev_cpu_idle and prev_cpu_nice
> for every CPU in the governed domain. It is called from sysfs store
> callbacks (e.g. ignore_nice_load_store) which run under
> attr_set->update_lock, held by the surrounding governor_store().
>
> Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which
> calls dbs_update()) under policy_dbs->update_mutex. dbs_update() both
> reads and writes the same prev_cpu_idle / prev_cpu_nice fields. The
> potential race path is:
>
> Path A (sysfs write, holds attr_set->update_lock only):
>
>   governor_store()
>     mutex_lock(&attr_set->update_lock)
>     ignore_nice_load_store()
>       dbs_data->ignore_nice_load = input
>       gov_update_cpu_data(dbs_data)
>         list_for_each_entry(policy_dbs, ...)
>           for_each_cpu(j, ...)
>             j_cdbs->prev_cpu_idle = get_cpu_idle_time(...)  /* write */
>             j_cdbs->prev_cpu_nice = kcpustat_field(...)     /* write */
>     mutex_unlock(&attr_set->update_lock)
>
> Path B (work queue, holds policy_dbs->update_mutex only):
>
>   dbs_work_handler()
>     mutex_lock(&policy_dbs->update_mutex)
>     gov->gov_dbs_update(policy)
>       dbs_update()
>         for_each_cpu(j, policy->cpus)
>           idle_time = cur - j_cdbs->prev_cpu_idle           /* read  */
>           j_cdbs->prev_cpu_idle = cur_idle_time             /* write */
>           idle_time += cur_nice - j_cdbs->prev_cpu_nice     /* read  */
>           j_cdbs->prev_cpu_nice = cur_nice                  /* write */
>     mutex_unlock(&policy_dbs->update_mutex)
>
> Because attr_set->update_lock and policy_dbs->update_mutex are two
> completely independent locks, the two paths are not mutually exclusive.
> This results in a data race on cpu_dbs_info.prev_cpu_idle and
> cpu_dbs_info.prev_cpu_nice.
>
> Fix this by also acquiring policy_dbs->update_mutex in
> gov_update_cpu_data() for each policy, so that path A participates in
> the mutual exclusion already established by dbs_work_handler(). Also
> update the function comment to accurately reflect the two-level locking
> contract.
>
> The root of this race dates back to the original ondemand/conservative
> governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
> conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
> locking in ondemand governor"), all accesses to prev_cpu_idle and
> prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
> (path Y), and do_dbs_timer() (path Z) were serialised by the same
> dbs_mutex, so no race existed. Those two commits switched do_dbs_timer()
> from dbs_mutex to a per-policy/per-cpu timer_mutex to reduce lock
> contention, but left store_ignore_nice_load() still holding dbs_mutex.
> As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
> common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
> between store_ignore_nice_load() and dbs_check_cpu().
>
> Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
> dbs_mutex from store_ignore_nice_load() entirely, introducing an
> additional potential race between store_ignore_nice_load() (path Y, now
> lockless) and cpufreq_governor_dbs() (path X, still holding dbs_mutex),
> while the race between path Y and path Z remained.
>
> Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
> Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
> Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
> Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
> ---
>  drivers/cpufreq/cpufreq_governor.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
> index 86f35e451914..56ef793362db 100644
> --- a/drivers/cpufreq/cpufreq_governor.c
> +++ b/drivers/cpufreq/cpufreq_governor.c
> @@ -90,7 +90,8 @@ EXPORT_SYMBOL_GPL(sampling_rate_store);
>   * (that may be a single policy or a bunch of them if governor tunables are
>   * system-wide).
>   *
> - * Call under the @dbs_data mutex.
> + * Call under the @dbs_data->attr_set.update_lock. The per-policy
> + * update_mutex is acquired and released internally for each policy.
>   */
>  void gov_update_cpu_data(struct dbs_data *dbs_data)
>  {
> @@ -99,6 +100,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
>         list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) {
>                 unsigned int j;
>
> +               mutex_lock(&policy_dbs->update_mutex);
>                 for_each_cpu(j, policy_dbs->policy->cpus) {
>                         struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
>
> @@ -107,6 +109,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
>                         if (dbs_data->ignore_nice_load)
>                                 j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
>                 }
> +               mutex_unlock(&policy_dbs->update_mutex);
>         }
>  }
>  EXPORT_SYMBOL_GPL(gov_update_cpu_data);
> --

Please have a look at

https://sashiko.dev/#/patchset/20260406110113.3475920-1-zhongqiu.han%40oss.qualcomm.com

and let me know what you think.

Thanks!
Re: [PATCH] cpufreq: governor: Fix race between sysfs store and dbs work handler
Posted by Zhongqiu Han 2 months, 1 week ago
On 4/6/2026 11:21 PM, Rafael J. Wysocki wrote:
> On Mon, Apr 6, 2026 at 1:01 PM Zhongqiu Han
> <zhongqiu.han@oss.qualcomm.com> wrote:
>>
>> gov_update_cpu_data() resets per-CPU prev_cpu_idle and prev_cpu_nice
>> for every CPU in the governed domain. It is called from sysfs store
>> callbacks (e.g. ignore_nice_load_store) which run under
>> attr_set->update_lock, held by the surrounding governor_store().
>>
>> Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which
>> calls dbs_update()) under policy_dbs->update_mutex. dbs_update() both
>> reads and writes the same prev_cpu_idle / prev_cpu_nice fields. The
>> potential race path is:
>>
>> Path A (sysfs write, holds attr_set->update_lock only):
>>
>>    governor_store()
>>      mutex_lock(&attr_set->update_lock)
>>      ignore_nice_load_store()
>>        dbs_data->ignore_nice_load = input
>>        gov_update_cpu_data(dbs_data)
>>          list_for_each_entry(policy_dbs, ...)
>>            for_each_cpu(j, ...)
>>              j_cdbs->prev_cpu_idle = get_cpu_idle_time(...)  /* write */
>>              j_cdbs->prev_cpu_nice = kcpustat_field(...)     /* write */
>>      mutex_unlock(&attr_set->update_lock)
>>
>> Path B (work queue, holds policy_dbs->update_mutex only):
>>
>>    dbs_work_handler()
>>      mutex_lock(&policy_dbs->update_mutex)
>>      gov->gov_dbs_update(policy)
>>        dbs_update()
>>          for_each_cpu(j, policy->cpus)
>>            idle_time = cur - j_cdbs->prev_cpu_idle           /* read  */
>>            j_cdbs->prev_cpu_idle = cur_idle_time             /* write */
>>            idle_time += cur_nice - j_cdbs->prev_cpu_nice     /* read  */
>>            j_cdbs->prev_cpu_nice = cur_nice                  /* write */
>>      mutex_unlock(&policy_dbs->update_mutex)
>>
>> Because attr_set->update_lock and policy_dbs->update_mutex are two
>> completely independent locks, the two paths are not mutually exclusive.
>> This results in a data race on cpu_dbs_info.prev_cpu_idle and
>> cpu_dbs_info.prev_cpu_nice.
>>
>> Fix this by also acquiring policy_dbs->update_mutex in
>> gov_update_cpu_data() for each policy, so that path A participates in
>> the mutual exclusion already established by dbs_work_handler(). Also
>> update the function comment to accurately reflect the two-level locking
>> contract.
>>
>> The root of this race dates back to the original ondemand/conservative
>> governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
>> conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
>> locking in ondemand governor"), all accesses to prev_cpu_idle and
>> prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
>> (path Y), and do_dbs_timer() (path Z) were serialised by the same
>> dbs_mutex, so no race existed. Those two commits switched do_dbs_timer()
>> from dbs_mutex to a per-policy/per-cpu timer_mutex to reduce lock
>> contention, but left store_ignore_nice_load() still holding dbs_mutex.
>> As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
>> common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
>> between store_ignore_nice_load() and dbs_check_cpu().
>>
>> Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
>> dbs_mutex from store_ignore_nice_load() entirely, introducing an
>> additional potential race between store_ignore_nice_load() (path Y, now
>> lockless) and cpufreq_governor_dbs() (path X, still holding dbs_mutex),
>> while the race between path Y and path Z remained.
>>
>> Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
>> Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
>> Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
>> Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
>> ---
>>   drivers/cpufreq/cpufreq_governor.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
>> index 86f35e451914..56ef793362db 100644
>> --- a/drivers/cpufreq/cpufreq_governor.c
>> +++ b/drivers/cpufreq/cpufreq_governor.c
>> @@ -90,7 +90,8 @@ EXPORT_SYMBOL_GPL(sampling_rate_store);
>>    * (that may be a single policy or a bunch of them if governor tunables are
>>    * system-wide).
>>    *
>> - * Call under the @dbs_data mutex.
>> + * Call under the @dbs_data->attr_set.update_lock. The per-policy
>> + * update_mutex is acquired and released internally for each policy.
>>    */
>>   void gov_update_cpu_data(struct dbs_data *dbs_data)
>>   {
>> @@ -99,6 +100,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
>>          list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) {
>>                  unsigned int j;
>>
>> +               mutex_lock(&policy_dbs->update_mutex);
>>                  for_each_cpu(j, policy_dbs->policy->cpus) {
>>                          struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
>>
>> @@ -107,6 +109,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
>>                          if (dbs_data->ignore_nice_load)
>>                                  j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
>>                  }
>> +               mutex_unlock(&policy_dbs->update_mutex);
>>          }
>>   }
>>   EXPORT_SYMBOL_GPL(gov_update_cpu_data);
>> --
> 
> Please have a look at
> 
> https://sashiko.dev/#/patchset/20260406110113.3475920-1-zhongqiu.han%40oss.qualcomm.com
> 
> and let me know what you think.
> 
> Thanks!

Hi Rafael,

Thanks for the review,

sashiko.dev points out two pre‑existing gaps rather than regressions
introduced by this patch.

Issue 1: There is a window where sysfs writes can race with the
unprotected initialization loop in cpufreq_dbs_governor_start(), which
may lead to races. I will address this by protecting the initialization
loop with policy_dbs->update_mutex.

Issue 2: There is a residual race where dbs_update() may observe a newly
updated tunable together with stale counters, leading to an inflated
idle delta. I plan to address this by always updating prev_cpu_nice in
dbs_update(), regardless of ignore_nice, to avoid stale state. I will
double‑check the logic and address this in patch v2.


-- 
Thx and BRs,
Zhongqiu Han