[PATCH] xen/sched: fix get_cpu_idle_time() for smt=0 suspend/resume

Juergen Gross posted 1 patch 2 years, 8 months ago
Failed in applying to current master (apply log)
xen/common/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH] xen/sched: fix get_cpu_idle_time() for smt=0 suspend/resume
Posted by Juergen Gross 2 years, 8 months ago
With smt=0 during a suspend/resume cycle of the machine the threads
which have been parked before will briefly come up again. This can
result in problems e.g. with cpufreq driver being active as this will
call into get_cpu_idle_time() for a cpu without initialized scheduler
data.

Fix that by letting get_cpu_idle_time() deal with this case.

Fixes: 132cbe8f35632fb2 ("sched: fix get_cpu_idle_time() with core scheduling")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
An alternative way to fix the issue would be to keep the sched_resource
of offline cpus allocated like we already do with idle vcpus and units.
This fix would be more intrusive, but it would avoid similar other bugs
like this one.
---
 xen/common/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 6d34764d38..9ac1b01ca8 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -337,7 +337,7 @@ uint64_t get_cpu_idle_time(unsigned int cpu)
     struct vcpu_runstate_info state = { 0 };
     const struct vcpu *v = idle_vcpu[cpu];
 
-    if ( cpu_online(cpu) && v )
+    if ( cpu_online(cpu) && v && get_sched_res(cpu) )
         vcpu_runstate_get(v, &state);
 
     return state.time[RUNSTATE_running];
-- 
2.26.2


Re: [PATCH] xen/sched: fix get_cpu_idle_time() for smt=0 suspend/resume
Posted by Dario Faggioli 2 years, 8 months ago
On Wed, 2021-08-18 at 12:21 +0200, Juergen Gross wrote:
> Fix that by letting get_cpu_idle_time() deal with this case.
> 
> Fixes: 132cbe8f35632fb2 ("sched: fix get_cpu_idle_time() with core
> scheduling")
> Reported-by: Marek Marczykowski-Górecki
> <marmarek@invisiblethingslab.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Tested-by: Marek Marczykowski-Górecki
> <marmarek@invisiblethingslab.com>
>
Mmm... This is an interesting one. In fact, this fix is not only
correct, it's also simple, effective and (I guess) easy enough to
backport.

Considering all these things together with the fact that we have an
open issue, I'm going to provide my:

Acked-by: Dario Faggioli <dfaggioli@suse.com>

(and this remains valid with Jan's proposed change done upon
committing.)

That said, in the long run...

> ---
> An alternative way to fix the issue would be to keep the
> sched_resource
> of offline cpus allocated like we already do with idle vcpus and
> units.
> This fix would be more intrusive, but it would avoid similar other
> bugs
> like this one.
> 
... it would be probably interesting to go this route, as it looks both
more consistent and future proof (I mean implement it proactively,
independently of issues... when/if we have time, of course!)

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] xen/sched: fix get_cpu_idle_time() for smt=0 suspend/resume
Posted by Jan Beulich 2 years, 8 months ago
On 18.08.2021 12:21, Juergen Gross wrote:
> With smt=0 during a suspend/resume cycle of the machine the threads
> which have been parked before will briefly come up again. This can
> result in problems e.g. with cpufreq driver being active as this will
> call into get_cpu_idle_time() for a cpu without initialized scheduler
> data.
> 
> Fix that by letting get_cpu_idle_time() deal with this case.
> 
> Fixes: 132cbe8f35632fb2 ("sched: fix get_cpu_idle_time() with core scheduling")
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -337,7 +337,7 @@ uint64_t get_cpu_idle_time(unsigned int cpu)
>      struct vcpu_runstate_info state = { 0 };
>      const struct vcpu *v = idle_vcpu[cpu];
>  
> -    if ( cpu_online(cpu) && v )
> +    if ( cpu_online(cpu) && v && get_sched_res(cpu) )
>          vcpu_runstate_get(v, &state);

My earlier question was aiming at getting rid of the (now) middle part
of the condition; I thought this may be okay to do as a secondary change
here. But perhaps you intentionally left it there, so I'm unsure whether
to suggest to make the adjustment while committing (awaiting a
maintainer ack first anyway).

Jan


Re: [PATCH] xen/sched: fix get_cpu_idle_time() for smt=0 suspend/resume
Posted by Juergen Gross 2 years, 8 months ago
On 18.08.21 12:35, Jan Beulich wrote:
> On 18.08.2021 12:21, Juergen Gross wrote:
>> With smt=0 during a suspend/resume cycle of the machine the threads
>> which have been parked before will briefly come up again. This can
>> result in problems e.g. with cpufreq driver being active as this will
>> call into get_cpu_idle_time() for a cpu without initialized scheduler
>> data.
>>
>> Fix that by letting get_cpu_idle_time() deal with this case.
>>
>> Fixes: 132cbe8f35632fb2 ("sched: fix get_cpu_idle_time() with core scheduling")
>> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
>> --- a/xen/common/sched/core.c
>> +++ b/xen/common/sched/core.c
>> @@ -337,7 +337,7 @@ uint64_t get_cpu_idle_time(unsigned int cpu)
>>       struct vcpu_runstate_info state = { 0 };
>>       const struct vcpu *v = idle_vcpu[cpu];
>>   
>> -    if ( cpu_online(cpu) && v )
>> +    if ( cpu_online(cpu) && v && get_sched_res(cpu) )
>>           vcpu_runstate_get(v, &state);
> 
> My earlier question was aiming at getting rid of the (now) middle part
> of the condition; I thought this may be okay to do as a secondary change
> here. But perhaps you intentionally left it there, so I'm unsure whether
> to suggest to make the adjustment while committing (awaiting a
> maintainer ack first anyway).

Ah, okay. Yes, I think the test of v being non-NULL can be removed.


Juergen