[RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative

David Woodhouse posted 21 patches 1 year, 6 months ago
Posted by David Woodhouse 1 year, 6 months ago
From: David Woodhouse <dwmw@amazon.co.uk>

In steal_account_process_time(), a delta is calculated between the value
returned by paravirt_steal_clock(), and this_rq()->prev_steal_time which
is assumed to be the *previous* value returned by paravirt_steal_clock().

However, instead of just assigning the newly-read value directly into
->prev_steal_time for use in the next iteration, ->prev_steal_time is
*incremented* by the calculated delta.

This used to be roughly the same, modulo conversion to jiffies and back,
until commit 807e5b80687c0 ("sched/cputime: Add steal time support to
full dynticks CPU time accounting") started clamping that delta to a
maximum of the actual time elapsed. So now, if the value returned by
paravirt_steal_clock() jumps by a large amount, instead of a *single*
period of reporting 100% steal time, the system will report 100% steal
time for as long as it takes to "catch up" with the reported value,
which can take up to 584 years (the full u64 range in nanoseconds:
2^64 ns is roughly 584 years).

But there is a benefit to advancing ->prev_steal_time only by the time
which was *accounted* as having been stolen. It means that any extra
time truncated by the clamping will be accounted in the next sample
period rather than lost. Given the stochastic nature of the sampling,
that is more accurate overall.

So, continue to advance ->prev_steal_time by the accounted value as
long as the delta isn't egregiously large (using maxtime * 2 as the
threshold for "egregious"). If the delta is larger than that, just set
->prev_steal_time directly to the value returned by
paravirt_steal_clock().
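
As an illustration only (not kernel code; the names and the user-space
framing below are hypothetical), the rule above amounts to something
like this, where a huge jump in the steal clock is absorbed in a single
period instead of being drip-fed at maxtime per period:

#include <stdint.h>
#include <stdio.h>

static uint64_t prev_steal_time;

/* Account one sample period: 'now' is the raw paravirt steal clock. */
static uint64_t account_one_period(uint64_t now, uint64_t maxtime)
{
	uint64_t steal = now - prev_steal_time;

	if (steal > maxtime) {
		/*
		 * Modest overshoot: clamp to maxtime and only advance
		 * prev_steal_time by maxtime, so the remainder is
		 * accounted in the next period. An egregious jump (or a
		 * backwards value, which wraps to a huge u64 delta)
		 * instead resynchronises prev_steal_time to the raw
		 * value so we don't keep "catching up" forever.
		 */
		if (steal < maxtime * 2)
			now = prev_steal_time + maxtime;
		steal = maxtime;
	}
	prev_steal_time = now;
	return steal;
}

int main(void)
{
	/* A huge jump is absorbed in a single clamped period... */
	printf("%llu\n",
	       (unsigned long long)account_one_period(1ULL << 40, 1000000));
	/* ...and the next period accounts only what was newly stolen. */
	printf("%llu\n",
	       (unsigned long long)account_one_period((1ULL << 40) + 500,
						      1000000));
	return 0;
}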

Fixes: 807e5b80687c0 ("sched/cputime: Add steal time support to full dynticks CPU time accounting")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 kernel/sched/cputime.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..3a8a8b38966d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -254,13 +254,21 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
-		u64 steal;
-
-		steal = paravirt_steal_clock(smp_processor_id());
-		steal -= this_rq()->prev_steal_time;
-		steal = min(steal, maxtime);
+		u64 steal, abs_steal;
+
+		abs_steal = paravirt_steal_clock(smp_processor_id());
+		steal = abs_steal - this_rq()->prev_steal_time;
+		if (unlikely(steal > maxtime)) {
+			/*
+			 * If the delta isn't egregious, it can be counted
+			 * in the next time period. Only advance by maxtime.
+			 */
+			if (steal < maxtime * 2)
+				abs_steal = this_rq()->prev_steal_time + maxtime;
+			steal = maxtime;
+		}
 		account_steal_time(steal);
-		this_rq()->prev_steal_time += steal;
+		this_rq()->prev_steal_time = abs_steal;
 
 		return steal;
 	}
-- 
2.44.0
Re: [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative
Posted by Sean Christopherson 1 year, 3 months ago
This should be posted To: something other than kvm@, in a separate series, else
it's bound to get lost/ignored.

Re: [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative
Posted by Shrikanth Hegde 1 year, 5 months ago

On 5/22/24 5:47 AM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> In steal_account_process_time(), a delta is calculated between the value
> returned by paravirt_steal_clock(), and this_rq()->prev_steal_time which
> is assumed to be the *previous* value returned by paravirt_steal_clock().
> 
> However, instead of just assigning the newly-read value directly into
> ->prev_steal_time for use in the next iteration, ->prev_steal_time is
> *incremented* by the calculated delta.


Does this happen because ULONG_MAX and u64 aren't the same size? If so,
would using the u64 variant of MAX be a simpler fix?

Re: [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative
Posted by Paul Durrant 1 year, 6 months ago

Reviewed-by: Paul Durrant <paul@xen.org>