In extreme test scenarios:
the 14th field (utime) in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In cputime_adjust(), stime becomes greater than rtime due to a
mul_u64_u64_div_u64() precision problem.

Before calling mul_u64_u64_div_u64():
  stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
After calling mul_u64_u64_div_u64():
  stime = 135989949653530

An unsigned underflow occurs because rtime is less than stime:
  utime = rtime - stime = 135989749728000 - 135989949653530
        = -199925530
        = (u64)18446744073709518790
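For illustration only (a user-space sketch, not kernel code; the two values
are the ones reported above), the wraparound looks like this:

#include <stdio.h>

int main(void)
{
	unsigned long long rtime = 135989749728000ULL;	/* sum_exec_runtime */
	unsigned long long stime = 135989949653530ULL;	/* scaled stime, > rtime */
	unsigned long long utime = rtime - stime;	/* unsigned subtraction wraps */

	/* utime ends up just below 2^64 instead of a small positive value */
	printf("utime = %llu\n", utime);
	return 0;
}

With the clamp added by the patch below, stime can never exceed rtime, so
this subtraction cannot wrap.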
Trigger conditions:
  1). The user task runs in kernel mode most of the time
  2). ARM64 architecture
  3). TICK_CPU_ACCOUNTING=y
      CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

Fix the mul_u64_u64_div_u64() precision loss by resetting stime to rtime
when the scaled stime exceeds rtime.
v2:
- Add comment
- Update trigger condition
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
---
kernel/sched/cputime.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index aa48b2ec879d..4feef0d4e449 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 	}
 
 	stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
+	/*
+	 * Because mul_u64_u64_div_u64() can approximate on some
+	 * architectures, enforce the constraint that: a*b/(b+c) <= a.
+	 */
+	if (unlikely(stime > rtime))
+		stime = rtime;
 
 update:
 	/*
--
2.34.1
On 07/26, Zheng Zucheng wrote:
>
> before call mul_u64_u64_div_u64(),
> stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
So stime + utime == 175138003500000
> after call mul_u64_u64_div_u64(),
> stime = 135989949653530
Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
returns 135989749728000 == rtime, see below.
Nevermind...
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> }
>
> stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> +	/*
> +	 * Because mul_u64_u64_div_u64() can approximate on some
> +	 * architectures, enforce the constraint that: a*b/(b+c) <= a.
> +	 */
> +	if (unlikely(stime > rtime))
> +		stime = rtime;
Thanks,
Acked-by: Oleg Nesterov <oleg@redhat.com>
-------------------------------------------------------------------------------
But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
See the new() function in the code below.
Say, with the numbers above I get
$ ./test 175136586720000 135989749728000 175138003500000
old -> 135989749728000 e=1100089950.609375
new -> 135988649638050 e=0.609375
Oleg.
-------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
typedef unsigned long long u64;
static inline int fls64(u64 x)
{
	int bitpos = -1;
	/*
	 * AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the
	 * dest reg is undefined if x==0, but their CPU architect says its
	 * value is written to set it to the same as before.
	 */
	asm("bsrq %1,%q0"
	    : "+r" (bitpos)
	    : "rm" (x));
	return bitpos + 1;
}

static inline int ilog2(u64 n)
{
	return fls64(n) - 1;
}
#define swap(a, b) \
	do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

static inline u64 div64_u64_rem(u64 dividend, u64 divisor, u64 *remainder)
{
	*remainder = dividend % divisor;
	return dividend / divisor;
}

static inline u64 div64_u64(u64 dividend, u64 divisor)
{
	return dividend / divisor;
}
//-----------------------------------------------------------------------------
// current implementation of mul_u64_u64_div_u64
u64 old(u64 a, u64 b, u64 c)
{
	u64 res = 0, div, rem;
	int shift;

	/* can a * b overflow ? */
	if (ilog2(a) + ilog2(b) > 62) {
		/*
		 * Note that the algorithm after the if block below might lose
		 * some precision and the result is more exact for b > a. So
		 * exchange a and b if a is bigger than b.
		 *
		 * For example with a = 43980465100800, b = 100000000, c = 1000000000
		 * the below calculation doesn't modify b at all because div == 0
		 * and then shift becomes 45 + 26 - 62 = 9 and so the result
		 * becomes 4398035251080. However with a and b swapped the exact
		 * result is calculated (i.e. 4398046510080).
		 */
		if (a > b)
			swap(a, b);

		/*
		 * (b * a) / c is equal to
		 *
		 *      (b / c) * a +
		 *      (b % c) * a / c
		 *
		 * if nothing overflows. Can the 1st multiplication
		 * overflow? Yes, but we do not care: this can only
		 * happen if the end result can't fit in u64 anyway.
		 *
		 * So the code below does
		 *
		 *      res = (b / c) * a;
		 *      b = b % c;
		 */
		div = div64_u64_rem(b, c, &rem);
		res = div * a;
		b = rem;

		shift = ilog2(a) + ilog2(b) - 62;
		if (shift > 0) {
			/* drop precision */
			b >>= shift;
			c >>= shift;
			if (!c)
				return res;
		}
	}

	return res + div64_u64(a * b, c);
}
u64 new(u64 a, u64 b, u64 c)
{
	u64 res = 0, div, rem;

	/* can a * b overflow ? */
	while (ilog2(a) + ilog2(b) > 62) {
		if (a > b)
			swap(b, a);

		if (b >= c) {
			/*
			 * (b * a) / c is equal to
			 *
			 *      (b / c) * a +
			 *      (b % c) * a / c
			 *
			 * if nothing overflows. Can the 1st multiplication
			 * overflow? Yes, but we do not care: this can only
			 * happen if the end result can't fit in u64 anyway.
			 *
			 * So the code below does
			 *
			 *      res += (b / c) * a;
			 *      b = b % c;
			 */
			div = div64_u64_rem(b, c, &rem);
			res += div * a;
			b = rem;
			continue;
		}

		/* drop precision */
		b >>= 1;
		c >>= 1;
		if (!c)
			return res;
	}

	return res + div64_u64(a * b, c);
}
int main(int argc, char **argv)
{
	u64 a, b, c, ro, rn;
	double rd;

	assert(argc == 4);
	a = strtoull(argv[1], NULL, 0);
	b = strtoull(argv[2], NULL, 0);
	c = strtoull(argv[3], NULL, 0);

	rd = (((double)a) * b) / c;
	ro = old(a, b, c);
	rn = new(a, b, c);

	printf("old -> %llu\te=%f\n", ro, ro - rd);
	printf("new -> %llu\te=%f\n", rn, rn - rd);
	return 0;
}
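As a cross-check (not part of Oleg's program; it assumes the compiler
provides unsigned __int128, as gcc and clang do on x86-64), the exact
quotient can be computed with 128-bit arithmetic and used as a reference
for both old() and new():

static u64 exact(u64 a, u64 b, u64 c)
{
	/* a and b are below 2^64, so a * b cannot overflow 128 bits */
	return (u64)(((unsigned __int128)a * b) / c);
}

Printing exact(a, b, c) next to ro and rn in main() shows directly how far
each approximation is from the true result.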
Hello,
On Fri, Jul 26, 2024 at 12:44:29PM +0200, Oleg Nesterov wrote:
> On 07/26, Zheng Zucheng wrote:
> >
> > before call mul_u64_u64_div_u64(),
> > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
>
> So stime + utime == 175138003500000
>
> > after call mul_u64_u64_div_u64(),
> > stime = 135989949653530
>
> Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> returns 135989749728000 == rtime, see below.
>
> Nevermind...
>
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> > }
> >
> > stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> > +	/*
> > +	 * Because mul_u64_u64_div_u64() can approximate on some
> > +	 * architectures, enforce the constraint that: a*b/(b+c) <= a.
> > +	 */
> > +	if (unlikely(stime > rtime))
> > +		stime = rtime;
>
> Thanks,
>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
>
> -------------------------------------------------------------------------------
> But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
Note there is a patch by Nicolas Pitre currently in mm-nonmm-unstable
that makes mul_u64_u64_div_u64() precise. It was in next for a while as
commit 3cc8bf1a81ef ("mul_u64_u64_div_u64: make it precise always"),
which might explain difficulties reproducing the incorrect results.
An obvious alternative to backporting this change to
kernel/sched/cputime.c for stable is to backport Nicolas's patch
instead. Andrew asked me to provide a justification to send Nicolas's
patch for inclusion in the current devel cycle. So it might make it in
before 6.11.
Best regards
Uwe
On 07/26, Oleg Nesterov wrote:
>
> On 07/26, Zheng Zucheng wrote:
> >
> > before call mul_u64_u64_div_u64(),
> > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
>
> So stime + utime == 175138003500000
>
> > after call mul_u64_u64_div_u64(),
> > stime = 135989949653530
>
> Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> returns 135989749728000 == rtime, see below.

Seriously, can you re-check your numbers? it would be nice to understand why
x86_64 differs...

> But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
> See the new() function in the code below.

Just in case, the usage of ilog2 can be improved, but this is minor.

Oleg.
On Fri, Jul 26, 2024 at 03:04:01PM +0200, Oleg Nesterov wrote:
> On 07/26, Oleg Nesterov wrote:
> >
> > On 07/26, Zheng Zucheng wrote:
> > >
> > > before call mul_u64_u64_div_u64(),
> > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> >
> > So stime + utime == 175138003500000
> >
> > > after call mul_u64_u64_div_u64(),
> > > stime = 135989949653530
> >
> > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> > returns 135989749728000 == rtime, see below.
>
> Seriously, can you re-check your numbers? it would be nice to understand why
> x86_64 differs...

x86_64 has a custom mul_u64_u64_div_u64() implementation.

> > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
> > See the new() function in the code below.
>
> Just in case, the usage of ilog2 can be improved, but this is minor.

I meant to go look at this, it seems to loop less than your improved
version, but I'm chasing crashes atm. Perhaps it provides inspiration.

https://codebrowser.dev/llvm/compiler-rt/lib/builtins/udivmodti4.c.html#__udivmodti4
On 07/26, Peter Zijlstra wrote:
>
> On Fri, Jul 26, 2024 at 03:04:01PM +0200, Oleg Nesterov wrote:
> > On 07/26, Oleg Nesterov wrote:
> > >
> > > On 07/26, Zheng Zucheng wrote:
> > > >
> > > > before call mul_u64_u64_div_u64(),
> > > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> > >
> > > So stime + utime == 175138003500000
> > >
> > > > after call mul_u64_u64_div_u64(),
> > > > stime = 135989949653530
> > >
> > > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> > > returns 135989749728000 == rtime, see below.
> >
> > Seriously, can you re-check your numbers? it would be nice to understand why
> > x86_64 differs...
>
> x86_64 has a custom mul_u64_u64_div_u64() implementation.

Yes sure, but my user-space test-case uses the "generic" version.

> > > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
> > > See the new() function in the code below.
> >
> > Just in case, the usage of ilog2 can be improved, but this is minor.
>
> I meant to go look at this, it seems to loop less than your improved
> version, but I'm chasing crashes atm. Perhaps it provides inspiration.
>
> https://codebrowser.dev/llvm/compiler-rt/lib/builtins/udivmodti4.c.html#__udivmodti4

Thanks... I'll try to take a look tomorrow, but at first glance I will
never understand this (clever) code ;)

Oleg.
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 77baa5bafcbe1b2a15ef9c37232c21279c95481c
Gitweb: https://git.kernel.org/tip/77baa5bafcbe1b2a15ef9c37232c21279c95481c
Author: Zheng Zucheng <zhengzucheng@huawei.com>
AuthorDate: Fri, 26 Jul 2024 02:32:35
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:32 +02:00
sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
In extreme test scenarios:
the 14th field (utime) in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In cputime_adjust(), stime becomes greater than rtime due to a
mul_u64_u64_div_u64() precision problem.

Before calling mul_u64_u64_div_u64():
  stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
After calling mul_u64_u64_div_u64():
  stime = 135989949653530

An unsigned underflow occurs because rtime is less than stime:
  utime = rtime - stime = 135989749728000 - 135989949653530
        = -199925530
        = (u64)18446744073709518790

Trigger conditions:
  1). The user task runs in kernel mode most of the time
  2). ARM64 architecture
  3). TICK_CPU_ACCOUNTING=y
      CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

Fix the mul_u64_u64_div_u64() precision loss by resetting stime to rtime
when the scaled stime exceeds rtime.
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
---
kernel/sched/cputime.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a5e0029..0bed0fa 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 	}
 
 	stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
+	/*
+	 * Because mul_u64_u64_div_u64() can approximate on some
+	 * architectures, enforce the constraint that: a*b/(b+c) <= a.
+	 */
+	if (unlikely(stime > rtime))
+		stime = rtime;
 
 update:
 	/*