In extreme test scenarios:
the 14th field (utime) in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In the cputime_adjust() path, stime becomes greater than rtime due to a
mul_u64_u64_div_u64() precision problem.
before calling mul_u64_u64_div_u64():
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after calling mul_u64_u64_div_u64():
stime = 135989949653530

Unsigned underflow (reversion) occurs because rtime is less than stime:
utime = rtime - stime = 135989749728000 - 135989949653530
                      = -199925530
                      = (u64)18446744073709518790

Trigger scenario:
1. The user task runs in kernel mode most of the time.
2. The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y &&
   TICK_CPU_ACCOUNTING=y

Fix the mul_u64_u64_div_u64() conversion precision problem by resetting
stime to rtime.
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
---
kernel/sched/cputime.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index aa48b2ec879d..365c74e95537 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
}
stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
+ if (unlikely(stime > rtime))
+ stime = rtime;
update:
/*
--
2.34.1
In a cpuset subsystem, cpuset.cpus contains both isolated and non-isolated
CPUs. Is there any way to ensure that tasks run only on the non-isolated
CPUs?

e.g. isolcpus=1, cpuset.cpus=0-7. It is found that some tasks are scheduled
to cpu1.

In addition, a task running on an isolated CPU can't be scheduled to other
CPUs later.

Thanks!
On 9/1/24 21:56, zhengzucheng wrote:
> In a cpuset subsystem, cpuset.cpus contains both isolated and
> non-isolated CPUs. Is there any way to ensure that tasks run only on
> the non-isolated CPUs?
>
> e.g. isolcpus=1, cpuset.cpus=0-7. It is found that some tasks are
> scheduled to cpu1.
>
> In addition, a task running on an isolated CPU can't be scheduled to
> other CPUs later.

The best way is to avoid mixing isolated and scheduling CPUs in the same
cpuset, especially if you are using cgroup v1.

If you are using cgroup v2, one way to avoid the use of isolated CPUs is
to put all of them into an isolated partition. This will ensure that those
isolated CPUs won't be used even if they are put into the cpuset.cpus of
other cpusets accidentally.

Cheers,
Longman
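A minimal sketch of the cgroup v2 approach Longman describes (the cgroup
name and CPU number are illustrative; depending on kernel version the CPUs
may also need to be made exclusive via cpuset.cpus.exclusive before the
partition is accepted):

    # enable the cpuset controller for child cgroups
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control

    # put the isolated CPU(s) into their own cpuset and turn it into an
    # "isolated" partition; those CPUs then get no sched domain and are
    # not used by tasks in other cpusets
    mkdir /sys/fs/cgroup/isolated_set
    echo 1 > /sys/fs/cgroup/isolated_set/cpuset.cpus
    echo isolated > /sys/fs/cgroup/isolated_set/cpuset.cpus.partition

    # verify that the partition is valid
    cat /sys/fs/cgroup/isolated_set/cpuset.cpus.partition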
In a VM overcommitment scenario with a 1:2 overcommitment ratio, 8 CPUs
are overcommitted to 2 x 8-vCPU VMs: 16 vCPUs are bound to 8 CPUs.
However, one VM obtains only 2 CPUs' worth of resources while the other
VM gets 6 CPUs.
The host has 80 CPUs in one sched domain, and the other CPUs are idle.
The root cause is that the host load is unbalanced: some vCPUs occupy
CPU resources exclusively.
When the CPU that triggers load balancing calculates the imbalance value,
env->imbalance = 0 is computed because local->avg_load > sds->avg_load.
As a result, load balancing fails.
The processing logic:
https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
This is normal from the kernel load balancer's point of view, but it is
not reasonable from the perspective of VM users.
In cgroup v1, setting cpuset.sched_load_balance=0 to modify the sched
domain works around it.
Is there any other method to fix this problem? Thanks.
Abstracted reproduction case:
1.environment information:
[root@localhost ~]# cat /proc/schedstat
cpu0
domain0 00000000,00000000,00010000,00000000,00000001
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
cpu1
domain0 00000000,00000000,00020000,00000000,00000002
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
cpu2
domain0 00000000,00000000,00040000,00000000,00000004
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
cpu3
domain0 00000000,00000000,00080000,00000000,00000008
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
2.test case:
vcpu.c
#include <stdio.h>
#include <unistd.h>
int main()
{
	sleep(20);
	while (1);
	return 0;
}
gcc vcpu.c -o vcpu
-----------------------------------------------------------------
test.sh
#!/bin/bash
#vcpu1
mkdir /sys/fs/cgroup/cpuset/vcpu_1
echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
for i in {1..8}
do
./vcpu &
pid=$!
sleep 1
echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
done
#vcpu2
mkdir /sys/fs/cgroup/cpuset/vcpu_2
echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
for i in {1..8}
do
./vcpu &
pid=$!
sleep 1
echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
done
------------------------------------------------------------------
[root@localhost ~]# ./test.sh
[root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
On 9/13/24 00:03, zhengzucheng wrote:
> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
> CPUs are overcommitted to 2 x 8u VMs,
> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
> resources, the other VM has 6 CPUs.
> The host is configured with 80 CPUs in a sched domain and other CPUs
> are in the idle state.
> The root cause is that the load of the host is unbalanced, some vCPUs
> exclusively occupy CPU resources.
> when the CPU that triggers load balance calculates imbalance value,
> env->imbalance = 0 is calculated because of
> local->avg_load > sds->avg_load. As a result, the load balance fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
>
> It's normal from kernel load balance, but it's not reasonable from the
> perspective of VM users.
> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
> domain to fix it.
> Is there any other method to fix this problem? thanks.
>
> Abstracted reproduction case:
> 1.environment information:
>
> [root@localhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>
> 2.test case:
>
> vcpu.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main()
> {
> sleep(20);
> while (1);
> return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> #vcpu1
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
> ./vcpu &
> pid=$!
> sleep 1
> echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> #vcpu2
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
> ./vcpu &
> pid=$!
> sleep 1
> echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@localhost ~]# ./test.sh
>
> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73
> ./vcpu
> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71
> ./vcpu
> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72
> ./vcpu
> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72
> ./vcpu
> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72
> ./vcpu
> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72
> ./vcpu
> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
>
Your script creates two cpusets with the same set of CPUs. The
scheduling aspect of the tasks, however, is not controlled by cpuset;
it is controlled by the cpu cgroup. I suppose that all these tasks are in
the same cpu cgroup. It is possible that the commit you mentioned might
have caused some unfairness in allocating CPU time to different processes
within the same cpu cgroup. Maybe you can try to put them into separate
cpu cgroups as well, with equal weight, to see if that can improve the
scheduling fairness?

BTW, you don't actually need to use 2 different cpusets if they all get
the same set of CPUs and memory nodes. Also, setting
cpuset.sched_load_balance=0 may not actually get what you want unless
all the cpusets that use those CPUs have cpuset.sched_load_balance set
to 0, including the root cgroup. Turning off this flag may disable load
balancing, but it may not guarantee fairness depending on what CPUs are
being used by those tasks when they start, unless you explicitly assign
the CPUs to them when starting these tasks.

Cheers,
Longman
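A rough sketch of the separate-cpu-cgroup idea for the cgroup v1 reproducer
above (cgroup names and values are illustrative, not a tested recipe):

    # one cpu cgroup per VM, with equal weight
    mkdir /sys/fs/cgroup/cpu/vm_1 /sys/fs/cgroup/cpu/vm_2
    echo 1024 > /sys/fs/cgroup/cpu/vm_1/cpu.shares
    echo 1024 > /sys/fs/cgroup/cpu/vm_2/cpu.shares

    # move each VM's vcpu tasks into its own cpu cgroup
    for pid in $(cat /sys/fs/cgroup/cpuset/vcpu_1/tasks); do
            echo $pid > /sys/fs/cgroup/cpu/vm_1/tasks
    done
    for pid in $(cat /sys/fs/cgroup/cpuset/vcpu_2/tasks); do
            echo $pid > /sys/fs/cgroup/cpu/vm_2/tasks
    done

With equal cpu.shares, CFS group scheduling should give the two groups
roughly equal CPU time on the contended CPUs, independent of how the
individual threads end up spread by the load balancer.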
On 2024/9/14 1:17, Waiman Long wrote:
> you don't actually need to use 2 different cpusets if they all get the
> same set of CPUs and memory nodes

Yes, you're right. The purpose of setting two different cpusets is to
simulate the scenario of two VMs; each cpuset is a VM. For example, the VM
configuration is as follows:

  <domain type='kvm' id='12676'>
    <name>master</name>
    <vcpu placement='static' cpuset='0-3,80-83'>8</vcpu>
    <iothreads>1</iothreads>
    <iothreadids>
      <iothread id='1'/>
    </iothreadids>
    <cputune>
      <vcpupin vcpu='0' cpuset='0-3,80-83'/>
      <vcpupin vcpu='1' cpuset='0-3,80-83'/>
      <vcpupin vcpu='2' cpuset='0-3,80-83'/>
      <vcpupin vcpu='3' cpuset='0-3,80-83'/>
      <vcpupin vcpu='4' cpuset='0-3,80-83'/>
      <vcpupin vcpu='5' cpuset='0-3,80-83'/>
      <vcpupin vcpu='6' cpuset='0-3,80-83'/>
      <vcpupin vcpu='7' cpuset='0-3,80-83'/>
      <emulatorpin cpuset='0-79'/>
    </cputune>
    <numatune>
      <memory mode='strict' nodeset='0'/>
      <memnode cellid='0' mode='strict' nodeset='0'/>
    </numatune>
On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@huawei.com> wrote:
>
> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
> CPUs are overcommitted to 2 x 8u VMs,
> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
> resources, the other VM has 6 CPUs.
> The host is configured with 80 CPUs in a sched domain and other CPUs are
> in the idle state.
> The root cause is that the load of the host is unbalanced, some vCPUs
> exclusively occupy CPU resources.
> when the CPU that triggers load balance calculates imbalance value,
> env->imbalance = 0 is calculated because of
> local->avg_load > sds->avg_load. As a result, the load balance fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
>
> It's normal from kernel load balance, but it's not reasonable from the
> perspective of VM users.
> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
> domain to fix it.
> Is there any other method to fix this problem? thanks.
I'm not sure I understand your setup and why the load balancer is not
balancing the 16 vCPUs correctly between the 8 CPUs.
From your test case description below, you have 8 always-running
threads in cgroup A and 8 always-running threads in cgroup B, and the 2
cgroups have only 8 CPUs among 80. This should not be a problem for
load balance. I tried something similar, although not exactly the same,
with cgroupv2 and rt-app, and I don't see a noticeable imbalance.
Do you have more details that you can share about your system ?
Which kernel version are you using ? Which arch ?
>
> Abstracted reproduction case:
> 1.environment information:
>
> [root@localhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG ?
and cpu80-83 are in the other group of PKG ? and LLC is at domain1 level ?
>
> 2.test case:
>
> vcpu.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main()
> {
> sleep(20);
> while (1);
> return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> #vcpu1
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
> ./vcpu &
> pid=$!
> sleep 1
> echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> #vcpu2
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
> ./vcpu &
> pid=$!
> sleep 1
> echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@localhost ~]# ./test.sh
>
> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
>
On 2024/9/13 23:55, Vincent Guittot wrote:
> On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@huawei.com> wrote:
>> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
>> CPUs are overcommitted to 2 x 8u VMs,
>> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
>> resources, the other VM has 6 CPUs.
>> The host is configured with 80 CPUs in a sched domain and other CPUs are
>> in the idle state.
>> The root cause is that the load of the host is unbalanced, some vCPUs
>> exclusively occupy CPU resources.
>> when the CPU that triggers load balance calculates imbalance value,
>> env->imbalance = 0 is calculated because of
>> local->avg_load > sds->avg_load. As a result, the load balance fails.
>> The processing logic:
>> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>>
>>
>> It's normal from kernel load balance, but it's not reasonable from the
>> perspective of VM users.
>> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
>> domain to fix it.
>> Is there any other method to fix this problem? thanks.
> I'm not sure how to understand your setup and why the load balance is
> not balancing correctly 16 vCPU between the 8 CPUs.
>
> From your test case description below, you have 8 always running
> threads in cgroup A and 8 always running threads in cgroup B and the 2
> cgroups have only 8 CPUs among 80. This should not be a problem for
> load balance. I tried something similar although not exactly the same
> with cgroupv2 and rt-app and I don't have noticeable imbalance
>
> Do you have more details that you can share about your system ?
>
> Which kernel version are you using ? Which arch ?
kernel version: 6.11.0-rc7
arch: X86_64 and cgroup v1
>> Abstracted reproduction case:
>> 1.environment information:
>>
>> [root@localhost ~]# cat /proc/schedstat
>>
>> cpu0
>> domain0 00000000,00000000,00010000,00000000,00000001
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>> cpu1
>> domain0 00000000,00000000,00020000,00000000,00000002
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>> cpu2
>> domain0 00000000,00000000,00040000,00000000,00000004
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>> cpu3
>> domain0 00000000,00000000,00080000,00000000,00000008
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG ?
> and cpu80-83 are in the other group of PKG ? and LLC is at domain1 level ?
domain0 is SMT and domain1 is MC
thread_siblings_list:0,80. 1,81. 2,82. 3,83
LLC is at domain1 level
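For reference, the SMT sibling pairs listed above can be read from the CPU
topology exported in sysfs, e.g.:

    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
    0,80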
>> 2.test case:
>>
>> vcpu.c
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> int main()
>> {
>> sleep(20);
>> while (1);
>> return 0;
>> }
>>
>> gcc vcpu.c -o vcpu
>> -----------------------------------------------------------------
>> test.sh
>>
>> #!/bin/bash
>>
>> #vcpu1
>> mkdir /sys/fs/cgroup/cpuset/vcpu_1
>> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
>> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
>> for i in {1..8}
>> do
>> ./vcpu &
>> pid=$!
>> sleep 1
>> echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
>> done
>>
>> #vcpu2
>> mkdir /sys/fs/cgroup/cpuset/vcpu_2
>> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
>> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
>> for i in {1..8}
>> do
>> ./vcpu &
>> pid=$!
>> sleep 1
>> echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
>> done
>> ------------------------------------------------------------------
>> [root@localhost ~]# ./test.sh
>>
>> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>>
>> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
>> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
>> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
>> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
>> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
>> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
>> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
>> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
>> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
>> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
>> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
>> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
>> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
>> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
>> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
>> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
>>
> .
On Sat, 14 Sept 2024 at 09:04, zhengzucheng <zhengzucheng@huawei.com> wrote:
>
>
> On 2024/9/13 23:55, Vincent Guittot wrote:
> > On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@huawei.com> wrote:
> >> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
> >> CPUs are overcommitted to 2 x 8u VMs,
> >> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
> >> resources, the other VM has 6 CPUs.
> >> The host is configured with 80 CPUs in a sched domain and other CPUs are
> >> in the idle state.
> >> The root cause is that the load of the host is unbalanced, some vCPUs
> >> exclusively occupy CPU resources.
> >> when the CPU that triggers load balance calculates imbalance value,
> >> env->imbalance = 0 is calculated because of
> >> local->avg_load > sds->avg_load. As a result, the load balance fails.
> >> The processing logic:
> >> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
> >>
> >>
> >> It's normal from kernel load balance, but it's not reasonable from the
> >> perspective of VM users.
> >> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
> >> domain to fix it.
> >> Is there any other method to fix this problem? thanks.
> > I'm not sure how to understand your setup and why the load balance is
> > not balancing correctly 16 vCPU between the 8 CPUs.
> >
> > From your test case description below, you have 8 always running
> > threads in cgroup A and 8 always running threads in cgroup B and the 2
> > cgroups have only 8 CPUs among 80. This should not be a problem for
> > load balance. I tried something similar although not exactly the same
> > with cgroupv2 and rt-app and I don't have noticeable imbalance
> >
> > Do you have more details that you can share about your system ?
> >
> > Which kernel version are you using ? Which arch ?
>
> kernel version: 6.11.0-rc7
> arch: X86_64 and cgroup v1
okay
>
> >> Abstracted reproduction case:
> >> 1.environment information:
> >>
> >> [root@localhost ~]# cat /proc/schedstat
> >>
> >> cpu0
> >> domain0 00000000,00000000,00010000,00000000,00000001
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu1
> >> domain0 00000000,00000000,00020000,00000000,00000002
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu2
> >> domain0 00000000,00000000,00040000,00000000,00000004
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu3
> >> domain0 00000000,00000000,00080000,00000000,00000008
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> > Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG ?
> > and cpu80-83 are in the other group of PKG ? and LLC is at domain1 level ?
>
> domain0 is SMT and domain1 is MC
> thread_siblings_list:0,80. 1,81. 2,82. 3,83
Yeah, I should have read more carefully the domain0 cpumask
> LLC is at domain1 level
>
> >> 2.test case:
> >>
> >> vcpu.c
> >> #include <stdio.h>
> >> #include <unistd.h>
> >>
> >> int main()
> >> {
> >> sleep(20);
> >> while (1);
> >> return 0;
> >> }
> >>
> >> gcc vcpu.c -o vcpu
> >> -----------------------------------------------------------------
> >> test.sh
> >>
> >> #!/bin/bash
> >>
> >> #vcpu1
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> >> for i in {1..8}
> >> do
> >> ./vcpu &
> >> pid=$!
> >> sleep 1
> >> echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> >> done
> >>
> >> #vcpu2
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> >> for i in {1..8}
> >> do
> >> ./vcpu &
> >> pid=$!
> >> sleep 1
> >> echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> >> done
> >> ------------------------------------------------------------------
> >> [root@localhost ~]# ./test.sh
> >>
> >> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
> >>
> >> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> >> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> >> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
> >> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
> >> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
> >> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
> >> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> >> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> >> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> >> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> >> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> >> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> >> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> >> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> >> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> >> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
> >>
So I finally understood your situation. The limited cpuset screws up
the avg load of the system for domain1. The group_imbalanced state is
there to try to fix an imbalanced situation related to tasks that are
pinned to a subset of CPUs of the sched domain, but this can't cover
all cases.
> > .
On Thu, Jul 25, 2024 at 12:03:15PM +0000, Zheng Zucheng wrote:
> In extreme test scenarios:
> the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
> utime = 18446744073709518790 ns, rtime = 135989749728000 ns
>
> In cputime_adjust() process, stime is greater than rtime due to
> mul_u64_u64_div_u64() precision problem.
> before call mul_u64_u64_div_u64(),
> stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> after call mul_u64_u64_div_u64(),
> stime = 135989949653530
>
> unsigned reversion occurs because rtime is less than stime.
> utime = rtime - stime = 135989749728000 - 135989949653530
> = -199925530
> = (u64)18446744073709518790
>
> Trigger scenario:
> 1. User task run in kernel mode most of time.
> 2. The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y &&
> TICK_CPU_ACCOUNTING=y
>
> Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
>
> Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
> Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
> ---
> kernel/sched/cputime.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index aa48b2ec879d..365c74e95537 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> }
>
> stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> + if (unlikely(stime > rtime))
> + stime = rtime;
But but but... for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y the code you're
patching is not compiled!
Sorry, I made a mistake here. CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set.
On 2024/7/25 22:05, Peter Zijlstra wrote:
> On Thu, Jul 25, 2024 at 12:03:15PM +0000, Zheng Zucheng wrote:
>> In extreme test scenarios:
>> the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
>> utime = 18446744073709518790 ns, rtime = 135989749728000 ns
>>
>> In cputime_adjust() process, stime is greater than rtime due to
>> mul_u64_u64_div_u64() precision problem.
>> before call mul_u64_u64_div_u64(),
>> stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
>> after call mul_u64_u64_div_u64(),
>> stime = 135989949653530
>>
>> unsigned reversion occurs because rtime is less than stime.
>> utime = rtime - stime = 135989749728000 - 135989949653530
>> = -199925530
>> = (u64)18446744073709518790
>>
>> Trigger scenario:
>> 1. User task run in kernel mode most of time.
>> 2. The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y &&
>> TICK_CPU_ACCOUNTING=y
>>
>> Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
>>
>> Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
>> Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
>> ---
>> kernel/sched/cputime.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> index aa48b2ec879d..365c74e95537 100644
>> --- a/kernel/sched/cputime.c
>> +++ b/kernel/sched/cputime.c
>> @@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
>> }
>>
>> stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
>> + if (unlikely(stime > rtime))
>> + stime = rtime;
> But but but... for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y the code you're
> patching is not compiled!
>
>
> .
On Thu, Jul 25, 2024 at 10:49:46PM +0800, zhengzucheng wrote:
> Sorry, I made a mistake here. CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set.
>
> On 2024/7/25 22:05, Peter Zijlstra wrote:
> > On Thu, Jul 25, 2024 at 12:03:15PM +0000, Zheng Zucheng wrote:
> > > In extreme test scenarios:
> > > the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
> > > utime = 18446744073709518790 ns, rtime = 135989749728000 ns
> > >
> > > In cputime_adjust() process, stime is greater than rtime due to
> > > mul_u64_u64_div_u64() precision problem.
> > > before call mul_u64_u64_div_u64(),
> > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> > > after call mul_u64_u64_div_u64(),
> > > stime = 135989949653530
> > >
> > > unsigned reversion occurs because rtime is less than stime.
> > > utime = rtime - stime = 135989749728000 - 135989949653530
> > > = -199925530
> > > = (u64)18446744073709518790
> > >
> > > Trigger scenario:
> > > 1. User task run in kernel mode most of time.
> > > 2. The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y &&
> > > TICK_CPU_ACCOUNTING=y
> > >
> > > Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
> > >
> > > Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
> > > Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
> > > ---
> > > kernel/sched/cputime.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > > index aa48b2ec879d..365c74e95537 100644
> > > --- a/kernel/sched/cputime.c
> > > +++ b/kernel/sched/cputime.c
> > > @@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> > > }
> > > stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> > > + if (unlikely(stime > rtime))
> > > + stime = rtime;
Ooh,.. I see, this is because the generic fallback for
mul_u64_u64_div_u64() is yuck :/
On x86_64 this is just two instructions and it does a native:
u64*u64->u128
u128/u64->u64
And this should never happen. But in the generic case, we approximate and
urgh.
So yeah, but then perhaps add a comment like:
/*
* Because mul_u64_u64_div_u64() can approximate on some
* achitectures; enforce the constraint that: a*b/(b+c) <= a.
*/
if (unlikely(stime > rtime))
stime = rtime;
Also, I would look into doing a native arm64 version, I'd be surprised
if it could not do better than the generic variant.
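For illustration only, a minimal user-space sketch (not the in-tree code)
of what a full-precision variant looks like when the compiler provides
128-bit integers -- effectively what the two x86_64 instructions compute,
and what a native arm64 version could build on:

    #include <stdint.h>

    /*
     * Sketch: exact a * b / c via a 64x64->128 multiply and one
     * 128-by-64 division. Assumes the true quotient fits in 64 bits,
     * as it does for the cputime_adjust() call, so the stime > rtime
     * case cannot arise from rounding.
     */
    static uint64_t mul_u64_u64_div_u64_exact(uint64_t a, uint64_t b, uint64_t c)
    {
            return (uint64_t)(((unsigned __int128)a * b) / c);
    }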
In extreme test scenarios:
the 14th field (utime) in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In the cputime_adjust() path, stime becomes greater than rtime due to a
mul_u64_u64_div_u64() precision problem.
before calling mul_u64_u64_div_u64():
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after calling mul_u64_u64_div_u64():
stime = 135989949653530

Unsigned underflow (reversion) occurs because rtime is less than stime:
utime = rtime - stime = 135989749728000 - 135989949653530
                      = -199925530
                      = (u64)18446744073709518790

Trigger condition:
1). The user task runs in kernel mode most of the time
2). ARM64 architecture
3). TICK_CPU_ACCOUNTING=y
    CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

Fix the mul_u64_u64_div_u64() conversion precision problem by resetting
stime to rtime.
v2:
- Add comment
- Update trigger condition
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
---
kernel/sched/cputime.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index aa48b2ec879d..4feef0d4e449 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
}
stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
+ /*
+ * Because mul_u64_u64_div_u64() can approximate on some
+ * achitectures; enforce the constraint that: a*b/(b+c) <= a.
+ */
+ if (unlikely(stime > rtime))
+ stime = rtime;
update:
/*
--
2.34.1
On 07/26, Zheng Zucheng wrote:
>
> before call mul_u64_u64_div_u64(),
> stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
So stime + utime == 175138003500000
> after call mul_u64_u64_div_u64(),
> stime = 135989949653530
Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
returns 135989749728000 == rtime, see below.
Nevermind...
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> }
>
> stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> + /*
> + * Because mul_u64_u64_div_u64() can approximate on some
> + * achitectures; enforce the constraint that: a*b/(b+c) <= a.
> + */
> + if (unlikely(stime > rtime))
> + stime = rtime;
Thanks,
Acked-by: Oleg Nesterov <oleg@redhat.com>
-------------------------------------------------------------------------------
But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
See the new() function in the code below.
Say, with the numbers above I get
$ ./test 175136586720000 135989749728000 175138003500000
old -> 135989749728000 e=1100089950.609375
new -> 135988649638050 e=0.609375
Oleg.
-------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
typedef unsigned long long u64;
static inline int fls64(u64 x)
{
	int bitpos = -1;
	/*
	 * AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the
	 * dest reg is undefined if x==0, but their CPU architect says its
	 * value is written to set it to the same as before.
	 */
	asm("bsrq %1,%q0"
	    : "+r" (bitpos)
	    : "rm" (x));
	return bitpos + 1;
}

static inline int ilog2(u64 n)
{
	return fls64(n) - 1;
}

#define swap(a, b) \
	do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

static inline u64 div64_u64_rem(u64 dividend, u64 divisor, u64 *remainder)
{
	*remainder = dividend % divisor;
	return dividend / divisor;
}

static inline u64 div64_u64(u64 dividend, u64 divisor)
{
	return dividend / divisor;
}

//-----------------------------------------------------------------------------
// current implementation of mul_u64_u64_div_u64
u64 old(u64 a, u64 b, u64 c)
{
	u64 res = 0, div, rem;
	int shift;

	/* can a * b overflow ? */
	if (ilog2(a) + ilog2(b) > 62) {
		/*
		 * Note that the algorithm after the if block below might lose
		 * some precision and the result is more exact for b > a. So
		 * exchange a and b if a is bigger than b.
		 *
		 * For example with a = 43980465100800, b = 100000000, c = 1000000000
		 * the below calculation doesn't modify b at all because div == 0
		 * and then shift becomes 45 + 26 - 62 = 9 and so the result
		 * becomes 4398035251080. However with a and b swapped the exact
		 * result is calculated (i.e. 4398046510080).
		 */
		if (a > b)
			swap(a, b);

		/*
		 * (b * a) / c is equal to
		 *
		 *	(b / c) * a +
		 *	(b % c) * a / c
		 *
		 * if nothing overflows. Can the 1st multiplication
		 * overflow? Yes, but we do not care: this can only
		 * happen if the end result can't fit in u64 anyway.
		 *
		 * So the code below does
		 *
		 *	res = (b / c) * a;
		 *	b = b % c;
		 */
		div = div64_u64_rem(b, c, &rem);
		res = div * a;
		b = rem;

		shift = ilog2(a) + ilog2(b) - 62;
		if (shift > 0) {
			/* drop precision */
			b >>= shift;
			c >>= shift;
			if (!c)
				return res;
		}
	}

	return res + div64_u64(a * b, c);
}

u64 new(u64 a, u64 b, u64 c)
{
	u64 res = 0, div, rem;

	/* can a * b overflow ? */
	while (ilog2(a) + ilog2(b) > 62) {
		if (a > b)
			swap(b, a);

		if (b >= c) {
			/*
			 * (b * a) / c is equal to
			 *
			 *	(b / c) * a +
			 *	(b % c) * a / c
			 *
			 * if nothing overflows. Can the 1st multiplication
			 * overflow? Yes, but we do not care: this can only
			 * happen if the end result can't fit in u64 anyway.
			 *
			 * So the code below does
			 *
			 *	res += (b / c) * a;
			 *	b = b % c;
			 */
			div = div64_u64_rem(b, c, &rem);
			res += div * a;
			b = rem;
			continue;
		}

		/* drop precision */
		b >>= 1;
		c >>= 1;
		if (!c)
			return res;
	}

	return res + div64_u64(a * b, c);
}

int main(int argc, char **argv)
{
	u64 a, b, c, ro, rn;
	double rd;

	assert(argc == 4);
	a = strtoull(argv[1], NULL, 0);
	b = strtoull(argv[2], NULL, 0);
	c = strtoull(argv[3], NULL, 0);

	rd = (((double)a) * b) / c;
	ro = old(a, b, c);
	rn = new(a, b, c);

	printf("old -> %lld\te=%f\n", ro, ro - rd);
	printf("new -> %lld\te=%f\n", rn, rn - rd);

	return 0;
}
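A possible way to build and run the program above, matching the ./test
invocation shown earlier (the source file name is an assumption, and the
inline bsrq asm makes it x86_64-only):

    gcc -O2 test.c -o test
    ./test 175136586720000 135989749728000 175138003500000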
Hello,
On Fri, Jul 26, 2024 at 12:44:29PM +0200, Oleg Nesterov wrote:
> On 07/26, Zheng Zucheng wrote:
> >
> > before call mul_u64_u64_div_u64(),
> > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
>
> So stime + utime == 175138003500000
>
> > after call mul_u64_u64_div_u64(),
> > stime = 135989949653530
>
> Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> returns 135989749728000 == rtime, see below.
>
> Nevermind...
>
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> > }
> >
> > stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> > + /*
> > + * Because mul_u64_u64_div_u64() can approximate on some
> > + * achitectures; enforce the constraint that: a*b/(b+c) <= a.
> > + */
> > + if (unlikely(stime > rtime))
> > + stime = rtime;
>
> Thanks,
>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
>
> -------------------------------------------------------------------------------
> But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
Note there is a patch by Nicolas Pitre currently in mm-nonmm-unstable
that makes mul_u64_u64_div_u64() precise. It was in next for a while as
commit 3cc8bf1a81ef ("mul_u64_u64_div_u64: make it precise always")
which might explain problems to reproduce the incorrect results.
An obvious alternative to backporting this change to
kernel/sched/cputime.c for stable is to backport Nicolas's patch
instead. Andrew asked me to provide a justification to send Nicolas's
patch for inclusion in the current devel cycle. So it might make it in
before 6.11.
Best regards
Uwe
On 07/26, Oleg Nesterov wrote:
>
> On 07/26, Zheng Zucheng wrote:
> >
> > before call mul_u64_u64_div_u64(),
> > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
>
> So stime + utime == 175138003500000
>
> > after call mul_u64_u64_div_u64(),
> > stime = 135989949653530
>
> Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> returns 135989749728000 == rtime, see below.

Seriously, can you re-check your numbers? it would be nice to understand why
x86_64 differs...

> But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
> See the new() function in the code below.

Just in case, the usage of ilog2 can be improved, but this is minor.

Oleg.
On Fri, Jul 26, 2024 at 03:04:01PM +0200, Oleg Nesterov wrote:
> On 07/26, Oleg Nesterov wrote:
> >
> > On 07/26, Zheng Zucheng wrote:
> > >
> > > before call mul_u64_u64_div_u64(),
> > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> >
> > So stime + utime == 175138003500000
> >
> > > after call mul_u64_u64_div_u64(),
> > > stime = 135989949653530
> >
> > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> > returns 135989749728000 == rtime, see below.
>
> Seriously, can you re-check your numbers? it would be nice to understand why
> x86_64 differs...

x86_64 has a custom mul_u64_u64_div_u64() implementation.

> > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
> > See the new() function in the code below.
>
> Just in case, the usage of ilog2 can be improved, but this is minor.

I meant to go look at this, it seems to loop less than your improved
version, but I'm chasing crashes atm. Perhaps it provides inspiration.

https://codebrowser.dev/llvm/compiler-rt/lib/builtins/udivmodti4.c.html#__udivmodti4
On 07/26, Peter Zijlstra wrote:
>
> On Fri, Jul 26, 2024 at 03:04:01PM +0200, Oleg Nesterov wrote:
> > On 07/26, Oleg Nesterov wrote:
> > >
> > > On 07/26, Zheng Zucheng wrote:
> > > >
> > > > before call mul_u64_u64_div_u64(),
> > > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> > >
> > > So stime + utime == 175138003500000
> > >
> > > > after call mul_u64_u64_div_u64(),
> > > > stime = 135989949653530
> > >
> > > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> > > returns 135989749728000 == rtime, see below.
> >
> > Seriously, can you re-check your numbers? it would be nice to understand why
> > x86_64 differs...
>
> x86_64 has a custom mul_u64_u64_div_u64() implementation.

Yes sure, but my user-space test-case uses the "generic" version,

> > > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
> > > See the new() function in the code below.
> >
> > Just in case, the usage of ilog2 can be improved, but this is minor.
>
> I meant to go look at this, it seems to loop less than your improved
> version, but I'm chasing crashes atm. Perhaps it provides inspiration.
>
> https://codebrowser.dev/llvm/compiler-rt/lib/builtins/udivmodti4.c.html#__udivmodti4

Thanks... I'll try to take a look tomorrow, but at first glance I will
never understand this (clever) code ;)

Oleg.
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 77baa5bafcbe1b2a15ef9c37232c21279c95481c
Gitweb: https://git.kernel.org/tip/77baa5bafcbe1b2a15ef9c37232c21279c95481c
Author: Zheng Zucheng <zhengzucheng@huawei.com>
AuthorDate: Fri, 26 Jul 2024 02:32:35
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:32 +02:00
sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
In extreme test scenarios:
the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns
In cputime_adjust() process, stime is greater than rtime due to
mul_u64_u64_div_u64() precision problem.
before call mul_u64_u64_div_u64(),
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after call mul_u64_u64_div_u64(),
stime = 135989949653530
unsigned reversion occurs because rtime is less than stime.
utime = rtime - stime = 135989749728000 - 135989949653530
= -199925530
= (u64)18446744073709518790
Trigger condition:
1). User task run in kernel mode most of time
2). ARM64 architecture
3). TICK_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
---
kernel/sched/cputime.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a5e0029..0bed0fa 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
}
stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
+ /*
+ * Because mul_u64_u64_div_u64() can approximate on some
+ * achitectures; enforce the constraint that: a*b/(b+c) <= a.
+ */
+ if (unlikely(stime > rtime))
+ stime = rtime;
update:
/*