In extreme test scenarios:
the 14th field (utime) in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In the cputime_adjust() path, stime becomes greater than rtime due to a
mul_u64_u64_div_u64() precision problem.
before calling mul_u64_u64_div_u64():
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after calling mul_u64_u64_div_u64():
stime = 135989949653530

Unsigned underflow (reversion) occurs because rtime is less than stime:
utime = rtime - stime = 135989749728000 - 135989949653530
                      = -199925530
                      = (u64)18446744073709518790

Trigger scenario:
1. The user task runs in kernel mode most of the time.
2. The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y &&
   TICK_CPU_ACCOUNTING=y

Fix the mul_u64_u64_div_u64() conversion precision problem by resetting
stime to rtime.
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
---
kernel/sched/cputime.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index aa48b2ec879d..365c74e95537 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
}
stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
+ if (unlikely(stime > rtime))
+ stime = rtime;
update:
/*
--
2.34.1
In a cpuset subsystem, cpuset.cpus contains both isolated and non-isolated
CPUs. Is there any way to ensure that tasks run only on the non-isolated
CPUs?

e.g. isolcpus=1, cpuset.cpus=0-7. It is found that some tasks are scheduled
to cpu1.

In addition, a task running on an isolated CPU can't be scheduled to other
CPUs later.

Thanks!
On 9/1/24 21:56, zhengzucheng wrote:
> In a cpuset subsystem, cpuset.cpus contains both isolated and
> non-isolated CPUs. Is there any way to ensure that tasks run only on
> the non-isolated CPUs?
>
> e.g. isolcpus=1, cpuset.cpus=0-7. It is found that some tasks are
> scheduled to cpu1.
>
> In addition, a task running on an isolated CPU can't be scheduled to
> other CPUs later.

The best way is to avoid mixing isolated and scheduling CPUs in the same
cpuset, especially if you are using cgroup v1.

If you are using cgroup v2, one way to avoid the use of isolated CPUs is
to put all of them into an isolated partition. This will ensure that those
isolated CPUs won't be used even if they are put into the cpuset.cpus of
other cpusets accidentally.

Cheers,
Longman
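A minimal sketch of the cgroup v2 approach Longman describes (the cgroup
name and CPU number are illustrative; depending on kernel version the CPUs
may also need to be made exclusive via cpuset.cpus.exclusive before the
partition is accepted):

    # enable the cpuset controller for child cgroups
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control

    # put the isolated CPU(s) into their own cpuset and turn it into an
    # "isolated" partition; those CPUs then get no sched domain and are
    # not used by tasks in other cpusets
    mkdir /sys/fs/cgroup/isolated_set
    echo 1 > /sys/fs/cgroup/isolated_set/cpuset.cpus
    echo isolated > /sys/fs/cgroup/isolated_set/cpuset.cpus.partition

    # verify that the partition is valid
    cat /sys/fs/cgroup/isolated_set/cpuset.cpus.partition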
In a VM overcommitment scenario with a 1:2 overcommitment ratio, 8 CPUs
are overcommitted to 2 x 8-vCPU VMs: 16 vCPUs are bound to 8 CPUs.
However, one VM obtains only 2 CPUs' worth of resources while the other
VM gets 6 CPUs.
The host has 80 CPUs in one sched domain, and the other CPUs are idle.
The root cause is that the host load is unbalanced: some vCPUs occupy
CPU resources exclusively.
When the CPU that triggers load balancing calculates the imbalance value,
env->imbalance = 0 is computed because local->avg_load > sds->avg_load.
As a result, load balancing fails.
The processing logic:
https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
This is normal from the kernel load balancer's point of view, but it is
not reasonable from the perspective of VM users.
In cgroup v1, setting cpuset.sched_load_balance=0 to modify the sched
domain works around it.
Is there any other method to fix this problem? Thanks.
Abstracted reproduction case:
1.environment information:
[root@localhost ~]# cat /proc/schedstat
cpu0
domain0 00000000,00000000,00010000,00000000,00000001
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
cpu1
domain0 00000000,00000000,00020000,00000000,00000002
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
cpu2
domain0 00000000,00000000,00040000,00000000,00000004
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
cpu3
domain0 00000000,00000000,00080000,00000000,00000008
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
2.test case:
vcpu.c
#include <stdio.h>
#include <unistd.h>
int main()
{
	sleep(20);
	while (1);
	return 0;
}
gcc vcpu.c -o vcpu
-----------------------------------------------------------------
test.sh
#!/bin/bash
#vcpu1
mkdir /sys/fs/cgroup/cpuset/vcpu_1
echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
for i in {1..8}
do
./vcpu &
pid=$!
sleep 1
echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
done
#vcpu2
mkdir /sys/fs/cgroup/cpuset/vcpu_2
echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
for i in {1..8}
do
./vcpu &
pid=$!
sleep 1
echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
done
------------------------------------------------------------------
[root@localhost ~]# ./test.sh
[root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
On 9/13/24 00:03, zhengzucheng wrote:
> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
> CPUs are overcommitted to 2 x 8u VMs,
> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
> resources, the other VM has 6 CPUs.
> The host is configured with 80 CPUs in a sched domain and other CPUs
> are in the idle state.
> The root cause is that the load of the host is unbalanced, some vCPUs
> exclusively occupy CPU resources.
> when the CPU that triggers load balance calculates imbalance value,
> env->imbalance = 0 is calculated because of
> local->avg_load > sds->avg_load. As a result, the load balance fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
>
> It's normal from kernel load balance, but it's not reasonable from the
> perspective of VM users.
> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
> domain to fix it.
> Is there any other method to fix this problem? thanks.
>
> Abstracted reproduction case:
> 1.environment information:
>
> [root@localhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>
> 2.test case:
>
> vcpu.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main()
> {
> sleep(20);
> while (1);
> return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> #vcpu1
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
> ./vcpu &
> pid=$!
> sleep 1
> echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> #vcpu2
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
> ./vcpu &
> pid=$!
> sleep 1
> echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@localhost ~]# ./test.sh
>
> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73
> ./vcpu
> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71
> ./vcpu
> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72
> ./vcpu
> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72
> ./vcpu
> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72
> ./vcpu
> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72
> ./vcpu
> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
>
Your script creates two cpusets with the same set of CPUs. The
scheduling aspect of the tasks, however, is not controlled by cpuset;
it is controlled by the cpu cgroup. I suppose that all these tasks are in
the same cpu cgroup. It is possible that the commit you mentioned might
have caused some unfairness in allocating CPU time to different processes
within the same cpu cgroup. Maybe you can try to put them into separate
cpu cgroups as well, with equal weight, to see if that can improve the
scheduling fairness?

BTW, you don't actually need to use 2 different cpusets if they all get
the same set of CPUs and memory nodes. Also, setting
cpuset.sched_load_balance=0 may not actually get what you want unless
all the cpusets that use those CPUs have cpuset.sched_load_balance set
to 0, including the root cgroup. Turning off this flag may disable load
balancing, but it may not guarantee fairness depending on what CPUs are
being used by those tasks when they start, unless you explicitly assign
the CPUs to them when starting these tasks.

Cheers,
Longman
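A rough sketch of the separate-cpu-cgroup idea for the cgroup v1 reproducer
above (cgroup names and values are illustrative, not a tested recipe):

    # one cpu cgroup per VM, with equal weight
    mkdir /sys/fs/cgroup/cpu/vm_1 /sys/fs/cgroup/cpu/vm_2
    echo 1024 > /sys/fs/cgroup/cpu/vm_1/cpu.shares
    echo 1024 > /sys/fs/cgroup/cpu/vm_2/cpu.shares

    # move each VM's vcpu tasks into its own cpu cgroup
    for pid in $(cat /sys/fs/cgroup/cpuset/vcpu_1/tasks); do
            echo $pid > /sys/fs/cgroup/cpu/vm_1/tasks
    done
    for pid in $(cat /sys/fs/cgroup/cpuset/vcpu_2/tasks); do
            echo $pid > /sys/fs/cgroup/cpu/vm_2/tasks
    done

With equal cpu.shares, CFS group scheduling should give the two groups
roughly equal CPU time on the contended CPUs, independent of how the
individual threads end up spread by the load balancer.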
On 2024/9/14 1:17, Waiman Long wrote:
> you don't actually need to use 2 different cpusets if they all get the
> same set of CPUs and memory nodes

Yes, you're right. The purpose of setting two different cpusets is to
simulate the scenario of two VMs; each cpuset is a VM. For example, the VM
configuration is as follows:

  <domain type='kvm' id='12676'>
    <name>master</name>
    <vcpu placement='static' cpuset='0-3,80-83'>8</vcpu>
    <iothreads>1</iothreads>
    <iothreadids>
      <iothread id='1'/>
    </iothreadids>
    <cputune>
      <vcpupin vcpu='0' cpuset='0-3,80-83'/>
      <vcpupin vcpu='1' cpuset='0-3,80-83'/>
      <vcpupin vcpu='2' cpuset='0-3,80-83'/>
      <vcpupin vcpu='3' cpuset='0-3,80-83'/>
      <vcpupin vcpu='4' cpuset='0-3,80-83'/>
      <vcpupin vcpu='5' cpuset='0-3,80-83'/>
      <vcpupin vcpu='6' cpuset='0-3,80-83'/>
      <vcpupin vcpu='7' cpuset='0-3,80-83'/>
      <emulatorpin cpuset='0-79'/>
    </cputune>
    <numatune>
      <memory mode='strict' nodeset='0'/>
      <memnode cellid='0' mode='strict' nodeset='0'/>
    </numatune>
On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@huawei.com> wrote:
>
> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
> CPUs are overcommitted to 2 x 8u VMs,
> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
> resources, the other VM has 6 CPUs.
> The host is configured with 80 CPUs in a sched domain and other CPUs are
> in the idle state.
> The root cause is that the load of the host is unbalanced, some vCPUs
> exclusively occupy CPU resources.
> when the CPU that triggers load balance calculates imbalance value,
> env->imbalance = 0 is calculated because of
> local->avg_load > sds->avg_load. As a result, the load balance fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
>
> It's normal from kernel load balance, but it's not reasonable from the
> perspective of VM users.
> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
> domain to fix it.
> Is there any other method to fix this problem? thanks.
I'm not sure I understand your setup and why the load balancer is not
balancing the 16 vCPUs correctly between the 8 CPUs.
From your test case description below, you have 8 always-running
threads in cgroup A and 8 always-running threads in cgroup B, and the 2
cgroups have only 8 CPUs among 80. This should not be a problem for
load balance. I tried something similar, although not exactly the same,
with cgroupv2 and rt-app, and I don't see a noticeable imbalance.
Do you have more details that you can share about your system ?
Which kernel version are you using ? Which arch ?
>
> Abstracted reproduction case:
> 1.environment information:
>
> [root@localhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG ?
and cpu80-83 are in the other group of PKG ? and LLC is at domain1 level ?
>
> 2.test case:
>
> vcpu.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main()
> {
> sleep(20);
> while (1);
> return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> #vcpu1
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
> ./vcpu &
> pid=$!
> sleep 1
> echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> #vcpu2
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
> ./vcpu &
> pid=$!
> sleep 1
> echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@localhost ~]# ./test.sh
>
> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
>
On 2024/9/13 23:55, Vincent Guittot wrote:
> On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@huawei.com> wrote:
>> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
>> CPUs are overcommitted to 2 x 8u VMs,
>> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
>> resources, the other VM has 6 CPUs.
>> The host is configured with 80 CPUs in a sched domain and other CPUs are
>> in the idle state.
>> The root cause is that the load of the host is unbalanced, some vCPUs
>> exclusively occupy CPU resources.
>> when the CPU that triggers load balance calculates imbalance value,
>> env->imbalance = 0 is calculated because of
>> local->avg_load > sds->avg_load. As a result, the load balance fails.
>> The processing logic:
>> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>>
>>
>> It's normal from kernel load balance, but it's not reasonable from the
>> perspective of VM users.
>> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
>> domain to fix it.
>> Is there any other method to fix this problem? thanks.
> I'm not sure how to understand your setup and why the load balance is
> not balancing correctly 16 vCPU between the 8 CPUs.
>
> From your test case description below, you have 8 always running
> threads in cgroup A and 8 always running threads in cgroup B and the 2
> cgroups have only 8 CPUs among 80. This should not be a problem for
> load balance. I tried something similar although not exactly the same
> with cgroupv2 and rt-app and I don't have noticeable imbalance
>
> Do you have more details that you can share about your system ?
>
> Which kernel version are you using ? Which arch ?
kernel version: 6.11.0-rc7
arch: X86_64 and cgroup v1
>> Abstracted reproduction case:
>> 1.environment information:
>>
>> [root@localhost ~]# cat /proc/schedstat
>>
>> cpu0
>> domain0 00000000,00000000,00010000,00000000,00000001
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>> cpu1
>> domain0 00000000,00000000,00020000,00000000,00000002
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>> cpu2
>> domain0 00000000,00000000,00040000,00000000,00000004
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>> cpu3
>> domain0 00000000,00000000,00080000,00000000,00000008
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG ?
> and cpu80-83 are in the other group of PKG ? and LLC is at domain1 level ?
domain0 is SMT and domain1 is MC
thread_siblings_list:0,80. 1,81. 2,82. 3,83
LLC is at domain1 level
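For reference, the SMT sibling pairs listed above can be read from the CPU
topology exported in sysfs, e.g.:

    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
    0,80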
>> 2.test case:
>>
>> vcpu.c
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> int main()
>> {
>> sleep(20);
>> while (1);
>> return 0;
>> }
>>
>> gcc vcpu.c -o vcpu
>> -----------------------------------------------------------------
>> test.sh
>>
>> #!/bin/bash
>>
>> #vcpu1
>> mkdir /sys/fs/cgroup/cpuset/vcpu_1
>> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
>> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
>> for i in {1..8}
>> do
>> ./vcpu &
>> pid=$!
>> sleep 1
>> echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
>> done
>>
>> #vcpu2
>> mkdir /sys/fs/cgroup/cpuset/vcpu_2
>> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
>> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
>> for i in {1..8}
>> do
>> ./vcpu &
>> pid=$!
>> sleep 1
>> echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
>> done
>> ------------------------------------------------------------------
>> [root@localhost ~]# ./test.sh
>>
>> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>>
>> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
>> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
>> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
>> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
>> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
>> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
>> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
>> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
>> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
>> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
>> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
>> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
>> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
>> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
>> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
>> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
>>
> .
On Sat, 14 Sept 2024 at 09:04, zhengzucheng <zhengzucheng@huawei.com> wrote:
>
>
> On 2024/9/13 23:55, Vincent Guittot wrote:
> > On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@huawei.com> wrote:
> >> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
> >> CPUs are overcommitted to 2 x 8u VMs,
> >> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
> >> resources, the other VM has 6 CPUs.
> >> The host is configured with 80 CPUs in a sched domain and other CPUs are
> >> in the idle state.
> >> The root cause is that the load of the host is unbalanced, some vCPUs
> >> exclusively occupy CPU resources.
> >> when the CPU that triggers load balance calculates imbalance value,
> >> env->imbalance = 0 is calculated because of
> >> local->avg_load > sds->avg_load. As a result, the load balance fails.
> >> The processing logic:
> >> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
> >>
> >>
> >> It's normal from kernel load balance, but it's not reasonable from the
> >> perspective of VM users.
> >> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
> >> domain to fix it.
> >> Is there any other method to fix this problem? thanks.
> > I'm not sure how to understand your setup and why the load balance is
> > not balancing correctly 16 vCPU between the 8 CPUs.
> >
> > From your test case description below, you have 8 always running
> > threads in cgroup A and 8 always running threads in cgroup B and the 2
> > cgroups have only 8 CPUs among 80. This should not be a problem for
> > load balance. I tried something similar although not exactly the same
> > with cgroupv2 and rt-app and I don't have noticeable imbalance
> >
> > Do you have more details that you can share about your system ?
> >
> > Which kernel version are you using ? Which arch ?
>
> kernel version: 6.11.0-rc7
> arch: X86_64 and cgroup v1
okay
>
> >> Abstracted reproduction case:
> >> 1.environment information:
> >>
> >> [root@localhost ~]# cat /proc/schedstat
> >>
> >> cpu0
> >> domain0 00000000,00000000,00010000,00000000,00000001
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu1
> >> domain0 00000000,00000000,00020000,00000000,00000002
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu2
> >> domain0 00000000,00000000,00040000,00000000,00000004
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu3
> >> domain0 00000000,00000000,00080000,00000000,00000008
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> > Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG ?
> > and cpu80-83 are in the other group of PKG ? and LLC is at domain1 level ?
>
> domain0 is SMT and domain1 is MC
> thread_siblings_list:0,80. 1,81. 2,82. 3,83
Yeah, I should have read more carefully the domain0 cpumask
> LLC is at domain1 level
>
> >> 2.test case:
> >>
> >> vcpu.c
> >> #include <stdio.h>
> >> #include <unistd.h>
> >>
> >> int main()
> >> {
> >> sleep(20);
> >> while (1);
> >> return 0;
> >> }
> >>
> >> gcc vcpu.c -o vcpu
> >> -----------------------------------------------------------------
> >> test.sh
> >>
> >> #!/bin/bash
> >>
> >> #vcpu1
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> >> for i in {1..8}
> >> do
> >> ./vcpu &
> >> pid=$!
> >> sleep 1
> >> echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> >> done
> >>
> >> #vcpu2
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> >> for i in {1..8}
> >> do
> >> ./vcpu &
> >> pid=$!
> >> sleep 1
> >> echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> >> done
> >> ------------------------------------------------------------------
> >> [root@localhost ~]# ./test.sh
> >>
> >> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
> >>
> >> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> >> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> >> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
> >> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
> >> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
> >> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
> >> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> >> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> >> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> >> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> >> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> >> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> >> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> >> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> >> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> >> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
> >>
So I finally understood your situation. The limited cpuset screws up
the avg load of the system for domain1. The group_imbalanced state is
there to try to fix an imbalanced situation related to tasks that are
pinned to a subset of CPUs of the sched domain, but this can't cover
all cases.
> > .
On Thu, Jul 25, 2024 at 12:03:15PM +0000, Zheng Zucheng wrote:
> In extreme test scenarios:
> the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
> utime = 18446744073709518790 ns, rtime = 135989749728000 ns
>
> In cputime_adjust() process, stime is greater than rtime due to
> mul_u64_u64_div_u64() precision problem.
> before call mul_u64_u64_div_u64(),
> stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> after call mul_u64_u64_div_u64(),
> stime = 135989949653530
>
> unsigned reversion occurs because rtime is less than stime.
> utime = rtime - stime = 135989749728000 - 135989949653530
> = -199925530
> = (u64)18446744073709518790
>
> Trigger scenario:
> 1. User task run in kernel mode most of time.
> 2. The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y &&
> TICK_CPU_ACCOUNTING=y
>
> Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
>
> Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
> Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
> ---
> kernel/sched/cputime.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index aa48b2ec879d..365c74e95537 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> }
>
> stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> + if (unlikely(stime > rtime))
> + stime = rtime;
But but but... for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y the code you're
patching is not compiled!
Sorry, I made a mistake here. CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set.
On 2024/7/25 22:05, Peter Zijlstra wrote:
> On Thu, Jul 25, 2024 at 12:03:15PM +0000, Zheng Zucheng wrote:
>> In extreme test scenarios:
>> the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
>> utime = 18446744073709518790 ns, rtime = 135989749728000 ns
>>
>> In cputime_adjust() process, stime is greater than rtime due to
>> mul_u64_u64_div_u64() precision problem.
>> before call mul_u64_u64_div_u64(),
>> stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
>> after call mul_u64_u64_div_u64(),
>> stime = 135989949653530
>>
>> unsigned reversion occurs because rtime is less than stime.
>> utime = rtime - stime = 135989749728000 - 135989949653530
>> = -199925530
>> = (u64)18446744073709518790
>>
>> Trigger scenario:
>> 1. User task run in kernel mode most of time.
>> 2. The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y &&
>> TICK_CPU_ACCOUNTING=y
>>
>> Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
>>
>> Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
>> Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
>> ---
>> kernel/sched/cputime.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> index aa48b2ec879d..365c74e95537 100644
>> --- a/kernel/sched/cputime.c
>> +++ b/kernel/sched/cputime.c
>> @@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
>> }
>>
>> stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
>> + if (unlikely(stime > rtime))
>> + stime = rtime;
> But but but... for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y the code you're
> patching is not compiled!
>
>
> .
On Thu, Jul 25, 2024 at 10:49:46PM +0800, zhengzucheng wrote:
> Sorry, I made a mistake here. CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set.
>
> On 2024/7/25 22:05, Peter Zijlstra wrote:
> > On Thu, Jul 25, 2024 at 12:03:15PM +0000, Zheng Zucheng wrote:
> > > In extreme test scenarios:
> > > the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
> > > utime = 18446744073709518790 ns, rtime = 135989749728000 ns
> > >
> > > In cputime_adjust() process, stime is greater than rtime due to
> > > mul_u64_u64_div_u64() precision problem.
> > > before call mul_u64_u64_div_u64(),
> > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> > > after call mul_u64_u64_div_u64(),
> > > stime = 135989949653530
> > >
> > > unsigned reversion occurs because rtime is less than stime.
> > > utime = rtime - stime = 135989749728000 - 135989949653530
> > > = -199925530
> > > = (u64)18446744073709518790
> > >
> > > Trigger scenario:
> > > 1. User task run in kernel mode most of time.
> > > 2. The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y &&
> > > TICK_CPU_ACCOUNTING=y
> > >
> > > Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
> > >
> > > Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
> > > Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
> > > ---
> > > kernel/sched/cputime.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > > index aa48b2ec879d..365c74e95537 100644
> > > --- a/kernel/sched/cputime.c
> > > +++ b/kernel/sched/cputime.c
> > > @@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> > > }
> > > stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> > > + if (unlikely(stime > rtime))
> > > + stime = rtime;
Ooh,.. I see, this is because the generic fallback for
mul_u64_u64_div_u64() is yuck :/
On x86_64 this is just two instructions and it does a native:
u64*u64->u128
u128/u64->u64
And this should never happen. But in the generic case, we approximate and
urgh.
So yeah, but then perhaps add a comment like:
/*
* Because mul_u64_u64_div_u64() can approximate on some
* achitectures; enforce the constraint that: a*b/(b+c) <= a.
*/
if (unlikely(stime > rtime))
stime = rtime;
Also, I would look into doing a native arm64 version, I'd be surprised
if it could not do better than the generic variant.
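For illustration only, a minimal user-space sketch (not the in-tree code)
of what a full-precision variant looks like when the compiler provides
128-bit integers -- effectively what the two x86_64 instructions compute,
and what a native arm64 version could build on:

    #include <stdint.h>

    /*
     * Sketch: exact a * b / c via a 64x64->128 multiply and one
     * 128-by-64 division. Assumes the true quotient fits in 64 bits,
     * as it does for the cputime_adjust() call, so the stime > rtime
     * case cannot arise from rounding.
     */
    static uint64_t mul_u64_u64_div_u64_exact(uint64_t a, uint64_t b, uint64_t c)
    {
            return (uint64_t)(((unsigned __int128)a * b) / c);
    }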
In extreme test scenarios:
the 14th field (utime) in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In the cputime_adjust() path, stime becomes greater than rtime due to a
mul_u64_u64_div_u64() precision problem.
before calling mul_u64_u64_div_u64():
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after calling mul_u64_u64_div_u64():
stime = 135989949653530

Unsigned underflow (reversion) occurs because rtime is less than stime:
utime = rtime - stime = 135989749728000 - 135989949653530
                      = -199925530
                      = (u64)18446744073709518790

Trigger condition:
1). The user task runs in kernel mode most of the time
2). ARM64 architecture
3). TICK_CPU_ACCOUNTING=y
    CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

Fix the mul_u64_u64_div_u64() conversion precision problem by resetting
stime to rtime.
v2:
- Add comment
- Update trigger condition
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
---
kernel/sched/cputime.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index aa48b2ec879d..4feef0d4e449 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
}
stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
+ /*
+ * Because mul_u64_u64_div_u64() can approximate on some
+ * achitectures; enforce the constraint that: a*b/(b+c) <= a.
+ */
+ if (unlikely(stime > rtime))
+ stime = rtime;
update:
/*
--
2.34.1
On 07/26, Zheng Zucheng wrote:
>
> before call mul_u64_u64_div_u64(),
> stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
So stime + utime == 175138003500000
> after call mul_u64_u64_div_u64(),
> stime = 135989949653530
Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
returns 135989749728000 == rtime, see below.
Nevermind...
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> }
>
> stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> + /*
> + * Because mul_u64_u64_div_u64() can approximate on some
> + * achitectures; enforce the constraint that: a*b/(b+c) <= a.
> + */
> + if (unlikely(stime > rtime))
> + stime = rtime;
Thanks,
Acked-by: Oleg Nesterov <oleg@redhat.com>
-------------------------------------------------------------------------------
But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
See the new() function in the code below.
Say, with the numbers above I get
$ ./test 175136586720000 135989749728000 175138003500000
old -> 135989749728000 e=1100089950.609375
new -> 135988649638050 e=0.609375
Oleg.
-------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
typedef unsigned long long u64;
static inline int fls64(u64 x)
{
	int bitpos = -1;
	/*
	 * AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the
	 * dest reg is undefined if x==0, but their CPU architect says its
	 * value is written to set it to the same as before.
	 */
	asm("bsrq %1,%q0"
	    : "+r" (bitpos)
	    : "rm" (x));
	return bitpos + 1;
}

static inline int ilog2(u64 n)
{
	return fls64(n) - 1;
}

#define swap(a, b) \
	do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

static inline u64 div64_u64_rem(u64 dividend, u64 divisor, u64 *remainder)
{
	*remainder = dividend % divisor;
	return dividend / divisor;
}

static inline u64 div64_u64(u64 dividend, u64 divisor)
{
	return dividend / divisor;
}

//-----------------------------------------------------------------------------
// current implementation of mul_u64_u64_div_u64
u64 old(u64 a, u64 b, u64 c)
{
	u64 res = 0, div, rem;
	int shift;

	/* can a * b overflow ? */
	if (ilog2(a) + ilog2(b) > 62) {
		/*
		 * Note that the algorithm after the if block below might lose
		 * some precision and the result is more exact for b > a. So
		 * exchange a and b if a is bigger than b.
		 *
		 * For example with a = 43980465100800, b = 100000000, c = 1000000000
		 * the below calculation doesn't modify b at all because div == 0
		 * and then shift becomes 45 + 26 - 62 = 9 and so the result
		 * becomes 4398035251080. However with a and b swapped the exact
		 * result is calculated (i.e. 4398046510080).
		 */
		if (a > b)
			swap(a, b);

		/*
		 * (b * a) / c is equal to
		 *
		 *	(b / c) * a +
		 *	(b % c) * a / c
		 *
		 * if nothing overflows. Can the 1st multiplication
		 * overflow? Yes, but we do not care: this can only
		 * happen if the end result can't fit in u64 anyway.
		 *
		 * So the code below does
		 *
		 *	res = (b / c) * a;
		 *	b = b % c;
		 */
		div = div64_u64_rem(b, c, &rem);
		res = div * a;
		b = rem;

		shift = ilog2(a) + ilog2(b) - 62;
		if (shift > 0) {
			/* drop precision */
			b >>= shift;
			c >>= shift;
			if (!c)
				return res;
		}
	}

	return res + div64_u64(a * b, c);
}

u64 new(u64 a, u64 b, u64 c)
{
	u64 res = 0, div, rem;

	/* can a * b overflow ? */
	while (ilog2(a) + ilog2(b) > 62) {
		if (a > b)
			swap(b, a);

		if (b >= c) {
			/*
			 * (b * a) / c is equal to
			 *
			 *	(b / c) * a +
			 *	(b % c) * a / c
			 *
			 * if nothing overflows. Can the 1st multiplication
			 * overflow? Yes, but we do not care: this can only
			 * happen if the end result can't fit in u64 anyway.
			 *
			 * So the code below does
			 *
			 *	res += (b / c) * a;
			 *	b = b % c;
			 */
			div = div64_u64_rem(b, c, &rem);
			res += div * a;
			b = rem;
			continue;
		}

		/* drop precision */
		b >>= 1;
		c >>= 1;
		if (!c)
			return res;
	}

	return res + div64_u64(a * b, c);
}

int main(int argc, char **argv)
{
	u64 a, b, c, ro, rn;
	double rd;

	assert(argc == 4);
	a = strtoull(argv[1], NULL, 0);
	b = strtoull(argv[2], NULL, 0);
	c = strtoull(argv[3], NULL, 0);

	rd = (((double)a) * b) / c;
	ro = old(a, b, c);
	rn = new(a, b, c);

	printf("old -> %lld\te=%f\n", ro, ro - rd);
	printf("new -> %lld\te=%f\n", rn, rn - rd);

	return 0;
}
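A possible way to build and run the program above, matching the ./test
invocation shown earlier (the source file name is an assumption, and the
inline bsrq asm makes it x86_64-only):

    gcc -O2 test.c -o test
    ./test 175136586720000 135989749728000 175138003500000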
Hello,
On Fri, Jul 26, 2024 at 12:44:29PM +0200, Oleg Nesterov wrote:
> On 07/26, Zheng Zucheng wrote:
> >
> > before call mul_u64_u64_div_u64(),
> > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
>
> So stime + utime == 175138003500000
>
> > after call mul_u64_u64_div_u64(),
> > stime = 135989949653530
>
> Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> returns 135989749728000 == rtime, see below.
>
> Nevermind...
>
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
> > }
> >
> > stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> > + /*
> > + * Because mul_u64_u64_div_u64() can approximate on some
> > + * achitectures; enforce the constraint that: a*b/(b+c) <= a.
> > + */
> > + if (unlikely(stime > rtime))
> > + stime = rtime;
>
> Thanks,
>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
>
> -------------------------------------------------------------------------------
> But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
Note there is a patch by Nicolas Pitre currently in mm-nonmm-unstable
that makes mul_u64_u64_div_u64() precise. It was in next for a while as
commit 3cc8bf1a81ef ("mul_u64_u64_div_u64: make it precise always")
which might explain problems to reproduce the incorrect results.
An obvious alternative to backporting this change to
kernel/sched/cputime.c for stable is to backport Nicolas's patch
instead. Andrew asked me to provide a justification to send Nicolas's
patch for inclusion in the current devel cycle. So it might make it in
before 6.11.
Best regards
Uwe
On 07/26, Oleg Nesterov wrote:
>
> On 07/26, Zheng Zucheng wrote:
> >
> > before call mul_u64_u64_div_u64(),
> > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
>
> So stime + utime == 175138003500000
>
> > after call mul_u64_u64_div_u64(),
> > stime = 135989949653530
>
> Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> returns 135989749728000 == rtime, see below.

Seriously, can you re-check your numbers? it would be nice to understand why
x86_64 differs...

> But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
> See the new() function in the code below.

Just in case, the usage of ilog2 can be improved, but this is minor.

Oleg.
On Fri, Jul 26, 2024 at 03:04:01PM +0200, Oleg Nesterov wrote:
> On 07/26, Oleg Nesterov wrote:
> >
> > On 07/26, Zheng Zucheng wrote:
> > >
> > > before call mul_u64_u64_div_u64(),
> > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> >
> > So stime + utime == 175138003500000
> >
> > > after call mul_u64_u64_div_u64(),
> > > stime = 135989949653530
> >
> > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> > returns 135989749728000 == rtime, see below.
>
> Seriously, can you re-check your numbers? it would be nice to understand why
> x86_64 differs...

x86_64 has a custom mul_u64_u64_div_u64() implementation.

> > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
> > See the new() function in the code below.
>
> Just in case, the usage of ilog2 can be improved, but this is minor.

I meant to go look at this, it seems to loop less than your improved
version, but I'm chasing crashes atm. Perhaps it provides inspiration.

https://codebrowser.dev/llvm/compiler-rt/lib/builtins/udivmodti4.c.html#__udivmodti4
On 07/26, Peter Zijlstra wrote:
>
> On Fri, Jul 26, 2024 at 03:04:01PM +0200, Oleg Nesterov wrote:
> > On 07/26, Oleg Nesterov wrote:
> > >
> > > On 07/26, Zheng Zucheng wrote:
> > > >
> > > > before call mul_u64_u64_div_u64(),
> > > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
> > >
> > > So stime + utime == 175138003500000
> > >
> > > > after call mul_u64_u64_div_u64(),
> > > > stime = 135989949653530
> > >
> > > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000)
> > > returns 135989749728000 == rtime, see below.
> >
> > Seriously, can you re-check your numbers? it would be nice to understand why
> > x86_64 differs...
>
> x86_64 has a custom mul_u64_u64_div_u64() implementation.

Yes sure, but my user-space test-case uses the "generic" version,

> > > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ?
> > > See the new() function in the code below.
> >
> > Just in case, the usage of ilog2 can be improved, but this is minor.
>
> I meant to go look at this, it seems to loop less than your improved
> version, but I'm chasing crashes atm. Perhaps it provides inspiration.
>
> https://codebrowser.dev/llvm/compiler-rt/lib/builtins/udivmodti4.c.html#__udivmodti4

Thanks... I'll try to take a look tomorrow, but at first glance I will
never understand this (clever) code ;)

Oleg.
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 77baa5bafcbe1b2a15ef9c37232c21279c95481c
Gitweb: https://git.kernel.org/tip/77baa5bafcbe1b2a15ef9c37232c21279c95481c
Author: Zheng Zucheng <zhengzucheng@huawei.com>
AuthorDate: Fri, 26 Jul 2024 02:32:35
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 29 Jul 2024 12:22:32 +02:00
sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
In extreme test scenarios:
the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns
In cputime_adjust() process, stime is greater than rtime due to
mul_u64_u64_div_u64() precision problem.
before call mul_u64_u64_div_u64(),
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after call mul_u64_u64_div_u64(),
stime = 135989949653530
unsigned reversion occurs because rtime is less than stime.
utime = rtime - stime = 135989749728000 - 135989949653530
= -199925530
= (u64)18446744073709518790
Trigger condition:
1). User task run in kernel mode most of time
2). ARM64 architecture
3). TICK_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
---
kernel/sched/cputime.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a5e0029..0bed0fa 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
}
stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
+ /*
+ * Because mul_u64_u64_div_u64() can approximate on some
+ * achitectures; enforce the constraint that: a*b/(b+c) <= a.
+ */
+ if (unlikely(stime > rtime))
+ stime = rtime;
update:
/*