[PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour

Mel Gorman posted 4 patches 4 years, 1 month ago
There is a newer version of this series
[PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour
Posted by Mel Gorman 4 years, 1 month ago
A problem was reported privately related to inconsistent performance of
NAS when parallelised with MPICH. The root of the problem is that the
initial placement is unpredictable and there can be a larger imbalance
than expected between NUMA nodes. As there is spare capacity and the faults
are local, the imbalance persists for a long time and performance suffers.

This is not 100% an "allowed imbalance" problem as setting the allowed
imbalance to 0 does not fix the issue but the allowed imbalance contributes
to the performance problem. The unpredictable behaviour was most recently
introduced by commit c6f886546cb8 ("sched/fair: Trigger the update of
blocked load on newly idle cpu").

mpirun forks hydra_pmi_proxy helpers with MPICH that go to sleep before
execing the target workload. As the new tasks are sleeping, the potential
imbalance is not observed as idle_cpus does not reflect the tasks that
will be running in the near future. How bad the problem is depends on the
timing of when fork happens and whether the new tasks are still running.
Consequently, a large initial imbalance may not be detected until the
workload is fully running. Once running, NUMA Balancing picks the preferred
node based on locality and runtime load balancing often ignores the tasks
as can_migrate_task() fails for either locality or task_hot reasons and
instead picks unrelated tasks.
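
The effect described above can be illustrated with a toy two-node model.
This is only a sketch: place_forked_tasks(), its parameters and the
placement rule are hypothetical and are not the kernel's algorithm. The
only point is that if placement looks at tasks it can see running, tasks
that sleep right after fork never count, so every fork stays local.

```python
def place_forked_tasks(n_tasks, tasks_sleep_after_fork, allowed_imbalance=0):
    """Toy model (hypothetical): return [tasks on node 0, tasks on node 1]."""
    placed = [0, 0]    # where each task was put at fork time
    visible = [0, 0]   # tasks the placement decision can see as running
    for _ in range(n_tasks):
        # Stay on the local node (node 0) while the *visible* imbalance
        # is within the allowed threshold, otherwise use the other node.
        node = 0 if visible[0] - visible[1] <= allowed_imbalance else 1
        placed[node] += 1
        if not tasks_sleep_after_fork:
            # A running task is visible to the next placement decision;
            # a sleeping one is not.
            visible[node] += 1
    return placed


print(place_forked_tasks(16, tasks_sleep_after_fork=False))  # [8, 8]
print(place_forked_tasks(16, tasks_sleep_after_fork=True))   # [16, 0]
```

In this model the sleeping case ends up with all 16 tasks on one node even
with an allowed imbalance of 0, which is the shape of the problem described
above, not a reproduction of it.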

This is the min, max and range of run time for mg.D parallelised by MPICH
using ~25% of the CPUs on a 2-socket machine (80 CPUs, 16 active for mg.D
due to limitations of mg.D).

v5.3                         Min  95.84 Max  96.55 Range   0.71 Mean  96.16
v5.7                         Min  95.44 Max  96.51 Range   1.07 Mean  96.14
v5.8                         Min  96.02 Max 197.08 Range 101.06 Mean 154.70
v5.12                        Min 104.45 Max 111.03 Range   6.58 Mean 105.94
v5.13                        Min 104.38 Max 170.37 Range  65.99 Mean 117.35
v5.13-revert-c6f886546cb8    Min 104.40 Max 110.70 Range   6.30 Mean 105.68 
v5.18rc4-baseline            Min 104.46 Max 169.04 Range  64.58 Mean 130.49
v5.18rc4-revert-c6f886546cb8 Min 113.98 Max 117.29 Range   3.31 Mean 114.71
v5.18rc4-this_series         Min  95.24 Max 175.33 Range  80.09 Mean 108.91
v5.18rc4-this_series+revert  Min  95.24 Max  99.87 Range   4.63 Mean  96.54

This shows that we've had unpredictable performance for a long time for
this load. Instability was introduced somewhere between v5.7 and v5.8,
fixed in v5.12 and broken again since v5.13.  The revert against 5.13
and 5.18-rc4 shows that c6f886546cb8 is the primary source of instability
although the best case is still worse than 5.7.
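
As a rough cross-check of the table, the Range column is simply Max minus
Min, and the kernels described as unstable are the ones where that range
is large. A minimal sketch, assuming an arbitrary cut-off of 10 (in the
table's run time unit) that is chosen here purely for illustration:

```python
# (min, max) run times copied from the table above.
runs = {
    "v5.3": (95.84, 96.55),
    "v5.7": (95.44, 96.51),
    "v5.8": (96.02, 197.08),
    "v5.12": (104.45, 111.03),
    "v5.13": (104.38, 170.37),
    "v5.13-revert-c6f886546cb8": (104.40, 110.70),
    "v5.18rc4-baseline": (104.46, 169.04),
    "v5.18rc4-revert-c6f886546cb8": (113.98, 117.29),
    "v5.18rc4-this_series": (95.24, 175.33),
    "v5.18rc4-this_series+revert": (95.24, 99.87),
}

UNSTABLE_RANGE = 10.0  # hypothetical threshold, not from the original mail


def run_time_range(lo, hi):
    """Range as reported in the table: max minus min, two decimals."""
    return round(hi - lo, 2)


unstable = [k for k, (lo, hi) in runs.items()
            if run_time_range(lo, hi) > UNSTABLE_RANGE]
print(unstable)
```

With this cut-off the flagged kernels are v5.8, v5.13, v5.18rc4-baseline
and v5.18rc4-this_series; both reverts and the series plus the revert stay
below it.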

This series addresses the allowed imbalance problems to get the peak
performance back to 5.7 although only some of the time due to the
instability problem. The series plus the revert is both stable and has
slightly better peak performance and similar average performance. I'm
not convinced commit c6f886546cb8 is wrong but haven't isolated exactly
why it's unstable so for now, I'm just noting it has an issue.

Patch 1 initialises numa_migrate_retry. While this resolves itself
	eventually, it is unpredictable early in the lifetime of
	a task.

Patch 2 will not swap NUMA tasks in the same NUMA group or without
	a NUMA group if there is spare capacity. Swapping is just
	punishing one task to help another.

Patch 3 fixes an issue where a larger imbalance can be created at
	fork time than would be allowed at run time. This behaviour
	can help some workloads that are short lived and prefer
	to remain local but it punishes long-lived tasks that are
	memory intensive.

Patch 4 adjusts the threshold where a NUMA imbalance is allowed to
	better approximate the number of memory channels, at least
	for x86-64.

 kernel/sched/fair.c     | 59 ++++++++++++++++++++++++++---------------
 kernel/sched/topology.c | 23 ++++++++++------
 2 files changed, 53 insertions(+), 29 deletions(-)

-- 
2.34.1
Re: [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour
Posted by K Prateek Nayak 4 years ago
Hello Mel,

We tested the patch series on our systems.

tl;dr

Results of testing:
- Benefits short running Stream tasks in NPS2 and NPS4 mode.
- Benefits seen for tbench in NPS1 mode for 8-128 worker count.
- Regression in Hackbench with 16 groups in NPS1 mode. A rerun for
  same data point suggested run to run variation on patched kernel.
- Regression in case of tbench with 32 and 64 workers in NPS2 mode.
  Patched kernel however seems to report more stable value for 64
  worker count compared to tip.
- Slight regression in schbench in NPS2 and NPS4 mode for large
  worker count but we did spot some run to run variation with
  both tip and patched kernel.

Below are all the detailed numbers for the benchmarks.

On 5/11/2022 8:00 PM, Mel Gorman wrote:
> A problem was reported privately related to inconsistent performance of
> NAS when parallelised with MPICH. The root of the problem is that the
> initial placement is unpredictable and there can be a larger imbalance
> than expected between NUMA nodes. As there is spare capacity and the faults
> are local, the imbalance persists for a long time and performance suffers.
> 
> This is not 100% an "allowed imbalance" problem as setting the allowed
> imbalance to 0 does not fix the issue but the allowed imbalance contributes
> to the performance problem. The unpredictable behaviour was most recently
> introduced by commit c6f886546cb8 ("sched/fair: Trigger the update of
> blocked load on newly idle cpu").
> 
> mpirun forks hydra_pmi_proxy helpers with MPICH that go to sleep before
> execing the target workload. As the new tasks are sleeping, the potential
> imbalance is not observed as idle_cpus does not reflect the tasks that
> will be running in the near future. How bad the problem is depends on the
> timing of when fork happens and whether the new tasks are still running.
> Consequently, a large initial imbalance may not be detected until the
> workload is fully running. Once running, NUMA Balancing picks the preferred
> node based on locality and runtime load balancing often ignores the tasks
> as can_migrate_task() fails for either locality or task_hot reasons and
> instead picks unrelated tasks.
> 
> This is the min, max and range of run time for mg.D parallelised by MPICH
> using ~25% of the CPUs on a 2-socket machine (80 CPUs, 16 active for mg.D
> due to limitations of mg.D).
> 
> v5.3                         Min  95.84 Max  96.55 Range   0.71 Mean  96.16
> v5.7                         Min  95.44 Max  96.51 Range   1.07 Mean  96.14
> v5.8                         Min  96.02 Max 197.08 Range 101.06 Mean 154.70
> v5.12                        Min 104.45 Max 111.03 Range   6.58 Mean 105.94
> v5.13                        Min 104.38 Max 170.37 Range  65.99 Mean 117.35
> v5.13-revert-c6f886546cb8    Min 104.40 Max 110.70 Range   6.30 Mean 105.68 
> v5.18rc4-baseline            Min 104.46 Max 169.04 Range  64.58 Mean 130.49
> v5.18rc4-revert-c6f886546cb8 Min 113.98 Max 117.29 Range   3.31 Mean 114.71
> v5.18rc4-this_series         Min  95.24 Max 175.33 Range  80.09 Mean 108.91
> v5.18rc4-this_series+revert  Min  95.24 Max  99.87 Range   4.63 Mean  96.54

Following are the results from testing on a dual socket Zen3 system
(2 x 64C/128T) in different NPS modes.

Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255
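
The node-to-CPU mapping can be derived mechanically. The sketch below
assumes what the listed ranges suggest: cores are numbered 0-127 across
the two sockets, each core's SMT sibling is numbered at an offset of 128,
and every node gets an equal, contiguous share. nps_layout() is a
hypothetical helper written for this illustration.

```python
def nps_layout(nps, sockets=2, cores_per_socket=64):
    """Return, per NUMA node, ((first_core, last_core), (first_sib, last_sib)).

    Assumes contiguous core numbering and SMT siblings offset by the
    total core count (128 on the 2 x 64C/128T system described above).
    """
    total_cores = sockets * cores_per_socket
    nodes = sockets * nps
    per_node = total_cores // nodes
    layout = []
    for node in range(nodes):
        first = node * per_node
        last = first + per_node - 1
        layout.append(((first, last),
                       (first + total_cores, last + total_cores)))
    return layout


for node, (cores, siblings) in enumerate(nps_layout(4)):
    print(f"Node {node}: {cores[0]}-{cores[1]}, {siblings[0]}-{siblings[1]}")
```

Under these assumptions each node's sibling range has the same width as
its core range, so NPS4 gives 16 cores and 16 siblings per node.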

Kernel versions:
- tip:      	5.18-rc1 tip sched/core
- NUMA Bal:    	5.18-rc1 tip sched/core + this patch series

When we began testing, we recorded the tip at:

commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"

Following are the results from the benchmark:

Note: Results marked with * are data points of concern. A rerun
for the data point has been provided on both the tip and the
patched kernel to check for any run to run variation.

~~~~~~~~~
hackbench
~~~~~~~~~

NPS1

Test:                   tip                     NUMA Bal
 1-groups:         4.64 (0.00 pct)         4.67 (-0.64 pct)
 2-groups:         5.38 (0.00 pct)         5.47 (-1.67 pct)
 4-groups:         6.15 (0.00 pct)         6.24 (-1.46 pct)
 8-groups:         7.42 (0.00 pct)         7.45 (-0.40 pct)
16-groups:        10.70 (0.00 pct)        12.04 (-12.52 pct)    *
16-groups:        10.81 (0.00 pct)        11.00 (-1.72 pct)     [Verification Run]

NPS2

Test:                   tip                     NUMA Bal
 1-groups:         4.70 (0.00 pct)         4.68 (0.42 pct)
 2-groups:         5.45 (0.00 pct)         5.50 (-0.91 pct)
 4-groups:         6.13 (0.00 pct)         6.13 (0.00 pct)
 8-groups:         7.30 (0.00 pct)         7.21 (1.23 pct)
16-groups:        10.30 (0.00 pct)        10.29 (0.09 pct)

NPS4

Test:                   tip                     NUMA Bal
 1-groups:         4.60 (0.00 pct)         4.55 (1.08 pct)
 2-groups:         5.41 (0.00 pct)         5.37 (0.73 pct)
 4-groups:         6.12 (0.00 pct)         6.20 (-1.30 pct)
 8-groups:         7.22 (0.00 pct)         7.29 (-0.96 pct)
16-groups:        10.24 (0.00 pct)        10.27 (-0.29 pct)
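
For reading the "pct" columns: the values appear to be the change relative
to tip, signed so that a positive number means the patched kernel did
better. That sign convention is inferred from the numbers, not stated in
the mail, and pct_vs_tip() is a hypothetical helper. For time or latency
style results (hackbench, schbench) lower is better; for throughput
results (tbench, Stream) higher is better.

```python
def pct_vs_tip(tip, patched, lower_is_better):
    """Signed percentage change relative to tip.

    Positive means the patched kernel is better under the given metric
    direction (assumed convention, inferred from the tables).
    """
    delta = (tip - patched) if lower_is_better else (patched - tip)
    return delta / tip * 100.0


# hackbench, NPS1, 16 groups: 10.70 on tip vs 12.04 patched.
print(f"{pct_vs_tip(10.70, 12.04, lower_is_better=True):.2f}")  # -12.52
```

The figures in the tables look truncated to two decimals rather than
rounded, so a recomputed value can differ in the last digit.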

~~~~~~~~
schbench
~~~~~~~~

NPS1

#workers:      tip                   NUMA Bal
  1:      29.00 (0.00 pct)        22.50 (22.41 pct)
  2:      28.00 (0.00 pct)        27.00 (3.57 pct)
  4:      31.50 (0.00 pct)        32.00 (-1.58 pct)
  8:      42.00 (0.00 pct)        39.50 (5.95 pct)
 16:      56.50 (0.00 pct)        56.50 (0.00 pct)
 32:      94.50 (0.00 pct)        95.00 (-0.52 pct)
 64:     176.00 (0.00 pct)       176.00 (0.00 pct)
128:     404.00 (0.00 pct)       395.50 (2.10 pct)
256:     869.00 (0.00 pct)       856.00 (1.49 pct)
512:     58432.00 (0.00 pct)     58688.00 (-0.43 pct)

NPS2

#workers:      tip                   NUMA Bal
  1:      26.50 (0.00 pct)        26.00 (1.88 pct)
  2:      26.50 (0.00 pct)        24.50 (7.54 pct)
  4:      34.50 (0.00 pct)        30.50 (11.59 pct)
  8:      45.00 (0.00 pct)        42.00 (6.66 pct)
 16:      56.50 (0.00 pct)        55.50 (1.76 pct)
 32:      95.50 (0.00 pct)        95.00 (0.52 pct)
 64:     179.00 (0.00 pct)       176.00 (1.67 pct)
128:     369.00 (0.00 pct)       400.50 (-8.53 pct)   *
128:     380.00 (0.00 pct)       388.00 (-2.10 pct)   [Verification Run]
256:     898.00 (0.00 pct)       883.00 (1.67 pct)
512:     56256.00 (0.00 pct)     58752.00 (-4.43 pct)

NPS4

#workers:      tip                   NUMA Bal
  1:      25.00 (0.00 pct)        24.00 (4.00 pct)
  2:      28.00 (0.00 pct)        27.50 (1.78 pct)
  4:      29.50 (0.00 pct)        29.50 (0.00 pct)
  8:      41.00 (0.00 pct)        39.00 (4.87 pct)
 16:      65.50 (0.00 pct)        66.00 (-0.76 pct)
 32:      93.00 (0.00 pct)        94.50 (-1.61 pct)
 64:     170.50 (0.00 pct)       176.50 (-3.51 pct)
128:     377.00 (0.00 pct)       390.50 (-3.58 pct)
256:     867.00 (0.00 pct)       919.00 (-5.99 pct)     *
256:     890.00 (0.00 pct)       930.00 (-4.49 pct)     [Verification Run]
512:     58048.00 (0.00 pct)     59520.00 (-2.53 pct)

~~~~~~
tbench
~~~~~~

NPS1

Clients:      tip                    NUMA Bal
    1    443.31 (0.00 pct)       458.77 (3.48 pct)
    2    877.32 (0.00 pct)       898.76 (2.44 pct)
    4    1665.11 (0.00 pct)      1658.76 (-0.38 pct)
    8    3016.68 (0.00 pct)      3133.91 (3.88 pct)
   16    5374.30 (0.00 pct)      5816.28 (8.22 pct)
   32    8763.86 (0.00 pct)      9843.94 (12.32 pct)
   64    15786.93 (0.00 pct)     17562.26 (11.24 pct)
  128    26826.08 (0.00 pct)     28241.35 (5.27 pct)
  256    24207.35 (0.00 pct)     22242.20 (-8.11 pct)
  512    51740.58 (0.00 pct)     51678.30 (-0.12 pct)
 1024    51177.82 (0.00 pct)     50699.27 (-0.93 pct)

NPS2

Clients:       tip                    NUMA Bal
    1    449.49 (0.00 pct)       467.77 (4.06 pct)
    2    867.28 (0.00 pct)       876.20 (1.02 pct)
    4    1643.60 (0.00 pct)      1661.94 (1.11 pct)
    8    3047.35 (0.00 pct)      3040.70 (-0.21 pct)
   16    5340.77 (0.00 pct)      5168.57 (-3.22 pct)
   32    10536.85 (0.00 pct)     9603.93 (-8.85 pct)            *
   32    10424.00 (0.00 pct)     9868.67 (-5.32 pct)            [Verification Run]
   64    16543.23 (0.00 pct)     15749.69 (-4.79 pct)           *
   64    17753.50 (0.00 pct)     15599.03 (-12.13 pct)          [Verification Run]
  128    26400.40 (0.00 pct)     27745.52 (5.09 pct)
  256    23436.75 (0.00 pct)     27978.91 (19.38 pct)
  512    50902.60 (0.00 pct)     50770.42 (-0.25 pct)
 1024    50216.10 (0.00 pct)     49702.00 (-1.02 pct)

NPS4

Clients:       tip                    NUMA Bal
    1    443.82 (0.00 pct)       452.63 (1.98 pct)
    2    849.14 (0.00 pct)       857.86 (1.02 pct)
    4    1603.26 (0.00 pct)      1635.02 (1.98 pct)
    8    2972.37 (0.00 pct)      3090.09 (3.96 pct)
   16    5277.13 (0.00 pct)      5524.38 (4.68 pct)
   32    9744.73 (0.00 pct)      10152.62 (4.18 pct)
   64    15854.80 (0.00 pct)     17442.86 (10.01 pct)
  128    26116.97 (0.00 pct)     26757.21 (2.45 pct)
  256    22403.25 (0.00 pct)     21178.82 (-5.46 pct)
  512    48317.20 (0.00 pct)     47433.34 (-1.82 pct)
 1024    50445.41 (0.00 pct)     50311.83 (-0.26 pct)

Note: tbench results for 256 workers are known to have
a great amount of run to run variation on the test
machine. Any regression seen for the data point can
be safely ignored.

~~~~~~
Stream
~~~~~~

- 10 runs

NPS1

Test:          tip                     NUMA Bal
 Copy:   189113.11 (0.00 pct)    183548.36 (-2.94 pct)
Scale:   201190.61 (0.00 pct)    199548.74 (-0.81 pct)
  Add:   232654.21 (0.00 pct)    230058.79 (-1.11 pct)
Triad:   226583.57 (0.00 pct)    224761.89 (-0.80 pct)

NPS2

Test:          tip                     NUMA Bal
 Copy:   155347.14 (0.00 pct)    226212.24 (45.61 pct)
Scale:   191701.53 (0.00 pct)    212667.40 (10.93 pct)
  Add:   210013.97 (0.00 pct)    257112.85 (22.42 pct)
Triad:   207602.00 (0.00 pct)    250309.89 (20.57 pct)

NPS4

Test:          tip                     NUMA Bal
 Copy:   136421.15 (0.00 pct)    159681.42 (17.05 pct)
Scale:   191217.59 (0.00 pct)    193113.39 (0.99 pct)
  Add:   189229.52 (0.00 pct)    209058.15 (10.47 pct)
Triad:   188052.99 (0.00 pct)    205945.57 (9.51 pct)

- 100 runs

NPS1

Test:          tip                     NUMA Bal
 Copy:   244693.32 (0.00 pct)    233080.12 (-4.74 pct)
Scale:   221874.99 (0.00 pct)    215975.10 (-2.65 pct)
  Add:   268363.89 (0.00 pct)    263649.67 (-1.75 pct)
Triad:   260945.24 (0.00 pct)    250936.80 (-3.83 pct)

NPS2

Test:          tip                     NUMA Bal
 Copy:   211262.00 (0.00 pct)    251292.59 (18.94 pct)
Scale:   222493.34 (0.00 pct)    222258.48 (-0.10 pct)
  Add:   280277.17 (0.00 pct)    279649.40 (-0.22 pct)
Triad:   265860.49 (0.00 pct)    265383.54 (-0.17 pct)

NPS4

Test:           tip                     NUMA Bal
 Copy:   250171.40 (0.00 pct)    252465.44 (0.91 pct)
Scale:   222293.56 (0.00 pct)    228169.89 (2.64 pct)
  Add:   279222.16 (0.00 pct)    290568.29 (4.06 pct)
Triad:   262013.92 (0.00 pct)    273825.25 (4.50 pct)

~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~

NPS1

sched-tip:      303718.33 (var: 1.31)
NUMA Bal:       299859.00 (var: 1.05)    (-1.27%)

NPS2

sched-tip:      304536.33 (var: 2.46)
NUMA Bal:       302469.67 (var: 1.38)    (-0.67%)

NPS4

sched-tip:      301192.33 (var: 1.81)
NUMA Bal:       300948.00 (var: 0.85)   (-0.08%)

~~~~~
Notes
~~~~~

- Hackbench on NPS1 mode seems to show run to run variation with
  patched kernel. I'll gather some more data to check if this happens
  consistently or not.
  The number reported for hackbench is the Amean of 10 runs.
- schbench seems to show some variation on both tip and the patched
  kernel for the data points with regression. This is evident from
  the [Verification Run] done for these data points.
  schbench runs are done with 1 messenger and n workers.
- tbench seems to show some regression for 32 workers and 64 workers
  in NPS2 mode. The case with 32 workers shows a consistent result;
  however, tip seems to see slight run to run variation for 64
  workers.

- Stream sees great benefit in NPS2 mode and NPS4 mode for short runs.
- Great improvements seen for tbench with 8-128 workers in NPS1 mode.

> 
> This shows that we've had unpredictable performance for a long time for
> this load. Instability was introduced somewhere between v5.7 and v5.8,
> fixed in v5.12 and broken again since v5.13.  The revert against 5.13
> and 5.18-rc4 shows that c6f886546cb8 is the primary source of instability
> although the best case is still worse than 5.7.
> 
> This series addresses the allowed imbalance problems to get the peak
> performance back to 5.7 although only some of the time due to the
> instability problem. The series plus the revert is both stable and has
> slightly better peak performance and similar average performance. I'm
> not convinced commit c6f886546cb8 is wrong but haven't isolated exactly
> why it's unstable so for now, I'm just noting it has an issue.
> 
> Patch 1 initialises numa_migrate_retry. While this resolves itself
> 	eventually, it is unpredictable early in the lifetime of
> 	a task.
> 
> Patch 2 will not swap NUMA tasks in the same NUMA group or without
> 	a NUMA group if there is spare capacity. Swapping is just
> 	punishing one task to help another.
> 
> Patch 3 fixes an issue where a larger imbalance can be created at
> 	fork time than would be allowed at run time. This behaviour
> 	can help some workloads that are short lived and prefer
> 	to remain local but it punishes long-lived tasks that are
> 	memory intensive.
> 
> Patch 4 adjusts the threshold where a NUMA imbalance is allowed to
> 	better approximate the number of memory channels, at least
> 	for x86-64.

The entire patch series was applied as is for testing.

> 
>  kernel/sched/fair.c     | 59 ++++++++++++++++++++++++++---------------
>  kernel/sched/topology.c | 23 ++++++++++------
>  2 files changed, 53 insertions(+), 29 deletions(-)
> 

Please let me know if you would like me to get some additional
data on the test system.
--
Thanks and Regards,
Prateek
Re: [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour
Posted by Mel Gorman 4 years ago
On Fri, May 20, 2022 at 10:28:02AM +0530, K Prateek Nayak wrote:
> Hello Mel,
> 
> We tested the patch series on our systems.
> 
> tl;dr
> 
> Results of testing:
> - Benefits short running Stream tasks in NPS2 and NPS4 mode.
> - Benefits seen for tbench in NPS1 mode for 8-128 worker count.
> - Regression in Hackbench with 16 groups in NPS1 mode. A rerun for
>   same data point suggested run to run variation on patched kernel.
> - Regression in case of tbench with 32 and 64 workers in NPS2 mode.
>   Patched kernel however seems to report more stable value for 64
>   worker count compared to tip.
> - Slight regression in schbench in NPS2 and NPS4 mode for large
>   worker count but we did spot some run to run variation with
>   both tip and patched kernel.
> 
> Below are all the detailed numbers for the benchmarks.
> 

Thanks!

I looked through the results but I do not see anything that is very
alarming. Some notes.

o Hackbench with 16 groups on NPS1, that would likely be 640 tasks
  communicating unless other parameters are used. I expect it to be
  variable and it's a heavily overloaded scenario. Initial placement is
  not necessarily critical as migrations are likely to be very high.
  On NPS1, there is going to be random luck given that the latency
  to individual CPUs and the physical topology is hidden.

o NPS2 with 128 workers. That's at the threshold where load is
  potentially evenly split between the two sockets but not perfectly
  split due to migrate-on-wakeup being a little unpredictable. Might
  be worth checking the variability there.

o Same observations for tbench. I looked at my own results for NPS1
  on Zen3 and what I see is that there is a small blip there but
  the mpstat heat map indicates that the nodes are being more evenly
  used than without the patch which is expected.

o STREAM is interesting in that there are large differences between
  10 runs and 100 runs. It indicates that without pinning, STREAM
  can be a bit variable. The problem might be similar to NAS
  as reported in the leader mail with the variability due to commit
  c6f886546cb8 for unknown reasons.

> > 
> >  kernel/sched/fair.c     | 59 ++++++++++++++++++++++++++---------------
> >  kernel/sched/topology.c | 23 ++++++++++------
> >  2 files changed, 53 insertions(+), 29 deletions(-)
> > 
> 
> Please let me know if you would like me to get some additional
> data on the test system.

Other than checking variability, the min, max and range, I don't need
additional data. I suspect in some cases like what I observed with NAS
that there is wide variability for reasons independent of this series.

I'm of the opinion though that your results are not a barrier for
merging. Do you agree?

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour
Posted by K Prateek Nayak 4 years ago
Hello Mel,

Thank you for looking at the results.

On 5/20/2022 3:48 PM, Mel Gorman wrote:
> On Fri, May 20, 2022 at 10:28:02AM +0530, K Prateek Nayak wrote:
>> Hello Mel,
>>
>> We tested the patch series on our systems.
>>
>> tl;dr
>>
>> Results of testing:
>> - Benefits short running Stream tasks in NPS2 and NPS4 mode.
>> - Benefits seen for tbench in NPS1 mode for 8-128 worker count.
>> - Regression in Hackbench with 16 groups in NPS1 mode. A rerun for
>>   same data point suggested run to run variation on patched kernel.
>> - Regression in case of tbench with 32 and 64 workers in NPS2 mode.
>>   Patched kernel however seems to report more stable value for 64
>>   worker count compared to tip.
>> - Slight regression in schbench in NPS2 and NPS4 mode for large
>>   worker count but we did spot some run to run variation with
>>   both tip and patched kernel.
>>
>> Below are all the detailed numbers for the benchmarks.
>>
> Thanks!
>
> I looked through the results but I do not see anything that is very
> alarming. Some notes.
>
> o Hackbench with 16 groups on NPS1, that would likely be 640 tasks
>   communicating unless other parameters are used. I expect it to be
>   variable and it's a heavily overloaded scenario. Initial placement is
>   not necessarily critical as migrations are likely to be very high.
>   On NPS1, there is going to be random luck given that the latency
>   to individual CPUs and the physical topology is hidden.
I agree. On rerun, the numbers are quite close so I don't think it
is a concern currently.
> o NPS2 with 128 workers. That's at the threshold where load is
>   potentially evenly split between the two sockets but not perfectly
>   split due to migrate-on-wakeup being a little unpredictable. Might
>   be worth checking the variability there.

For schbench, following are the stats recorded for 128 workers:

Configuration: NPS2

- tip

Min           : 357.00
Max           : 407.00
Median        : 369.00
AMean         : 376.30
AMean Stddev  : 19.15
AMean CoefVar : 5.09 pct

- NUMA Bal

Min           : 384.00
Max           : 410.00
Median        : 400.50
AMean         : 400.40
AMean Stddev  : 8.36
AMean CoefVar : 2.09 pct


Configuration: NPS4

- tip

Min           : 361.00
Max           : 399.00
Median        : 377.00
AMean         : 377.00
AMean Stddev  : 10.31
AMean CoefVar : 2.73 pct

- NUMA Bal

Min           : 379.00
Max           : 394.00
Median        : 390.50
AMean         : 388.10
AMean Stddev  : 5.55
AMean CoefVar : 1.43 pct

In the above cases, the patched kernel seems to
be giving more stable results compared to the tip.
schbench is run 10 times for each worker count to
gather these statistics.

> o Same observations for tbench. I looked at my own results for NPS1
>   on Zen3 and what I see is that there is a small blip there but
>   the mpstat heat map indicates that the nodes are being more evenly
>   used than without the patch which is expected.
I agree. The task distribution should have improved with the patch.
Following are the stats recorded for the tbench run for 32 and 64
workers.

Configuration: NPS2

o 32 workers

- tip

Min           : 10250.10
Max           : 10721.90
Median        : 10651.00
AMean         : 10541.00
AMean Stddev  : 254.41
AMean CoefVar : 2.41 pct

- NUMA Bal

Min           : 8932.03
Max           : 10065.10
Median        : 9894.89
AMean         : 9630.67
AMean Stddev  : 611.00
AMean CoefVar : 6.34 pct

o 64 workers

- tip

Min           : 16197.20
Max           : 17175.90
Median        : 16291.20
AMean         : 16554.77
AMean Stddev  : 539.97
AMean CoefVar : 3.26 pct

- NUMA Bal

Min           : 14386.80
Max           : 16625.50
Median        : 16441.10
AMean         : 15817.80
AMean Stddev  : 1242.71
AMean CoefVar : 7.86 pct

We are observing tip to be more stable in this case.
tbench is run 3 times for a given worker count
to gather these statistics.
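
With only three runs per data point, the Min, Median and Max lines are the
three samples themselves, so the remaining statistics can be recomputed.
A minimal sketch, assuming "AMean Stddev" is the sample (n-1) standard
deviation and "CoefVar" is stddev divided by mean; summarise() is a
hypothetical helper, and the inputs are the tip numbers for 32 workers.

```python
import statistics


def summarise(samples):
    """Recompute the per-data-point statistics from raw run results."""
    mean = statistics.mean(samples)
    stddev = statistics.stdev(samples)  # sample (n-1) standard deviation
    return {
        "min": min(samples),
        "max": max(samples),
        "median": statistics.median(samples),
        "amean": mean,
        "stddev": stddev,
        "coefvar_pct": stddev / mean * 100.0,
    }


# tbench, NPS2, 32 workers, tip: Min, Median and Max from the list above.
tip_32 = [10250.10, 10651.00, 10721.90]
for name, value in summarise(tip_32).items():
    print(f"{name:12s}: {value:.2f}")
```

Under these assumptions the recomputed AMean, Stddev and CoefVar agree
with the reported 10541.00, 254.41 and 2.41 pct to two decimals.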

> o STREAM is interesting in that there are large differences between
>   10 runs and 100 runs. It indicates that without pinning, STREAM
>   can be a bit variable. The problem might be similar to NAS
>   as reported in the leader mail with the variability due to commit
>   c6f886546cb8 for unknown reasons.
There are some cases of Stream where two Stream threads will be co-located
on the same LLC, which results in a performance drop. I suspect the
patch helps in such situations by getting a better balance much earlier.
>>>  kernel/sched/fair.c     | 59 ++++++++++++++++++++++++++---------------
>>>  kernel/sched/topology.c | 23 ++++++++++------
>>>  2 files changed, 53 insertions(+), 29 deletions(-)
>>>
>> Please let me know if you would like me to get some additional
>> data on the test system.
> Other than checking variability, the min, max and range, I don't need
> additional data. I suspect in some cases like what I observed with NAS
> that there is wide variability for reasons independent of this series.
I've inlined the data above.
> I'm of the opinion though that your results are not a barrier for
> merging. Do you agree?
The results overall look good and shouldn't be a barrier for merging.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards,
Prateek