[v2] sched/numa: Fix the vma scan starving issue

[PATCH v2] sched/numa: Fix the vma scan starving issue

Posted by Yujie Liu 1 year, 6 months ago

Problem statement:
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of
the vma scan might create less Numa page fault information. The
insufficient information makes it harder for the Numa balancer to make
decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of
partial VMAs regardless of PID activity") and commit 84db47ca7146d7
("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found
to bring back part of the performance.

Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system,
a long duration of remote Numa node read was observed by PMU events:
A few cores having ~500MB/s remote memory access for ~20 seconds.
It causes high core-to-core variance and performance penalty. After the
investigation, it is found that many vmas are skipped due to the active
PID check. According to the trace events, in most cases, vma_is_accessed()
returns false because the history access info stored in pids_active
array has been cleared.

Proposal:
The main idea is to adjust vma_is_accessed() to let it return true easier.
Thus compare the diff between mm->numa_scan_seq and
vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold,
scan the vma.

This patch especially helps the cases where there are small number of
threads, like the process-based SPECcpu. Without this patch, if the
SPECcpu process access the vma at the beginning, then sleeps for a long
time, the pid_active array will be cleared. A a result, if this process
is woken up again, it never has a chance to set prot_none anymore.
Because only the first 2 times of access is granted for vma scan:
(current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2
to be worse, no other threads within the task can help set the prot_none.
This causes information lost.

Raghavendra helped test current patch and got the positive result
on the AMD platform:

autonumabench NUMA01
                            base                  patched
Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*

Duration User      380345.36   368252.04
Duration System      1358.89     1156.23
Duration Elapsed     2277.45     2213.25

autonumabench NUMA02

Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*

Duration User        1513.23     1575.48
Duration System         8.33        8.13
Duration Elapsed       28.59       29.71

kernbench

Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*

Duration User       68816.41    67615.74
Duration System     21873.94    22848.08
Duration Elapsed      506.66      504.55

Intel 256 CPUs/2 Sockets:
autonuma benchmark also shows improvements:

                                               v6.10-rc5              v6.10-rc5
                                                                         +patch
Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*

                   v6.10-rc5   v6.10-rc5
                                  +patch
Duration User     1064010.16  1075416.23
Duration System      3307.64     3104.66
Duration Elapsed     4537.54     4604.73

The SPECcpu remote node access issue disappears with the patch applied.

Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
---
v1->v2: Refine the commit log and comments in the code.
---
 kernel/sched/fair.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9057584ec06d..9be6d6f0ab3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3188,6 +3188,15 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
 		return true;
 	}
 
+	/*
+	 * This vma has not been accessed for a while, and if the number
+	 * the threads in the same process is low, which means no other
+	 * threads can help scan this vma, force a vma scan.
+	 */
+	if (READ_ONCE(mm->numa_scan_seq) >
+	   (vma->numab_state->prev_scan_seq + get_nr_threads(current)))
+		return true;
+
 	return false;
 }
 
-- 
2.25.1

Re: [PATCH v2] sched/numa: Fix the vma scan starving issue

Posted by Mel Gorman 1 year, 5 months ago

On Mon, Aug 05, 2024 at 04:22:28PM +0800, Yujie Liu wrote:
> Problem statement:
> Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
> Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of
> the vma scan might create less Numa page fault information. The
> insufficient information makes it harder for the Numa balancer to make
> decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of
> partial VMAs regardless of PID activity") and commit 84db47ca7146d7
> ("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found
> to bring back part of the performance.
> 
> Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system,
> a long duration of remote Numa node read was observed by PMU events:
> A few cores having ~500MB/s remote memory access for ~20 seconds.
> It causes high core-to-core variance and performance penalty. After the
> investigation, it is found that many vmas are skipped due to the active
> PID check. According to the trace events, in most cases, vma_is_accessed()
> returns false because the history access info stored in pids_active
> array has been cleared.
> 
> Proposal:
> The main idea is to adjust vma_is_accessed() to let it return true easier.
> Thus compare the diff between mm->numa_scan_seq and
> vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold,
> scan the vma.
> 
> This patch especially helps the cases where there are small number of
> threads, like the process-based SPECcpu. Without this patch, if the
> SPECcpu process access the vma at the beginning, then sleeps for a long
> time, the pid_active array will be cleared. A a result, if this process
> is woken up again, it never has a chance to set prot_none anymore.
> Because only the first 2 times of access is granted for vma scan:
> (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2
> to be worse, no other threads within the task can help set the prot_none.
> This causes information lost.
> 
> Raghavendra helped test current patch and got the positive result
> on the AMD platform:
> 
> autonumabench NUMA01
>                             base                  patched
> Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
> Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*
> 
> Duration User      380345.36   368252.04
> Duration System      1358.89     1156.23
> Duration Elapsed     2277.45     2213.25
> 
> autonumabench NUMA02
> 
> Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
> Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*
> 
> Duration User        1513.23     1575.48
> Duration System         8.33        8.13
> Duration Elapsed       28.59       29.71
> 
> kernbench
> 
> Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
> Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
> Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*
> 
> Duration User       68816.41    67615.74
> Duration System     21873.94    22848.08
> Duration Elapsed      506.66      504.55
> 
> Intel 256 CPUs/2 Sockets:
> autonuma benchmark also shows improvements:
> 
>                                                v6.10-rc5              v6.10-rc5
>                                                                          +patch
> Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
> Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
> Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
> Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
> Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
> Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
> Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
> Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*
> 
>                    v6.10-rc5   v6.10-rc5
>                                   +patch
> Duration User     1064010.16  1075416.23
> Duration System      3307.64     3104.66
> Duration Elapsed     4537.54     4604.73
> 
> The SPECcpu remote node access issue disappears with the patch applied.
> 
> Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
> Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
> Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
> Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Yujie Liu <yujie.liu@intel.com>

Ok, I didn't exactly replicate the autonuma test results but then again,
I'd be a little surprised it was affected by this issue. The rescan
decision is a bit arbitrary but I see no obviously better alternative
and the patch is fixing an important corner case so

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Sorry for the long delay in reviewing, my backlog for upstream work is
insane :(

-- 
Mel Gorman
SUSE Labs

Re: [PATCH v2] sched/numa: Fix the vma scan starving issue

Posted by Liu, Yujie 1 year, 5 months ago

On Wed, 2024-08-21 at 11:04 +0100, Mel Gorman wrote:
> On Mon, Aug 05, 2024 at 04:22:28PM +0800, Yujie Liu wrote:
> > Problem statement:
> > Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
> > Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of
> > the vma scan might create less Numa page fault information. The
> > insufficient information makes it harder for the Numa balancer to make
> > decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of
> > partial VMAs regardless of PID activity") and commit 84db47ca7146d7
> > ("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found
> > to bring back part of the performance.
> > 
> > Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system,
> > a long duration of remote Numa node read was observed by PMU events:
> > A few cores having ~500MB/s remote memory access for ~20 seconds.
> > It causes high core-to-core variance and performance penalty. After the
> > investigation, it is found that many vmas are skipped due to the active
> > PID check. According to the trace events, in most cases, vma_is_accessed()
> > returns false because the history access info stored in pids_active
> > array has been cleared.
> > 
> > Proposal:
> > The main idea is to adjust vma_is_accessed() to let it return true easier.
> > Thus compare the diff between mm->numa_scan_seq and
> > vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold,
> > scan the vma.
> > 
> > This patch especially helps the cases where there are small number of
> > threads, like the process-based SPECcpu. Without this patch, if the
> > SPECcpu process access the vma at the beginning, then sleeps for a long
> > time, the pid_active array will be cleared. A a result, if this process
> > is woken up again, it never has a chance to set prot_none anymore.
> > Because only the first 2 times of access is granted for vma scan:
> > (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2
> > to be worse, no other threads within the task can help set the prot_none.
> > This causes information lost.
> > 
> > Raghavendra helped test current patch and got the positive result
> > on the AMD platform:
> > 
> > autonumabench NUMA01
> >                             base                  patched
> > Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
> > Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*
> > 
> > Duration User      380345.36   368252.04
> > Duration System      1358.89     1156.23
> > Duration Elapsed     2277.45     2213.25
> > 
> > autonumabench NUMA02
> > 
> > Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
> > Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*
> > 
> > Duration User        1513.23     1575.48
> > Duration System         8.33        8.13
> > Duration Elapsed       28.59       29.71
> > 
> > kernbench
> > 
> > Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
> > Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
> > Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*
> > 
> > Duration User       68816.41    67615.74
> > Duration System     21873.94    22848.08
> > Duration Elapsed      506.66      504.55
> > 
> > Intel 256 CPUs/2 Sockets:
> > autonuma benchmark also shows improvements:
> > 
> >                                                v6.10-rc5              v6.10-rc5
> >                                                                          +patch
> > Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
> > Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
> > Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
> > Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
> > Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
> > Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
> > Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
> > Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*
> > 
> >                    v6.10-rc5   v6.10-rc5
> >                                   +patch
> > Duration User     1064010.16  1075416.23
> > Duration System      3307.64     3104.66
> > Duration Elapsed     4537.54     4604.73
> > 
> > The SPECcpu remote node access issue disappears with the patch applied.
> > 
> > Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
> > Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
> > Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
> > Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Yujie Liu <yujie.liu@intel.com>
> 
> Ok, I didn't exactly replicate the autonuma test results but then again,
> I'd be a little surprised it was affected by this issue. The rescan
> decision is a bit arbitrary but I see no obviously better alternative
> and the patch is fixing an important corner case so
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Sorry for the long delay in reviewing, my backlog for upstream work is
> insane :(

Thanks a lot for your time to review this patch.

Best Regards,
Yujie

Re: [PATCH v2] sched/numa: Fix the vma scan starving issue

Posted by Chen Yu 1 year, 5 months ago

On 2024-08-21 at 11:04:14 +0100, Mel Gorman wrote:
> On Mon, Aug 05, 2024 at 04:22:28PM +0800, Yujie Liu wrote:
> > Problem statement:
> > Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
> > Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of
> > the vma scan might create less Numa page fault information. The
> > insufficient information makes it harder for the Numa balancer to make
> > decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of
> > partial VMAs regardless of PID activity") and commit 84db47ca7146d7
> > ("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found
> > to bring back part of the performance.
> > 
> > Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system,
> > a long duration of remote Numa node read was observed by PMU events:
> > A few cores having ~500MB/s remote memory access for ~20 seconds.
> > It causes high core-to-core variance and performance penalty. After the
> > investigation, it is found that many vmas are skipped due to the active
> > PID check. According to the trace events, in most cases, vma_is_accessed()
> > returns false because the history access info stored in pids_active
> > array has been cleared.
> > 
> > Proposal:
> > The main idea is to adjust vma_is_accessed() to let it return true easier.
> > Thus compare the diff between mm->numa_scan_seq and
> > vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold,
> > scan the vma.
> > 
> > This patch especially helps the cases where there are small number of
> > threads, like the process-based SPECcpu. Without this patch, if the
> > SPECcpu process access the vma at the beginning, then sleeps for a long
> > time, the pid_active array will be cleared. A a result, if this process
> > is woken up again, it never has a chance to set prot_none anymore.
> > Because only the first 2 times of access is granted for vma scan:
> > (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2
> > to be worse, no other threads within the task can help set the prot_none.
> > This causes information lost.
> > 
> > Raghavendra helped test current patch and got the positive result
> > on the AMD platform:
> > 
> > autonumabench NUMA01
> >                             base                  patched
> > Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
> > Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*
> > 
> > Duration User      380345.36   368252.04
> > Duration System      1358.89     1156.23
> > Duration Elapsed     2277.45     2213.25
> > 
> > autonumabench NUMA02
> > 
> > Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
> > Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*
> > 
> > Duration User        1513.23     1575.48
> > Duration System         8.33        8.13
> > Duration Elapsed       28.59       29.71
> > 
> > kernbench
> > 
> > Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
> > Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
> > Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*
> > 
> > Duration User       68816.41    67615.74
> > Duration System     21873.94    22848.08
> > Duration Elapsed      506.66      504.55
> > 
> > Intel 256 CPUs/2 Sockets:
> > autonuma benchmark also shows improvements:
> > 
> >                                                v6.10-rc5              v6.10-rc5
> >                                                                          +patch
> > Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
> > Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
> > Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
> > Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
> > Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
> > Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
> > Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
> > Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*
> > 
> >                    v6.10-rc5   v6.10-rc5
> >                                   +patch
> > Duration User     1064010.16  1075416.23
> > Duration System      3307.64     3104.66
> > Duration Elapsed     4537.54     4604.73
> > 
> > The SPECcpu remote node access issue disappears with the patch applied.
> > 
> > Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
> > Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
> > Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
> > Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Yujie Liu <yujie.liu@intel.com>
> 
> Ok, I didn't exactly replicate the autonuma test results but then again,
> I'd be a little surprised it was affected by this issue. The rescan
> decision is a bit arbitrary but I see no obviously better alternative
> and the patch is fixing an important corner case so
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Sorry for the long delay in reviewing, my backlog for upstream work is
> insane :(
>

Thank you Mel for your time to review this patch.

thanks,
Chenyu