[PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative

Posted by Mel Gorman 2 years, 2 months ago
VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.
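
As a map for the diff below, the control flow of task_numa_work() with
this patch applied is roughly the following (a condensed sketch, not the
literal code):

	vma_pids_skipped = false;
retry_pids:
	for_each_vma(vmi, vma) {
		/* Do not rescan VMAs twice within the same sequence. */
		if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq)
			continue;

		/* The first pass honours the per-task PID activity filter. */
		if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
			vma_pids_skipped = true;
			continue;
		}

		/* ... update PTEs to trap NUMA hinting faults ... */

		vma->numab_state->prev_scan_seq = mm->numa_scan_seq;

		/* A forced scan covers at most one VMA. */
		if (vma_pids_forced)
			break;
	}

	/* Only inactive VMAs remained: retry once without the filter. */
	if (!vma && !vma_pids_forced && vma_pids_skipped) {
		vma_pids_forced = true;
		goto retry_pids;
	}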

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial to the workload or not. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch is as
follows:

6.6.0-rc2-sched-numabtrace-v1
    649 reason=scan_delay
   9094 reason=unsuitable
  48915 reason=shared_ro
 143919 reason=inaccessible
 193050 reason=pid_inactive

6.6.0-rc2-sched-numabselective-v1
    146 reason=seq_completed
    622 reason=ignore_pid_inactive
    624 reason=scan_delay
   6570 reason=unsuitable
  16101 reason=shared_ro
  27608 reason=inaccessible
  41939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase both the number of PTEs updated to trap NUMA hinting
faults and the number of hinting faults themselves, but these represent
PTEs that would otherwise have been missed. The tradeoff is scan+fault
overhead versus improved locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows:

                                               6.6.0-rc2              6.6.0-rc2
                                     sched-numabtrace-v1 sched-numabselective-v1
Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

                   6.6.0-rc2   6.6.0-rc2
                sched-numabtrace-v1 sched-numabselective-v1 /
Duration User       91201.80    63506.64
Duration System      2015.53     1819.78
Duration Elapsed     1234.77      868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows:

                                      6.6.0-rc2      6.6.0-rc2
                            sched-numabtrace-v1 sched-numabselective-v1
Ops NUMA base-page range updates       64272.00    26374386.00
Ops NUMA PTE updates                   36624.00       55538.00
Ops NUMA PMD updates                      54.00       51404.00
Ops NUMA hint faults                   15504.00       75786.00
Ops NUMA hint local faults %           14860.00       56763.00
Ops NUMA hint local percent               95.85          74.90
Ops NUMA pages migrated                 1629.00     6469222.00

Both the number of PTE updates and the number of hint faults are
dramatically increased. While this is superficially unfortunate, it
represents ranges that were simply skipped without the patch. As a
result of the scanning and hinting faults, many more pages were also
migrated, but as the time to completion is reduced, the overhead is
offset by the gain.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_types.h             |  6 +++
 include/linux/sched/numa_balancing.h |  1 +
 include/trace/events/sched.h         |  3 +-
 kernel/sched/fair.c                  | 55 ++++++++++++++++++++++++++--
 4 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8cb1dec3e358..a123c1a58617 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -578,6 +578,12 @@ struct vma_numab_state {
 						 * VMA_PID_RESET_PERIOD
 						 * jiffies.
 						 */
+	int prev_scan_seq;			/* MM scan sequence ID when
+						 * the VMA was last completely
+						 * scanned. A VMA is not
+						 * eligible for scanning if
+						 * prev_scan_seq == numa_scan_seq
+						 */
 };
 
 /*
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 7dcc0bdfddbb..b69afb8630db 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
 	NUMAB_SKIP_SCAN_DELAY,
 	NUMAB_SKIP_PID_INACTIVE,
 	NUMAB_SKIP_IGNORE_PID,
+	NUMAB_SKIP_SEQ_COMPLETED,
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 27b51c81b106..010ba1b7cb0e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
 	EM( NUMAB_SKIP_INACCESSIBLE,		"inaccessible" )	\
 	EM( NUMAB_SKIP_SCAN_DELAY,		"scan_delay" )	\
 	EM( NUMAB_SKIP_PID_INACTIVE,		"pid_inactive" )	\
-	EMe(NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )
+	EM( NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )		\
+	EMe(NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )
 
 /* Redefine for export. */
 #undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 150f01948ec6..72ef60f394ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3175,6 +3175,8 @@ static void task_numa_work(struct callback_head *work)
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
 	struct vma_iterator vmi;
+	bool vma_pids_skipped;
+	bool vma_pids_forced = false;
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -3217,7 +3219,6 @@ static void task_numa_work(struct callback_head *work)
 	 */
 	p->node_stamp += 2 * TICK_NSEC;
 
-	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
 	virtpages = pages * 8;	   /* Scan up to this much virtual space */
@@ -3227,6 +3228,16 @@ static void task_numa_work(struct callback_head *work)
 
 	if (!mmap_read_trylock(mm))
 		return;
+
+	/*
+	 * VMAs are skipped if the current PID has not trapped a fault within
+	 * the VMA recently. Allow scanning to be forced if there is no
+	 * suitable VMA remaining.
+	 */
+	vma_pids_skipped = false;
+
+retry_pids:
+	start = mm->numa_scan_offset;
 	vma_iter_init(&vmi, mm, start);
 	vma = vma_next(&vmi);
 	if (!vma) {
@@ -3277,6 +3288,13 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->pids_active_reset =  vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			/*
+			 * Ensure prev_scan_seq does not match numa_scan_seq
+			 * to prevent VMAs being skipped prematurely on the
+			 * first scan.
+			 */
+			 vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
 		}
 
 		/*
@@ -3298,8 +3316,19 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->pids_active[1] = 0;
 		}
 
-		/* Do not scan the VMA if task has not accessed */
-		if (!vma_is_accessed(mm, vma)) {
+		/* Do not rescan VMAs twice within the same sequence. */
+		if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
+			mm->numa_scan_offset = vma->vm_end;
+			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
+			continue;
+		}
+
+		/*
+		 * Do not scan the VMA if task has not accessed unless no other
+		 * VMA candidate exists.
+		 */
+		if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
+			vma_pids_skipped = true;
 			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
 			continue;
 		}
@@ -3328,8 +3357,28 @@ static void task_numa_work(struct callback_head *work)
 
 			cond_resched();
 		} while (end != vma->vm_end);
+
+		/* VMA scan is complete, do not scan until next sequence. */
+		vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
+
+		/*
+		 * Only force scan within one VMA at a time to limit the
+		 * cost of scanning a potentially uninteresting VMA.
+		 */
+		if (vma_pids_forced)
+			break;
 	} for_each_vma(vmi, vma);
 
+	/*
+	 * If no VMAs are remaining and VMAs were skipped due to the PID
+	 * not accessing the VMA previously then force a scan to ensure
+	 * forward progress.
+	 */
+	if (!vma && !vma_pids_forced && vma_pids_skipped) {
+		vma_pids_forced = true;
+		goto retry_pids;
+	}
+
 out:
 	/*
 	 * It is possible to reach the end of the VMA list but the last few
-- 
2.35.3
Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative
Posted by Ingo Molnar 2 years, 2 months ago
* Mel Gorman <mgorman@techsingularity.net> wrote:

> On a 2-socket Cascade Lake test machine, the time to complete the
> workload is as follows;
> 
>                                                6.6.0-rc2              6.6.0-rc2
>                                      sched-numabtrace-v1 sched-numabselective-v1
> Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
> Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
> Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
> CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
> Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)
> 
> The time to complete the workload is reduced by almost 30%
> 
>                    6.6.0-rc2   6.6.0-rc2
>                 sched-numabtrace-v1 sched-numabselective-v1 /
> Duration User       91201.80    63506.64
> Duration System      2015.53     1819.78
> Duration Elapsed     1234.77      868.37
> 
> In this specific case, system CPU time was not increased but it's not
> universally true.
> 
> From vmstat, the NUMA scanning and fault activity is as follows;
> 
>                                       6.6.0-rc2      6.6.0-rc2
>                             sched-numabtrace-v1 sched-numabselective-v1
> Ops NUMA base-page range updates       64272.00    26374386.00
> Ops NUMA PTE updates                   36624.00       55538.00
> Ops NUMA PMD updates                      54.00       51404.00
> Ops NUMA hint faults                   15504.00       75786.00
> Ops NUMA hint local faults %           14860.00       56763.00
> Ops NUMA hint local percent               95.85          74.90
> Ops NUMA pages migrated                 1629.00     6469222.00
> 
> Both the number of PTE updates and hint faults is dramatically
> increased. While this is superficially unfortunate, it represents
> ranges that were simply skipped without the patch. As a result
> of the scanning and hinting faults, many more pages were also
> migrated but as the time to completion is reduced, the overhead
> is offset by the gain.

Nice! I've applied your series to tip:sched/core with a few non-functional 
edits to comment/changelog formatting/clarity.

Btw., was any previous analysis done on the size of the pids_active[] hash
and the hash collision rate?

64 (BITS_PER_LONG) feels a bit small, especially on larger machines running 
threaded workloads, and the kmalloc of numab_state likely allocates a full 
cacheline anyway, so we could double the hash size from 8 bytes (2x1 longs) 
to 32 bytes (2x2 longs) with very little real cost, and still have a long 
field left to spare?

Thanks,

	Ingo
Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative
Posted by Mel Gorman 2 years, 2 months ago
On Tue, Oct 10, 2023 at 11:23:00AM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On a 2-socket Cascade Lake test machine, the time to complete the
> > workload is as follows;
> > 
> >                                                6.6.0-rc2              6.6.0-rc2
> >                                      sched-numabtrace-v1 sched-numabselective-v1
> > Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
> > Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
> > Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
> > CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
> > Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)
> > 
> > The time to complete the workload is reduced by almost 30%
> > 
> >                    6.6.0-rc2   6.6.0-rc2
> >                 sched-numabtrace-v1 sched-numabselective-v1 /
> > Duration User       91201.80    63506.64
> > Duration System      2015.53     1819.78
> > Duration Elapsed     1234.77      868.37
> > 
> > In this specific case, system CPU time was not increased but it's not
> > universally true.
> > 
> > From vmstat, the NUMA scanning and fault activity is as follows;
> > 
> >                                       6.6.0-rc2      6.6.0-rc2
> >                             sched-numabtrace-v1 sched-numabselective-v1
> > Ops NUMA base-page range updates       64272.00    26374386.00
> > Ops NUMA PTE updates                   36624.00       55538.00
> > Ops NUMA PMD updates                      54.00       51404.00
> > Ops NUMA hint faults                   15504.00       75786.00
> > Ops NUMA hint local faults %           14860.00       56763.00
> > Ops NUMA hint local percent               95.85          74.90
> > Ops NUMA pages migrated                 1629.00     6469222.00
> > 
> > Both the number of PTE updates and hint faults is dramatically
> > increased. While this is superficially unfortunate, it represents
> > ranges that were simply skipped without the patch. As a result
> > of the scanning and hinting faults, many more pages were also
> > migrated but as the time to completion is reduced, the overhead
> > is offset by the gain.
> 
> Nice! I've applied your series to tip:sched/core with a few non-functional 
> edits to comment/changelog formatting/clarity.
> 

Thanks.

> Btw., was any previous analysis done on the size of the pids_active[] hash
> and the hash collision rate?
> 

Not that I'm aware of but I also think it would be difficult to design
something representative in terms of a benchmark. New pids are typically
sequential so most benchmarks are not going to show many collisions
unless the hash algorithm ignores lower bits. Maybe it does; I didn't
actually check the hash algorithm, and if it does, that is likely the patch
justification right there (threads created at similar times would be almost
certain to collide). As it was Peter that suggested the hash, I assumed he
had considered collisions due to lower bits, but that is also lazy on my part.
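
For reference, the hash in question appears to be hash_32() from
include/linux/hash.h, used as hash_32(current->pid, ilog2(BITS_PER_LONG))
when setting and testing pids_active[] bits. It is a multiplicative hash
that keeps the top bits of the product, so the low PID bits are mixed
rather than truncated. A minimal userspace sketch to eyeball where
sequential PIDs land; the constant is copied from the kernel header,
everything else is illustrative only:

	#include <stdint.h>
	#include <stdio.h>

	#define GOLDEN_RATIO_32 0x61C88647u	/* include/linux/hash.h */

	/* hash_32(val, bits): keep the top 'bits' bits of val * ratio */
	static unsigned int hash_32(uint32_t val, unsigned int bits)
	{
		return (val * GOLDEN_RATIO_32) >> (32 - bits);
	}

	int main(void)
	{
		uint32_t pid;

		/* ilog2(BITS_PER_LONG) == 6 on 64-bit: 64 possible slots */
		for (pid = 1234; pid < 1242; pid++)
			printf("pid %u -> bit %u\n", pid, hash_32(pid, 6));
		return 0;
	}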

If lower bits are used then it would pose the question -- does it
matter? The intent of the bitmap is for threads to prefer updating PTEs
within task-active VMAs, but ultimately all VMAs should be scanned anyway,
so some wasted work is unavoidable. While collisions may occur, it's still
better than scanning within VMAs that are definitely *not* of interest. It
suggests that a sensible direction would be to scan in passes, like load
balancing uses fbq_type in find_busiest_queue() to filter what types of
tasks should be considered for moving. So maybe the passes would look like:

	1. Task-active
	2. Multiple tasks active
	3. Any task active
	4. Inactive

The objective would be that PTE updates are as relevant as possible
and hopefully by the time only inactive VMAs are considered, there is a
relatively small amount of wasted work.
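
A sketch of what such a pass filter could look like; none of these
identifiers exist in the kernel and the whole thing is illustrative only:

	#include <stdbool.h>

	/* Hypothetical per-VMA activity summary. */
	struct vma_activity {
		bool current_active;	/* this task faulted here recently */
		int  nr_active_pids;	/* tasks with recent faults here */
	};

	/* Passes ordered from most to least relevant, like fbq_type. */
	enum scan_pass {
		PASS_TASK_ACTIVE,	/* 1. task-active */
		PASS_MULTI_ACTIVE,	/* 2. multiple tasks active */
		PASS_ANY_ACTIVE,	/* 3. any task active */
		PASS_INACTIVE,		/* 4. inactive */
	};

	static bool pass_admits(const struct vma_activity *a, enum scan_pass p)
	{
		switch (p) {
		case PASS_TASK_ACTIVE:	return a->current_active;
		case PASS_MULTI_ACTIVE: return a->nr_active_pids > 1;
		case PASS_ANY_ACTIVE:	return a->nr_active_pids > 0;
		case PASS_INACTIVE:	return true;	/* last resort */
		}
		return false;
	}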

> 64 (BITS_PER_LONG) feels a bit small, especially on larger machines running 
> threaded workloads, and the kmalloc of numab_state likely allocates a full 
> cacheline anyway, so we could double the hash size from 8 bytes (2x1 longs) 
> to 32 bytes (2x2 longs) with very little real cost, and still have a long 
> field left to spare?
> 

You're right, we could and it's relatively cheap. I would worry that, as
the storage overhead is per-VMA, workloads on large machines may also have
lots of VMAs that are not necessarily using threads. As I would struggle
to provide supporting data justifying the change, I would also be hesitant
to try merging it, because if I was reviewing the patch for someone else,
the first question I would ask is "is there any performance benefit that
you can show?". I would expect the first patch to provide some telemetry
and the second patch to use it as justification.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative
Posted by Ingo Molnar 2 years, 2 months ago
* Mel Gorman <mgorman@techsingularity.net> wrote:

> > 64 (BITS_PER_LONG) feels a bit small, especially on larger machines running 
> > threaded workloads, and the kmalloc of numab_state likely allocates a full 
> > cacheline anyway, so we could double the hash size from 8 bytes (2x1 longs) 
                                                            ^--- 16 bytes
> > to 32 bytes (2x2 longs) with very little real cost, and still have a long 
> > field left to spare?
> > 
> 
> You're right, we could and it's relatively cheap. I would worry that as 
> the storage overhead is per-VMA then workloads for large machines may 
> also have lots of VMAs that are not necessarily using threads.

So I think there would be *zero* extra per-vma storage overhead, because 
vma->numab_state is a pointer, with the structure kmalloc() allocated, 
which should round the allocation to cacheline granularity anyway: 64 bytes 
on NUMA systems that matter.

So with the current size of 'struct vma_numab_state' of 36 bytes, we can 
extend it by 16 bytes with zero additional storage cost.

And since there's no cost, and fewer hash collisions are always good, the 
change wouldn't need any other justification. :-)

[ Plus the resulting abstraction for the definition of a larger bitmask 
  would probably make future extensions easier. ]
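
As a concrete sketch of the suggestion (hypothetical, not an actual
patch): widening pids_active[] from 2x1 to 2x2 unsigned longs gives 128
hash slots per window while the fields still fit one 64-byte cacheline:

	#define VMA_PID_WORDS	2	/* 2 longs = 128 bits per window */

	struct vma_numab_state {
		unsigned long next_scan;			/*  8 bytes */
		unsigned long pids_active_reset;		/*  8 bytes */
		unsigned long pids_active[2][VMA_PID_WORDS];	/* 32 bytes */
		int prev_scan_seq;				/*  4 bytes */
	};

	/* PIDs would then hash into ilog2(128) == 7 bits of slot index. */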

But ... it was just a suggestion.

Thanks,

	Ingo
[tip: sched/core] sched/numa: Complete scanning of inactive VMAs when there is no alternative
Posted by tip-bot2 for Mel Gorman 2 years, 2 months ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f169c62ff7cd1acf8bac8ae17bfeafa307d9e6fa
Gitweb:        https://git.kernel.org/tip/f169c62ff7cd1acf8bac8ae17bfeafa307d9e6fa
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Tue, 10 Oct 2023 09:31:43 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 10 Oct 2023 23:42:15 +02:00

sched/numa: Complete scanning of inactive VMAs when there is no alternative

VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial to the workload or not. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

	6.6.0-rc2-sched-numabtrace-v1
	=============================
	    649 reason=scan_delay
	  9,094 reason=unsuitable
	 48,915 reason=shared_ro
	143,919 reason=inaccessible
	193,050 reason=pid_inactive

	6.6.0-rc2-sched-numabselective-v1
	=============================
	    146 reason=seq_completed
	    622 reason=ignore_pid_inactive

	    624 reason=scan_delay
	  6,570 reason=unsuitable
	 16,101 reason=shared_ro
	 27,608 reason=inaccessible
	 41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase both the number of PTEs updated to trap NUMA hinting
faults and the number of hinting faults themselves, but these represent
PTEs that would otherwise have been missed. The tradeoff is scan+fault
overhead versus improved locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows:

                                                 6.6.0-rc2              6.6.0-rc2
                                       sched-numabtrace-v1 sched-numabselective-v1
  Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
  Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
  Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
  CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
  Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

                     6.6.0-rc2   6.6.0-rc2
                  sched-numabtrace-v1 sched-numabselective-v1 /
  Duration User       91201.80    63506.64
  Duration System      2015.53     1819.78
  Duration Elapsed     1234.77      868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows:

                                        6.6.0-rc2      6.6.0-rc2
                              sched-numabtrace-v1 sched-numabselective-v1
  Ops NUMA base-page range updates       64272.00    26374386.00
  Ops NUMA PTE updates                   36624.00       55538.00
  Ops NUMA PMD updates                      54.00       51404.00
  Ops NUMA hint faults                   15504.00       75786.00
  Ops NUMA hint local faults %           14860.00       56763.00
  Ops NUMA hint local percent               95.85          74.90
  Ops NUMA pages migrated                 1629.00     6469222.00

Both the number of PTE updates and the number of hint faults are
dramatically increased. While this is superficially unfortunate, it
represents ranges that were simply skipped without the patch. As a
result of the scanning and hinting faults, many more pages were also
migrated, but as the time to completion is reduced, the overhead is
offset by the gain.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net
---
 include/linux/mm_types.h             |  6 +++-
 include/linux/sched/numa_balancing.h |  1 +-
 include/trace/events/sched.h         |  3 +-
 kernel/sched/fair.c                  | 55 +++++++++++++++++++++++++--
 4 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e7571ec..589f31e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -575,6 +575,12 @@ struct vma_numab_state {
 	 * every VMA_PID_RESET_PERIOD jiffies:
 	 */
 	unsigned long pids_active[2];
+
+	/*
+	 * MM scan sequence ID when the VMA was last completely scanned.
+	 * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
+	 */
+	int prev_scan_seq;
 };
 
 /*
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 7dcc0bd..b69afb8 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
 	NUMAB_SKIP_SCAN_DELAY,
 	NUMAB_SKIP_PID_INACTIVE,
 	NUMAB_SKIP_IGNORE_PID,
+	NUMAB_SKIP_SEQ_COMPLETED,
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index bfc07c1..6188ad0 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
 	EM( NUMAB_SKIP_INACCESSIBLE,		"inaccessible" )	\
 	EM( NUMAB_SKIP_SCAN_DELAY,		"scan_delay" )	\
 	EM( NUMAB_SKIP_PID_INACTIVE,		"pid_inactive" )	\
-	EMe(NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )
+	EM( NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )		\
+	EMe(NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )
 
 /* Redefine for export. */
 #undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab79013..9229051 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3158,6 +3158,8 @@ static void task_numa_work(struct callback_head *work)
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
 	struct vma_iterator vmi;
+	bool vma_pids_skipped;
+	bool vma_pids_forced = false;
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -3200,7 +3202,6 @@ static void task_numa_work(struct callback_head *work)
 	 */
 	p->node_stamp += 2 * TICK_NSEC;
 
-	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
 	virtpages = pages * 8;	   /* Scan up to this much virtual space */
@@ -3210,6 +3211,16 @@ static void task_numa_work(struct callback_head *work)
 
 	if (!mmap_read_trylock(mm))
 		return;
+
+	/*
+	 * VMAs are skipped if the current PID has not trapped a fault within
+	 * the VMA recently. Allow scanning to be forced if there is no
+	 * suitable VMA remaining.
+	 */
+	vma_pids_skipped = false;
+
+retry_pids:
+	start = mm->numa_scan_offset;
 	vma_iter_init(&vmi, mm, start);
 	vma = vma_next(&vmi);
 	if (!vma) {
@@ -3260,6 +3271,13 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->pids_active_reset =  vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			/*
+			 * Ensure prev_scan_seq does not match numa_scan_seq,
+			 * to prevent VMAs being skipped prematurely on the
+			 * first scan:
+			 */
+			 vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
 		}
 
 		/*
@@ -3281,8 +3299,19 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->pids_active[1] = 0;
 		}
 
-		/* Do not scan the VMA if task has not accessed */
-		if (!vma_is_accessed(mm, vma)) {
+		/* Do not rescan VMAs twice within the same sequence. */
+		if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
+			mm->numa_scan_offset = vma->vm_end;
+			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
+			continue;
+		}
+
+		/*
+		 * Do not scan the VMA if task has not accessed it, unless no other
+		 * VMA candidate exists.
+		 */
+		if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
+			vma_pids_skipped = true;
 			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
 			continue;
 		}
@@ -3311,8 +3340,28 @@ static void task_numa_work(struct callback_head *work)
 
 			cond_resched();
 		} while (end != vma->vm_end);
+
+		/* VMA scan is complete, do not scan until next sequence. */
+		vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
+
+		/*
+		 * Only force scan within one VMA at a time, to limit the
+		 * cost of scanning a potentially uninteresting VMA.
+		 */
+		if (vma_pids_forced)
+			break;
 	} for_each_vma(vmi, vma);
 
+	/*
+	 * If no VMAs are remaining and VMAs were skipped due to the PID
+	 * not accessing the VMA previously, then force a scan to ensure
+	 * forward progress:
+	 */
+	if (!vma && !vma_pids_forced && vma_pids_skipped) {
+		vma_pids_forced = true;
+		goto retry_pids;
+	}
+
 out:
 	/*
 	 * It is possible to reach the end of the VMA list but the last few