Use the hierarchical tree counter approximation (hpcc) to implement the
OOM killer task selection with a 2-pass algorithm. The first pass
selects the process that has the highest badness points approximation,
and the second pass compares each process using the current max badness
points approximation.

The second pass uses an approximate comparison to eliminate all processes
which are below the current max badness points approximation accuracy
range.

Summing the per-CPU counters to calculate the precise badness of tasks
is only required for tasks with an approximate badness within the
accuracy range of the current max points value.

Limit to 16 the maximum number of badness sums allowed for an OOM killer
task selection before falling back to the approximated comparison. This
ensures bounded execution time for scenarios where many tasks have
badness within the accuracy of the maximum badness approximation.
Testing the execution time of select_bad_process() with a single
tail -f /dev/zero:

AMD EPYC 9654 96-Core (2 sockets)
Within a KVM, configured with 256 logical cpus.

                                  | precise sum | hpcc     |
----------------------------------|-------------|----------|
nr_processes=40                   |      0.5 ms |   0.3 ms |
nr_processes=10000                |     80.0 ms |   7.9 ms |

Tested with the following script:
#!/bin/sh

for a in $(seq 1 10); do (tail /dev/zero &); done
sleep 5
for a in $(seq 1 10); do (tail /dev/zero &); done
sleep 2
for a in $(seq 1 10); do (tail /dev/zero &); done
echo "Waiting for tasks to finish"
wait
Results: OOM kill order on a 128GB memory system
================================================

* systemd and sd-pam are chosen first due to their oom_score_adj:100:

Out of memory: Killed process 3502 (systemd) total-vm:20096kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:72kB oom_score_adj:100
Out of memory: Killed process 3503 ((sd-pam)) total-vm:21432kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:76kB oom_score_adj:100

* The first batch of 10 processes are gradually killed, consecutively
picking the one that uses the most memory. The fact that we are
freeing memory from the previous processes increases the threshold
at which the remaining processes of that group are killed. Processes
from the second and third batches of 10 processes have time to start
before we complete killing the first 10 processes:

Out of memory: Killed process 3703 (tail) total-vm:6591280kB, anon-rss:6578176kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:12936kB oom_score_adj:0
Out of memory: Killed process 3705 (tail) total-vm:6731716kB, anon-rss:6709248kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:13212kB oom_score_adj:0
Out of memory: Killed process 3707 (tail) total-vm:6977216kB, anon-rss:6946816kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:13692kB oom_score_adj:0
Out of memory: Killed process 3699 (tail) total-vm:7205640kB, anon-rss:7184384kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14136kB oom_score_adj:0
Out of memory: Killed process 3713 (tail) total-vm:7463204kB, anon-rss:7438336kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14644kB oom_score_adj:0
Out of memory: Killed process 3701 (tail) total-vm:7739204kB, anon-rss:7716864kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:15180kB oom_score_adj:0
Out of memory: Killed process 3709 (tail) total-vm:8050176kB, anon-rss:8028160kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:15792kB oom_score_adj:0
Out of memory: Killed process 3711 (tail) total-vm:8362236kB, anon-rss:8339456kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:16404kB oom_score_adj:0
Out of memory: Killed process 3715 (tail) total-vm:8649360kB, anon-rss:8634368kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:16972kB oom_score_adj:0
Out of memory: Killed process 3697 (tail) total-vm:8951788kB, anon-rss:8929280kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:17560kB oom_score_adj:0

* Even though there is a 2 second delay between the 2nd and 3rd
batches, those appear to execute in mixed order. Therefore, let's
consider them as a single batch of 20 processes. We are hitting OOM at
a lower memory threshold because at this point the 20 remaining
processes are running rather than the previous 10. The process with
the highest memory usage is selected for OOM, thus making room for the
remaining processes so they can use more memory before they fill the
available memory. This explains why the memory use of the selected
processes gradually increases, until all system memory is used by the
last one:

Out of memory: Killed process 3731 (tail) total-vm:7089868kB, anon-rss:7077888kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:13912kB oom_score_adj:0
Out of memory: Killed process 3721 (tail) total-vm:7417248kB, anon-rss:7405568kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14556kB oom_score_adj:0
Out of memory: Killed process 3729 (tail) total-vm:7795864kB, anon-rss:7766016kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:15300kB oom_score_adj:0
Out of memory: Killed process 3723 (tail) total-vm:8259620kB, anon-rss:8224768kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:16208kB oom_score_adj:0
Out of memory: Killed process 3737 (tail) total-vm:8695984kB, anon-rss:8667136kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:17060kB oom_score_adj:0
Out of memory: Killed process 3735 (tail) total-vm:9295980kB, anon-rss:9265152kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:18240kB oom_score_adj:0
Out of memory: Killed process 3727 (tail) total-vm:9907900kB, anon-rss:9895936kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:19428kB oom_score_adj:0
Out of memory: Killed process 3719 (tail) total-vm:10631248kB, anon-rss:10600448kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:20844kB oom_score_adj:0
Out of memory: Killed process 3733 (tail) total-vm:11341720kB, anon-rss:11321344kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:22232kB oom_score_adj:0
Out of memory: Killed process 3725 (tail) total-vm:12348124kB, anon-rss:12320768kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:24204kB oom_score_adj:0
Out of memory: Killed process 3759 (tail) total-vm:12978888kB, anon-rss:12967936kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:25440kB oom_score_adj:0
Out of memory: Killed process 3751 (tail) total-vm:14386412kB, anon-rss:14352384kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:28196kB oom_score_adj:0
Out of memory: Killed process 3741 (tail) total-vm:16153168kB, anon-rss:16130048kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:31652kB oom_score_adj:0
Out of memory: Killed process 3753 (tail) total-vm:18414856kB, anon-rss:18391040kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:36076kB oom_score_adj:0
Out of memory: Killed process 3745 (tail) total-vm:21389456kB, anon-rss:21356544kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:41904kB oom_score_adj:0
Out of memory: Killed process 3747 (tail) total-vm:25659348kB, anon-rss:25632768kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:50260kB oom_score_adj:0
Out of memory: Killed process 3755 (tail) total-vm:32030820kB, anon-rss:32006144kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:62720kB oom_score_adj:0
Out of memory: Killed process 3743 (tail) total-vm:42648456kB, anon-rss:42614784kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:83504kB oom_score_adj:0
Out of memory: Killed process 3757 (tail) total-vm:63971028kB, anon-rss:63938560kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:125228kB oom_score_adj:0
Out of memory: Killed process 3749 (tail) total-vm:127799660kB, anon-rss:127778816kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:250140kB oom_score_adj:0
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
---
Changes since v14:
- This change becomes an OOM killer latency reduction (performance
improvement).
Changes since v12:
- The oom_task_origin case needs to set points_min rather than the
  points variable now, otherwise points_min is used without being
  initialized.
Changes since v11:
- get_mm_counter_sum() returns a precise sum.
- Use unsigned long type rather than unsigned int for accuracy.
- Use precise sum min/max calculation to compare the chosen vs
current points.
- The first pass finds the maximum task's min points. The
second pass eliminates all tasks for which the max points
are below the currently chosen min points, and uses a precise
sum to validate the candidates which are possibly in range.
---
fs/proc/base.c | 2 +-
include/linux/mm.h | 34 +++++++++++++-----
include/linux/oom.h | 11 +++++-
mm/oom_kill.c | 84 +++++++++++++++++++++++++++++++++++++--------
4 files changed, 106 insertions(+), 25 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4eec684baca9..d75d0ce97032 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -589,7 +589,7 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
unsigned long points = 0;
long badness;
- badness = oom_badness(task, totalpages);
+ badness = oom_badness(task, totalpages, false, NULL, NULL);
/*
* Special case OOM_SCORE_ADJ_MIN for all others scale the
* badness value into [0, 2000] range which we have been
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 10d62e8f35e7..298b8223fbfb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2855,14 +2855,28 @@ static inline struct percpu_counter_tree_level_item *get_rss_stat_items(struct m
/*
* per-process(per-mm_struct) statistics.
*/
+static inline unsigned long __get_mm_counter(struct mm_struct *mm, int member, bool approximate,
+ unsigned long *accuracy_under, unsigned long *accuracy_over)
+{
+ if (approximate) {
+ if (accuracy_under && accuracy_over) {
+ percpu_counter_tree_approximate_accuracy_range(&mm->rss_stat[member],
+ accuracy_under, accuracy_over);
+ }
+ return percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member]);
+ } else {
+ return percpu_counter_tree_precise_sum_positive(&mm->rss_stat[member]);
+ }
+}
+
static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
- return percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member]);
+ return __get_mm_counter(mm, member, true, NULL, NULL);
}
static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
{
- return percpu_counter_tree_precise_sum_positive(&mm->rss_stat[member]);
+ return __get_mm_counter(mm, member, false, NULL, NULL);
}
void mm_trace_rss_stat(struct mm_struct *mm, int member);
@@ -2903,18 +2917,22 @@ static inline int mm_counter(struct folio *folio)
return mm_counter_file(folio);
}
+static inline unsigned long __get_mm_rss(struct mm_struct *mm, bool approximate,
+ unsigned long *accuracy_under, unsigned long *accuracy_over)
+{
+ return __get_mm_counter(mm, MM_FILEPAGES, approximate, accuracy_under, accuracy_over) +
+ __get_mm_counter(mm, MM_ANONPAGES, approximate, accuracy_under, accuracy_over) +
+ __get_mm_counter(mm, MM_SHMEMPAGES, approximate, accuracy_under, accuracy_over);
+}
+
static inline unsigned long get_mm_rss(struct mm_struct *mm)
{
- return get_mm_counter(mm, MM_FILEPAGES) +
- get_mm_counter(mm, MM_ANONPAGES) +
- get_mm_counter(mm, MM_SHMEMPAGES);
+ return __get_mm_rss(mm, true, NULL, NULL);
}
static inline unsigned long get_mm_rss_sum(struct mm_struct *mm)
{
- return get_mm_counter_sum(mm, MM_FILEPAGES) +
- get_mm_counter_sum(mm, MM_ANONPAGES) +
- get_mm_counter_sum(mm, MM_SHMEMPAGES);
+ return __get_mm_rss(mm, false, NULL, NULL);
}
static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7b02bc1d0a7e..f8e5bfaf7b39 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -48,6 +48,12 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
long chosen_points;
+ bool approximate;
+ /*
+ * Number of precise badness points sums performed by this task
+ * selection.
+ */
+ int nr_precise;
/* Used to print the constraint info. */
enum oom_constraint constraint;
@@ -97,7 +103,10 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
}
long oom_badness(struct task_struct *p,
- unsigned long totalpages);
+ unsigned long totalpages,
+ bool approximate,
+ unsigned long *accuracy_under,
+ unsigned long *accuracy_over);
extern bool out_of_memory(struct oom_control *oc);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 214cb8cb939b..93ae14c278e9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -53,6 +53,14 @@
#define CREATE_TRACE_POINTS
#include <trace/events/oom.h>
+/*
+ * Maximum number of badness sums allowed before using an approximated
+ * comparison. This ensures bounded execution time for scenarios where
+ * many tasks have badness within the accuracy of the maximum badness
+ * approximation.
+ */
+static int max_precise_badness_sums = 16;
+
static int sysctl_panic_on_oom;
static int sysctl_oom_kill_allocating_task;
static int sysctl_oom_dump_tasks = 1;
@@ -194,12 +202,16 @@ static bool should_dump_unreclaim_slab(void)
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
* @totalpages: total present RAM allowed for page allocation
+ * @approximate: whether the value can be approximated
+ * @accuracy_under: accuracy of the badness value approximation (under value)
+ * @accuracy_over: accuracy of the badness value approximation (over value)
*
* The heuristic for determining which task to kill is made to be as simple and
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages, bool approximate,
+ unsigned long *accuracy_under, unsigned long *accuracy_over)
{
long points;
long adj;
@@ -228,7 +240,8 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
* The baseline for the badness score is the proportion of RAM that each
* task's rss, pagetable and swap space use.
*/
- points = get_mm_rss_sum(p->mm) + get_mm_counter_sum(p->mm, MM_SWAPENTS) +
+ points = __get_mm_rss(p->mm, approximate, accuracy_under, accuracy_over) +
+ __get_mm_counter(p->mm, MM_SWAPENTS, approximate, accuracy_under, accuracy_over) +
mm_pgtables_bytes(p->mm) / PAGE_SIZE;
task_unlock(p);
@@ -309,7 +322,8 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
- long points;
+ unsigned long accuracy_under = 0, accuracy_over = 0;
+ long points, points_min, points_max;
if (oom_unkillable_task(task))
goto next;
@@ -335,20 +349,47 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
* killed first if it triggers an oom, then select it.
*/
if (oom_task_origin(task)) {
- points = LONG_MAX;
+ points_min = LONG_MAX;
goto select;
}
- points = oom_badness(task, oc->totalpages);
- if (points == LONG_MIN || points < oc->chosen_points)
- goto next;
+ points = oom_badness(task, oc->totalpages, true, &accuracy_under, &accuracy_over);
+ if (points != LONG_MIN) {
+ percpu_counter_tree_approximate_min_max_range(points,
+ accuracy_under, accuracy_over,
+ &points_min, &points_max);
+ }
+ if (oc->approximate) {
+ /*
+ * Keep the process which has the highest minimum
+ * possible points value based on approximation.
+ */
+ if (points == LONG_MIN || points_min < oc->chosen_points)
+ goto next;
+ } else {
+ /*
+ * Eliminate processes which are certainly below the
+ * chosen points minimum possible value with an
+ * approximation.
+ */
+ if (points == LONG_MIN || (long)(points_max - oc->chosen_points) < 0)
+ goto next;
+
+ if (oc->nr_precise < max_precise_badness_sums) {
+ oc->nr_precise++;
+ /* Precise evaluation. */
+ points_min = points_max = points = oom_badness(task, oc->totalpages, false, NULL, NULL);
+ if (points == LONG_MIN || (long)(points - oc->chosen_points) < 0)
+ goto next;
+ }
+ }
select:
if (oc->chosen)
put_task_struct(oc->chosen);
get_task_struct(task);
oc->chosen = task;
- oc->chosen_points = points;
+ oc->chosen_points = points_min;
next:
return 0;
abort:
@@ -358,14 +399,8 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
return 1;
}
-/*
- * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
- */
-static void select_bad_process(struct oom_control *oc)
+static void select_bad_process_iter(struct oom_control *oc)
{
- oc->chosen_points = LONG_MIN;
-
if (is_memcg_oom(oc))
mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
else {
@@ -379,6 +414,25 @@ static void select_bad_process(struct oom_control *oc)
}
}
+/*
+ * Simple selection loop. We choose the process with the highest number of
+ * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ */
+static void select_bad_process(struct oom_control *oc)
+{
+ oc->chosen_points = LONG_MIN;
+ oc->nr_precise = 0;
+
+ /* Approximate scan. */
+ oc->approximate = true;
+ select_bad_process_iter(oc);
+ if (oc->chosen == (void *)-1UL)
+ return;
+ /* Precise scan. */
+ oc->approximate = false;
+ select_bad_process_iter(oc);
+}
+
static int dump_task(struct task_struct *p, void *arg)
{
struct oom_control *oc = arg;
--
2.39.5
On Wed 14-01-26 09:59:15, Mathieu Desnoyers wrote:
> Use the hierarchical tree counter approximation (hpcc) to implement the
> OOM killer task selection with a 2-pass algorithm. The first pass
> selects the process that has the highest badness points approximation,
> and the second pass compares each process using the current max badness
> points approximation.
> 
> The second pass uses an approximate comparison to eliminate all processes
> which are below the current max badness points approximation accuracy
> range.
> 
> Summing the per-CPU counters to calculate the precise badness of tasks
> is only required for tasks with an approximate badness within the
> accuracy range of the current max points value.
> 
> Limit to 16 the maximum number of badness sums allowed for an OOM killer
> task selection before falling back to the approximated comparison. This
> ensures bounded execution time for scenarios where many tasks have
> badness within the accuracy of the maximum badness approximation.
> 
> Testing the execution time of select_bad_process() with a single
> tail -f /dev/zero:
> 
> AMD EPYC 9654 96-Core (2 sockets)
> Within a KVM, configured with 256 logical cpus.
> 
>                                   | precise sum | hpcc     |
> ----------------------------------|-------------|----------|
> nr_processes=40                   |      0.5 ms |   0.3 ms |
> nr_processes=10000                |     80.0 ms |   7.9 ms |
> 
> Tested with the following script:

I am confused by these numbers. Are you saying that 2 passes over all
tasks, evaluating all of them, are 10 times faster than a single pass
with an exact sum of pcp counters?

> 
> #!/bin/sh
> 
> for a in $(seq 1 10); do (tail /dev/zero &); done
> sleep 5
> for a in $(seq 1 10); do (tail /dev/zero &); done
> sleep 2
> for a in $(seq 1 10); do (tail /dev/zero &); done
> echo "Waiting for tasks to finish"
> wait
> 
> Results: OOM kill order on a 128GB memory system
> ================================================

I find this section confusing as well. Is that a before/after
comparison? If yes, it would be great to call out the explicit behavior
before and after.

My overall impression is that the implementation is really involved and
at this moment I do not really see a big benefit of all the complexity.

It would help to explicitly mention what is the overall imprecision of
the oom victim selection with the new data structure (maybe this is
good enough[*]). What if we go with exact precision with the new data
structure compared to the original pcp counters.

[*] please keep in mind that oom victim selection is by no means an
exact science, we try to pick up a task that is likely to free up some
memory to unlock the system from memory depletion. We want that to be a
big memory consumer to reduce the number of tasks to kill and we want
to roughly apply oom_score_adj.
-- 
Michal Hocko
SUSE Labs
On 2026-01-14 12:06, Michal Hocko wrote:
> On Wed 14-01-26 09:59:15, Mathieu Desnoyers wrote:
>> Use the hierarchical tree counter approximation (hpcc) to implement the
>> OOM killer task selection with a 2-pass algorithm. The first pass
>> selects the process that has the highest badness points approximation,
>> and the second pass compares each process using the current max badness
>> points approximation.
>>
>> The second pass uses an approximate comparison to eliminate all processes
>> which are below the current max badness points approximation accuracy
>> range.
>>
>> Summing the per-CPU counters to calculate the precise badness of tasks
>> is only required for tasks with an approximate badness within the
>> accuracy range of the current max points value.
>>
>> Limit to 16 the maximum number of badness sums allowed for an OOM killer
>> task selection before falling back to the approximated comparison. This
>> ensures bounded execution time for scenarios where many tasks have
>> badness within the accuracy of the maximum badness approximation.
>>
>> Testing the execution time of select_bad_process() with a single
>> tail -f /dev/zero:
>>
>> AMD EPYC 9654 96-Core (2 sockets)
>> Within a KVM, configured with 256 logical cpus.
>>
>>                                   | precise sum | hpcc     |
>> ----------------------------------|-------------|----------|
>> nr_processes=40                   |      0.5 ms |   0.3 ms |
>> nr_processes=10000                |     80.0 ms |   7.9 ms |
>>
>> Tested with the following script:
> 
> I am confused by these numbers. Are you saying that 2 passes over all
> tasks, evaluating all of them, are 10 times faster than a single pass
> with an exact sum of pcp counters?

Yes, that's correct. This is because we can use the approximate value
and the min/max possible precise sum ranges to eliminate most tasks, so
we only have to do a precise sum on very few tasks.

So we're basically going from a 1-pass algorithm doing a precise sum
for all tasks to 2 passes doing an approximated sum for all tasks, and
only precise sums for the very few tasks that happen to be in the
vicinity of the task chosen by the 1st pass.

We could probably do something more clever and do all this within a
single pass if we keep track of the possible candidates in a separate
list, and then only iterate on those candidates to do a precise sum at
the end of the pass. But I chose to keep things simple, hence the
2-pass algo.

> 
>> 
>> #!/bin/sh
>> 
>> for a in $(seq 1 10); do (tail /dev/zero &); done
>> sleep 5
>> for a in $(seq 1 10); do (tail /dev/zero &); done
>> sleep 2
>> for a in $(seq 1 10); do (tail /dev/zero &); done
>> echo "Waiting for tasks to finish"
>> wait
>> 
>> Results: OOM kill order on a 128GB memory system
>> ================================================
> 
> I find this section confusing as well. Is that a before/after
> comparison? If yes, it would be great to call out the explicit
> behavior before and after.

Not really. I'm just testing the "after" to make sure we get the
expected killing order.

> 
> My overall impression is that the implementation is really involved and
> at this moment I do not really see a big benefit of all the complexity.

Note that we can get the proc ABI RSS accuracy improvements with the
previous 2 patches without this 2-pass algo. Do you see more value in
the RSS accuracy improvements than in the oom killer latency reduction?

> 
> It would help to explicitly mention what is the overall imprecision of
> the oom victim selection with the new data structure (maybe this is
> good enough[*]). What if we go with exact precision with the new data
> structure compared to the original pcp counters.

Do you mean comparing using approximate sums with the new data
structure (which has a bounded accuracy of O(nr_cpus*log(nr_cpus)))
compared to the old data structure, which had an inaccuracy of
O(nr_cpus^2)? So if the inaccuracy provided by the new data structure
is good enough for OOM task selection, we could go from the precise sum
back to an approximation and just use that with the new data structure.

Thanks,

Mathieu

> 
> [*] please keep in mind that oom victim selection is by no means an
> exact science, we try to pick up a task that is likely to free up some
> memory to unlock the system from memory depletion. We want that to be a
> big memory consumer to reduce the number of tasks to kill and we want
> to roughly apply oom_score_adj.

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Wed 14-01-26 14:36:44, Mathieu Desnoyers wrote:
> On 2026-01-14 12:06, Michal Hocko wrote:
> > On Wed 14-01-26 09:59:15, Mathieu Desnoyers wrote:
[...]

Thanks for those clarifications.

> > My overall impression is that the implementation is really involved and
> > at this moment I do not really see a big benefit of all the complexity.
> 
> Note that we can get the proc ABI RSS accuracy improvements with the
> previous 2 patches without this 2-pass algo. Do you see more value in
> the RSS accuracy improvements than in the oom killer latency reduction?

Yes, TBH I do not see oom latency as a big problem. As already
mentioned, this is a slow path and we are not talking about a huge
latency anyway. proc numbers are much more sensitive to latency as
they are regularly read by user space tools and accuracy for those
matters as well (being off by 100s of MB or GBs is simply making those
numbers completely bogus).

> > It would help to explicitly mention what is the overall imprecision of
> > the oom victim selection with the new data structure (maybe this is
> > good enough[*]). What if we go with exact precision with the new data
> > structure compared to the original pcp counters.
> 
> Do you mean comparing using approximate sums with the new data
> structure (which has a bounded accuracy of O(nr_cpus*log(nr_cpus)))
> compared to the old data structure, which had an inaccuracy of
> O(nr_cpus^2)? So if the inaccuracy provided by the new data structure
> is good enough for OOM task selection, we could go from the precise
> sum back to an approximation and just use that with the new data
> structure.

Exactly!
-- 
Michal Hocko
SUSE Labs
On 2026-01-16 16:55, Michal Hocko wrote:
> On Wed 14-01-26 14:36:44, Mathieu Desnoyers wrote:
>> On 2026-01-14 12:06, Michal Hocko wrote:
>>> On Wed 14-01-26 09:59:15, Mathieu Desnoyers wrote:
> [...]
> Thanks for those clarifications.
> 
>>> My overall impression is that the implementation is really involved and
>>> at this moment I do not really see a big benefit of all the complexity.
>>
>> Note that we can get the proc ABI RSS accuracy improvements with the
>> previous 2 patches without this 2-pass algo. Do you see more value in
>> the RSS accuracy improvements than in the oom killer latency reduction?
> 
> Yes, TBH I do not see oom latency as a big problem. As already
> mentioned, this is a slow path and we are not talking about a huge
> latency anyway. proc numbers are much more sensitive to latency as
> they are regularly read by user space tools and accuracy for those
> matters as well (being off by 100s of MB or GBs is simply making those
> numbers completely bogus).

It makes sense.

> 
>>> It would help to explicitly mention what is the overall imprecision of
>>> the oom victim selection with the new data structure (maybe this is
>>> good enough[*]). What if we go with exact precision with the new data
>>> structure compared to the original pcp counters.
>>
>> Do you mean comparing using approximate sums with the new data
>> structure (which has a bounded accuracy of O(nr_cpus*log(nr_cpus)))
>> compared to the old data structure, which had an inaccuracy of
>> O(nr_cpus^2)? So if the inaccuracy provided by the new data structure
>> is good enough for OOM task selection, we could go from the precise
>> sum back to an approximation and just use that with the new data
>> structure.
> 
> Exactly!

OK, so based on your feedback, I plan to remove this 2-pass algo from
the series, and simply keep using the precise sum for the OOM killer.
If people complain about its latency, then we can eventually use the
approximation provided by the hierarchical counters. But let's wait
until someone asks for it rather than add this complexity when there
is no need.

The hierarchical counters are still useful as they increase the
accuracy of approximations exported through /proc.

How does that sound?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Mon 26-01-26 11:39:33, Mathieu Desnoyers wrote:
> On 2026-01-16 16:55, Michal Hocko wrote:
> > On Wed 14-01-26 14:36:44, Mathieu Desnoyers wrote:
> > > On 2026-01-14 12:06, Michal Hocko wrote:
> > > > On Wed 14-01-26 09:59:15, Mathieu Desnoyers wrote:
> > [...]
> > Thanks for those clarifications.
> > > > My overall impression is that the implementation is really involved and
> > > > at this moment I do not really see a big benefit of all the complexity.
> > > 
> > > Note that we can get the proc ABI RSS accuracy improvements with the
> > > previous 2 patches without this 2-pass algo. Do you see more value in
> > > the RSS accuracy improvements than in the oom killer latency reduction?
> > 
> > Yes, TBH I do not see oom latency as a big problem. As already
> > mentioned, this is a slow path and we are not talking about a huge
> > latency anyway. proc numbers are much more sensitive to latency as
> > they are regularly read by user space tools and accuracy for those
> > matters as well (being off by 100s of MB or GBs is simply making
> > those numbers completely bogus).
> 
> It makes sense.
> 
> > > > It would help to explicitly mention what is the overall imprecision of
> > > > the oom victim selection with the new data structure (maybe this is
> > > > good enough[*]). What if we go with exact precision with the new data
> > > > structure compared to the original pcp counters.
> > > 
> > > Do you mean comparing using approximate sums with the new data
> > > structure (which has a bounded accuracy of O(nr_cpus*log(nr_cpus)))
> > > compared to the old data structure, which had an inaccuracy of
> > > O(nr_cpus^2)? So if the inaccuracy provided by the new data
> > > structure is good enough for OOM task selection, we could go from
> > > the precise sum back to an approximation and just use that with the
> > > new data structure.
> > 
> > Exactly!

> OK, so based on your feedback, I plan to remove this 2-pass algo from
> the series, and simply keep using the precise sum for the OOM killer.
> If people complain about its latency, then we can eventually use the
> approximation provided by the hierarchical counters. But let's wait
> until someone asks for it rather than add this complexity when there
> is no need.
> 
> The hierarchical counters are still useful as they increase the
> accuracy of approximations exported through /proc.
> 
> How does that sound?

Works for me. Thanks!
-- 
Michal Hocko
SUSE Labs