[PATCH 0/5 v1] mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY

Gang Li posted 5 patches 3 years, 12 months ago
[PATCH 0/5 v1] mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
Posted by Gang Li 3 years, 12 months ago
TLDR:
If a mempolicy is in effect (oc->constraint == CONSTRAINT_MEMORY_POLICY), out_of_memory() will
select a victim on the specific node to kill, so the kernel can avoid accidental kills on NUMA systems.

Problem:
Before this patch series, the OOM killer only kills the process with the highest memory usage,
selecting the process with the highest oom_badness across the entire system.

This works fine on UMA systems, but may cause accidental kills on NUMA systems.

As shown below, if process c.out is bound to Node1 and keeps allocating pages from Node1,
a.out will be killed first. But killing a.out doesn't free any memory on Node1, so c.out
will be killed next.

Many of our AMD machines have 8 NUMA nodes. On such systems, the chance of
triggering this problem is much higher.

OOM before patches:
```
Per-node process memory usage (in MBs)
PID             Node 0        Node 1      Total
----------- ---------- ------------- ----------
3095 a.out     3073.34          0.11    3073.45(Killed first. Maximum memory consumption)
3199 b.out      501.35       1500.00    2001.35
3805 c.out        1.52 (grow)2248.00    2249.52(Killed then. Node1 is full)
----------- ---------- ------------- ----------
Total          3576.21       3748.11    7324.31
```

Solution:
We store a per-node rss counter in mm_rss_stat for each process.

If a page allocation with a mempolicy in effect (oc->constraint == CONSTRAINT_MEMORY_POLICY)
triggers an OOM, we calculate oom_badness using the rss counters of the corresponding node, then
select the process with the highest oom_badness on that node to kill.

OOM after patches:
```
Per-node process memory usage (in MBs)
PID             Node 0        Node 1     Total
----------- ---------- ------------- ----------
3095 a.out     3073.34          0.11    3073.45
3199 b.out      501.35       1500.00    2001.35
3805 c.out        1.52 (grow)2248.00    2249.52(killed)
----------- ---------- ------------- ----------
Total          3576.21       3748.11    7324.31
```
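To illustrate, the per-node victim selection described above can be modelled in a few lines of userspace C. The struct and function names here are hypothetical stand-ins, not the series' actual API, and swap and page-table overhead are ignored for brevity:

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 2

/* Hypothetical userspace model of per-node rss accounting. */
struct task {
	const char *comm;
	long node_rss_mb[MAX_NODES];	/* resident memory per NUMA node, in MB */
};

/*
 * Pick the task with the highest rss on the constrained node, which is
 * roughly what oom_badness() does for CONSTRAINT_MEMORY_POLICY after
 * this series.
 */
static const struct task *pick_victim(const struct task *tasks, int n, int nid)
{
	const struct task *victim = NULL;
	long best = -1;
	int i;

	for (i = 0; i < n; i++) {
		if (tasks[i].node_rss_mb[nid] > best) {
			best = tasks[i].node_rss_mb[nid];
			victim = &tasks[i];
		}
	}
	return victim;
}
```

With the numbers from the tables above, an OOM constrained to Node1 selects c.out (2248 MB on Node1) rather than a.out, while an unconstrained system-wide ranking by total rss would pick a.out.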

Gang Li (5):
  mm: add a new parameter `node` to `get/add/inc/dec_mm_counter`
  mm: add numa_count field for rss_stat
  mm: add numa fields for tracepoint rss_stat
  mm: enable per numa node rss_stat count
  mm, oom: enable per numa node oom for CONSTRAINT_MEMORY_POLICY

 arch/s390/mm/pgtable.c        |   4 +-
 fs/exec.c                     |   2 +-
 fs/proc/base.c                |   6 +-
 fs/proc/task_mmu.c            |  14 ++--
 include/linux/mm.h            |  59 ++++++++++++-----
 include/linux/mm_types_task.h |  16 +++++
 include/linux/oom.h           |   2 +-
 include/trace/events/kmem.h   |  27 ++++++--
 kernel/events/uprobes.c       |   6 +-
 kernel/fork.c                 |  70 +++++++++++++++++++-
 mm/huge_memory.c              |  13 ++--
 mm/khugepaged.c               |   4 +-
 mm/ksm.c                      |   2 +-
 mm/madvise.c                  |   2 +-
 mm/memory.c                   | 116 ++++++++++++++++++++++++----------
 mm/migrate.c                  |   2 +
 mm/migrate_device.c           |   2 +-
 mm/oom_kill.c                 |  59 ++++++++++++-----
 mm/rmap.c                     |  16 ++---
 mm/swapfile.c                 |   4 +-
 mm/userfaultfd.c              |   2 +-
 21 files changed, 317 insertions(+), 111 deletions(-)

-- 
2.20.1
Re: [PATCH 0/5 v1] mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
Posted by Michal Hocko 3 years, 11 months ago
On Thu 12-05-22 12:46:29, Gang Li wrote:
> TLDR:
> If a mempolicy is in effect(oc->constraint == CONSTRAINT_MEMORY_POLICY), out_of_memory() will
> select victim on specific node to kill. So that kernel can avoid accidental killing on NUMA system.
> 
> Problem:
> Before this patch series, oom will only kill the process with the highest memory usage.
> by selecting process with the highest oom_badness on the entire system to kill.
> 
> This works fine on UMA system, but may have some accidental killing on NUMA system.
> 
> As shown below, if process c.out is bind to Node1 and keep allocating pages from Node1,
> a.out will be killed first. But killing a.out did't free any mem on Node1, so c.out
> will be killed then.
> 
> A lot of our AMD machines have 8 numa nodes. In these systems, there is a greater chance
> of triggering this problem.

Sorry, I have only now found this email thread. The limitation of the
NUMA constrained oom is well known and long standing. Basically the
whole thing is a best effort as we are lacking per numa node memory
stats. I can see that you are trying to fill up that gap but this is
not really free. Have you measured the runtime overhead? Accounting is
done in a very performance sensitive paths and it would be rather
unfortunate to make everybody pay the overhead while binding to a
specific node or sets of nodes is not the most common usecase.

Also have you tried to have a look at cpusets? Those should be easier to
make a proper selection for, as it should be possible to iterate over tasks
belonging to a specific cpuset much more easily - essentially something
similar to the memcg oom killer. We do not do that right now and by a very
brief look at CONSTRAINT_CPUSET it seems that this code is not
really doing much these days. Maybe that would be a more appropriate way
to deal with more precise node aware oom killing?

[...]
>  21 files changed, 317 insertions(+), 111 deletions(-)

The code footprint is not free either. And more importantly, does this
even work much more reliably? I can see quite some NUMA_NO_NODE
accounting (e.g. copy_pte_range!). Is this somehow fixable?

Also how do those numbers add up. Let's say you increase the counter as
NUMA_NO_NODE but later on during the clean up you decrease based on the
page node?

Last but not least I am really not following MM_NO_TYPE concept. I can
only see add_mm_counter users without any decrements. What is going on
there?
-- 
Michal Hocko
SUSE Labs
Re: [PATCH 0/5 v1] mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
Posted by Gang Li 3 years, 10 months ago
Hi, I've done some benchmarking in the last few days.

On 2022/5/17 00:44, Michal Hocko wrote:
> Sorry, I have only now found this email thread. The limitation of the
> NUMA constrained oom is well known and long standing. Basically the
> whole thing is a best effort as we are lacking per numa node memory
> stats. I can see that you are trying to fill up that gap but this is
> not really free. Have you measured the runtime overhead? Accounting is
> done in a very performance sensitive paths and it would be rather
> unfortunate to make everybody pay the overhead while binding to a
> specific node or sets of nodes is not the most common usecase.

## CPU consumption

According to the UnixBench results, there is less than one percent
performance loss in most cases.

On a 40-core, 512 GB machine:

40 parallel copies of tests:
+----------+----------+-----+----------+---------+---------+---------+
| numastat | FileCopy | ... |   Pipe   |  Fork   | syscall |  total  |
+----------+----------+-----+----------+---------+---------+---------+
| off      | 2920.24  | ... | 35926.58 | 6980.14 | 2617.18 | 8484.52 |
| on       | 2919.15  | ... | 36066.07 | 6835.01 | 2724.82 | 8461.24 |
| overhead | 0.04%    | ... | -0.39%   | 2.12%   | -3.95%  | 0.28%   |
+----------+----------+-----+----------+---------+---------+---------+

1 parallel copy of tests:
+----------+----------+-----+---------+--------+---------+---------+
| numastat | FileCopy | ... |  Pipe   |  Fork  | syscall |  total  |
+----------+----------+-----+---------+--------+---------+---------+
| off      | 1515.37  | ... | 1473.97 | 546.88 | 1152.37 | 1671.2  |
| on       | 1508.09  | ... | 1473.75 | 532.61 | 1148.83 | 1662.72 |
| overhead | 0.48%    | ... | 0.01%   | 2.68%  | 0.31%   | 0.51%   |
+----------+----------+-----+---------+--------+---------+---------+
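Reading the tables, the overhead column appears to be (off - on) / on, i.e. the score delta relative to the patched kernel. That is my inference from the numbers, not something stated in the thread:

```c
#include <assert.h>
#include <math.h>

/* Overhead in percent, relative to the patched ("on") score; negative
 * values mean the patched kernel scored higher. */
static double overhead_pct(double off, double on)
{
	return (off - on) / on * 100.0;
}
```

For example, overhead_pct(546.88, 532.61) reproduces the 2.68% Fork entry of the single-copy table.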

## MEM consumption

per task_struct:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes

per mm_struct:
sizeof(atomic_long_t) * num_possible_nodes() + sizeof(atomic_long_t*)
typically 8 * 2 + 8 bytes

zap_pte_range:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes
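On an LP64 system with two possible nodes, the estimates above work out as follows (atomic_long_t is modelled as a plain long here, which matches its size in the kernel):

```c
#include <assert.h>
#include <stddef.h>

typedef long atomic_long_t;	/* userspace stand-in for the kernel type */

/* Extra bytes per task_struct: one int per possible node plus the
 * pointer to the array. */
static size_t per_task_overhead(int nodes)
{
	return sizeof(int) * nodes + sizeof(int *);
}

/* Extra bytes per mm_struct: one atomic_long_t per possible node plus
 * the pointer to the array. */
static size_t per_mm_overhead(int nodes)
{
	return sizeof(atomic_long_t) * nodes + sizeof(atomic_long_t *);
}
```

per_task_overhead(2) gives 16 bytes and per_mm_overhead(2) gives 24 bytes on LP64, matching the "typically" figures quoted above.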

> Also have you tried to have a look at cpusets? Those should be easier to
> make a proper selection as it should be possible to iterate over tasks
> belonging to a specific cpuset much more easier - essentialy something
> similar to memcg oom killer. We do not do that right now and by a very
> brief look at the CONSTRAINT_CPUSET it seems that this code is not
> really doing much these days. Maybe that would be a more appropriate way
> to deal with more precise node aware oom killing?

Looks like both CONSTRAINT_MEMORY_POLICY and CONSTRAINT_CPUSET can
be used to deal with node-aware oom killing.

I think we can calculate badness in this way:
    If constraint=CONSTRAINT_MEMORY_POLICY, get badness by `nodemask`.
    If constraint=CONSTRAINT_CPUSET, get badness by `mems_allowed`.

example code:
```
long oom_badness(struct task_struct *p, struct oom_control *oc)
{
	long points = 0;
	int nid;

	...

	if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY)) {
		for_each_node_mask(nid, *oc->nodemask)
			points += get_mm_counter(p->mm, -1, nid);
	} else if (unlikely(oc->constraint == CONSTRAINT_CPUSET)) {
		for_each_node_mask(nid, cpuset_current_mems_allowed)
			points += get_mm_counter(p->mm, -1, nid);
	} else {
		points = get_mm_rss(p->mm);
	}
	points += get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE)
		+ mm_pgtables_bytes(p->mm) / PAGE_SIZE;

	...

}
```

> 
> [...]
>>   21 files changed, 317 insertions(+), 111 deletions(-)
> 
> The code footprint is not free either. And more importantnly does this
> even work much more reliably? I can see quite some NUMA_NO_NODE
> accounting (e.g. copy_pte_range!).Is this somehow fixable?

> Also how do those numbers add up. Let's say you increase the counter as
> NUMA_NO_NODE but later on during the clean up you decrease based on the
> page node?
> Last but not least I am really not following MM_NO_TYPE concept. I can
> only see add_mm_counter users without any decrements. What is going on
> there?

There are two usage scenarios of NUMA_NO_NODE in this patch.

1. A placeholder when pages are swapped in and out of the swapfile:
```
	// mem to swapfile
	dec_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page));
	inc_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE);

	// swapfile to mem
	inc_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page));
	dec_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE);
```

In *_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE),
NUMA_NO_NODE is a placeholder: it means the page no longer resides on
any node.

2. A placeholder in `add_mm_rss_vec` and `sync_mm_rss` for per-process mm
counter synchronization when SPLIT_RSS_COUNTING is enabled.


MM_NO_TYPE is likewise a placeholder in `*_mm_counter`, `add_mm_rss_vec` and
`sync_mm_rss`.

These placeholders are rather awkward. Maybe I should introduce a helper
function for mm->rss_stat.numa_count accounting instead of using
placeholders.
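One possible shape for such a helper, modelled in userspace C (the names are entirely hypothetical, not from the series): skip the per-node array when there is no node, so swap entries never need a NUMA_NO_NODE placeholder update:

```c
#include <assert.h>

#define MAX_NODES	2
#define NUMA_NO_NODE	(-1)

/* Userspace model of the proposed per-node counter array. */
struct rss_stat_model {
	long numa_count[MAX_NODES];
};

/*
 * Hypothetical helper: only touch the per-node counter when the page
 * actually lives on a node.  Callers accounting swap entries pass
 * NUMA_NO_NODE and the per-node state is left untouched, instead of
 * carrying a placeholder value through *_mm_counter().
 */
static void mod_numa_count(struct rss_stat_model *rs, int nid, long delta)
{
	if (nid != NUMA_NO_NODE)
		rs->numa_count[nid] += delta;
}
```

A swap-out of an anonymous page on node 1 then becomes a single mod_numa_count(&rs, 1, -1) for the anon counter, with no per-node update at all for the MM_SWAPENTS side.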
Re: [PATCH 0/5 v1] mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
Posted by Suren Baghdasaryan 3 years, 12 months ago
On Wed, May 11, 2022 at 9:47 PM Gang Li <ligang.bdlg@bytedance.com> wrote:
>
> TLDR:
> If a mempolicy is in effect(oc->constraint == CONSTRAINT_MEMORY_POLICY), out_of_memory() will
> select victim on specific node to kill. So that kernel can avoid accidental killing on NUMA system.
>
> Problem:
> Before this patch series, oom will only kill the process with the highest memory usage.
> by selecting process with the highest oom_badness on the entire system to kill.
>
> This works fine on UMA system, but may have some accidental killing on NUMA system.
>
> As shown below, if process c.out is bind to Node1 and keep allocating pages from Node1,
> a.out will be killed first. But killing a.out did't free any mem on Node1, so c.out
> will be killed then.
>
> A lot of our AMD machines have 8 numa nodes. In these systems, there is a greater chance
> of triggering this problem.
>
> OOM before patches:
> ```
> Per-node process memory usage (in MBs)
> PID             Node 0        Node 1      Total
> ----------- ---------- ------------- ----------
> 3095 a.out     3073.34          0.11    3073.45(Killed first. Maximum memory consumption)
> 3199 b.out      501.35       1500.00    2001.35
> 3805 c.out        1.52 (grow)2248.00    2249.52(Killed then. Node1 is full)
> ----------- ---------- ------------- ----------
> Total          3576.21       3748.11    7324.31
> ```
>
> Solution:
> We store per node rss in mm_rss_stat for each process.
>
> If a page allocation with mempolicy in effect(oc->constraint == CONSTRAINT_MEMORY_POLICY)
> triger oom. We will calculate oom_badness with rss counter for the corresponding node. Then
> select the process with the highest oom_badness on the corresponding node to kill.
>
> OOM after patches:
> ```
> Per-node process memory usage (in MBs)
> PID             Node 0        Node 1     Total
> ----------- ---------- ------------- ----------
> 3095 a.out     3073.34          0.11    3073.45
> 3199 b.out      501.35       1500.00    2001.35
> 3805 c.out        1.52 (grow)2248.00    2249.52(killed)
> ----------- ---------- ------------- ----------
> Total          3576.21       3748.11    7324.31
> ```

You included lots of people but missed Michal Hocko. CC'ing him;
please include him in future postings.

>
> Gang Li (5):
>   mm: add a new parameter `node` to `get/add/inc/dec_mm_counter`
>   mm: add numa_count field for rss_stat
>   mm: add numa fields for tracepoint rss_stat
>   mm: enable per numa node rss_stat count
>   mm, oom: enable per numa node oom for CONSTRAINT_MEMORY_POLICY
>
>  arch/s390/mm/pgtable.c        |   4 +-
>  fs/exec.c                     |   2 +-
>  fs/proc/base.c                |   6 +-
>  fs/proc/task_mmu.c            |  14 ++--
>  include/linux/mm.h            |  59 ++++++++++++-----
>  include/linux/mm_types_task.h |  16 +++++
>  include/linux/oom.h           |   2 +-
>  include/trace/events/kmem.h   |  27 ++++++--
>  kernel/events/uprobes.c       |   6 +-
>  kernel/fork.c                 |  70 +++++++++++++++++++-
>  mm/huge_memory.c              |  13 ++--
>  mm/khugepaged.c               |   4 +-
>  mm/ksm.c                      |   2 +-
>  mm/madvise.c                  |   2 +-
>  mm/memory.c                   | 116 ++++++++++++++++++++++++----------
>  mm/migrate.c                  |   2 +
>  mm/migrate_device.c           |   2 +-
>  mm/oom_kill.c                 |  59 ++++++++++++-----
>  mm/rmap.c                     |  16 ++---
>  mm/swapfile.c                 |   4 +-
>  mm/userfaultfd.c              |   2 +-
>  21 files changed, 317 insertions(+), 111 deletions(-)
>
> --
> 2.20.1
>