* [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}
@ 2022-07-08 8:21 Gang Li
  2022-07-08 8:21 ` [PATCH v2 1/5] mm: add a new parameter `node` to `get/add/inc/dec_mm_counter` Gang Li
                  ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Gang Li @ 2022-07-08 8:21 UTC (permalink / raw)
  To: mhocko, akpm, surenb
  Cc: hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook,
      rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin,
      jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner,
      stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16,
      Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra,
      xianting.tian, willy, ccross, vbabka, sujiaxun, sfr,
      vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy,
      fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm,
      linux-perf-users, Gang Li

TLDR
----
If a mempolicy or cpuset is in effect, out_of_memory() selects its victim
on the specific node, so that the kernel avoids accidentally killing
processes on other nodes of a NUMA system.

Problem
-------
Before this patch series, the OOM killer simply picks the process with the
highest oom_badness (i.e. the highest memory usage) on the entire system.
This works fine on UMA systems, but can lead to accidental kills on NUMA
systems.

As shown below, if process c.out is bound to Node1 and keeps allocating
pages from Node1, a.out is killed first. But killing a.out does not free
any memory on Node1, so c.out is killed next. Many AMD machines have
8 NUMA nodes; on such systems the chance of triggering this problem is
even higher.

OOM before patches:
```
Per-node process memory usage (in MBs)
PID              Node 0      Node 1         Total
-----------  ----------  --------------  --------
3095 a.out      3073.34            0.11   3073.45  (Killed first. Max mem usage)
3199 b.out       501.35         1500.00   2001.35
3805 c.out         1.52   (grow)2248.00   2249.52  (Killed then. Node1 is full)
-----------  ----------  --------------  --------
Total           3576.21         3748.11   7324.31
```

Solution
--------
We store per-node rss in mm_rss_stat for each process. If a page
allocation with a mempolicy or cpuset in effect triggers OOM, we
calculate oom_badness using the rss counter of the corresponding node,
and then select the process with the highest oom_badness on that node
to kill.

OOM after patches:
```
Per-node process memory usage (in MBs)
PID              Node 0      Node 1         Total
-----------  ----------  --------------  ----------
3095 a.out      3073.34            0.11   3073.45
3199 b.out       501.35         1500.00   2001.35
3805 c.out         1.52   (grow)2248.00   2249.52  (killed)
-----------  ----------  --------------  ----------
Total           3576.21         3748.11   7324.31
```

Overhead
--------
CPU:

According to UnixBench results, there is less than one percent performance
loss in most cases, measured on a 40c512g machine.

40 parallel copies of tests:
+----------+----------+-----+----------+---------+---------+---------+
| numastat | FileCopy | ... |   Pipe   |  Fork   | syscall |  total  |
+----------+----------+-----+----------+---------+---------+---------+
| off      |  2920.24 | ... | 35926.58 | 6980.14 | 2617.18 | 8484.52 |
| on       |  2919.15 | ... | 36066.07 | 6835.01 | 2724.82 | 8461.24 |
| overhead |    0.04% | ... |   -0.39% |   2.12% |  -3.95% |   0.28% |
+----------+----------+-----+----------+---------+---------+---------+

1 parallel copy of tests:
+----------+----------+-----+---------+--------+---------+---------+
| numastat | FileCopy | ... |  Pipe   |  Fork  | syscall |  total  |
+----------+----------+-----+---------+--------+---------+---------+
| off      |  1515.37 | ... | 1473.97 | 546.88 | 1152.37 |  1671.2 |
| on       |  1508.09 | ... | 1473.75 | 532.61 | 1148.83 | 1662.72 |
| overhead |    0.48% | ... |   0.01% |  2.68% |   0.31% |   0.51% |
+----------+----------+-----+---------+--------+---------+---------+

MEM:

per task_struct:
  sizeof(int) * num_possible_nodes() + sizeof(int*)
  typically 4 * 2 + 8 bytes

per mm_struct:
  sizeof(atomic_long_t) * num_possible_nodes() + sizeof(atomic_long_t*)
  typically 8 * 2 + 8 bytes

zap_pte_range:
  sizeof(int) * num_possible_nodes() + sizeof(int*)
  typically 4 * 2 + 8 bytes

Changelog
---------
v2:
- enable per numa node oom for `CONSTRAINT_CPUSET`.
- add benchmark result in cover letter.

Gang Li (5):
  mm: add a new parameter `node` to `get/add/inc/dec_mm_counter`
  mm: add numa_count field for rss_stat
  mm: add numa fields for tracepoint rss_stat
  mm: enable per numa node rss_stat count
  mm, oom: enable per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

 arch/s390/mm/pgtable.c        |   4 +-
 fs/exec.c                     |   2 +-
 fs/proc/base.c                |   6 +-
 fs/proc/task_mmu.c            |  14 ++--
 include/linux/mm.h            |  59 ++++++++++++-----
 include/linux/mm_types_task.h |  16 +++++
 include/linux/oom.h           |   2 +-
 include/trace/events/kmem.h   |  27 ++++++--
 kernel/events/uprobes.c       |   6 +-
 kernel/fork.c                 |  70 +++++++++++++++++++-
 mm/huge_memory.c              |  13 ++--
 mm/khugepaged.c               |   4 +-
 mm/ksm.c                      |   2 +-
 mm/madvise.c                  |   2 +-
 mm/memory.c                   | 119 ++++++++++++++++++++++++----------
 mm/migrate.c                  |   4 ++
 mm/migrate_device.c           |   2 +-
 mm/oom_kill.c                 |  69 +++++++++++++++-----
 mm/rmap.c                     |  19 +++---
 mm/swapfile.c                 |   6 +-
 mm/userfaultfd.c              |   2 +-
 21 files changed, 335 insertions(+), 113 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 14+ messages in thread
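
[Illustrative note, not part of the posting.] The heart of the series is the change to
victim selection in patch 5 below. As a rough sketch (simplified; the real code also
handles oom_score_adj normalization, locking and the unkillable-task checks), the
scoring boils down to the following. `get_nid_from_oom_control()` and `MM_NO_TYPE`
are helpers introduced by the series itself.

```c
/* Sketch only: how oom_badness() scores a task once this series is applied. */
static long sketch_oom_badness(struct task_struct *p, struct oom_control *oc)
{
        long points;

        if (oc->constraint == CONSTRAINT_MEMORY_POLICY ||
            oc->constraint == CONSTRAINT_CPUSET) {
                /* node the constrained allocation is limited to */
                int nid = get_nid_from_oom_control(oc);

                /* per-node rss only: memory on other nodes is irrelevant here */
                points = get_mm_counter(p->mm, MM_NO_TYPE, nid);
        } else {
                /* unconstrained OOM: keep the old system-wide accounting */
                points = get_mm_rss(p->mm);
        }

        /* swap entries and page tables are not tracked per node */
        points += get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) +
                  mm_pgtables_bytes(p->mm) / PAGE_SIZE;
        return points;
}
```
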
* [PATCH v2 1/5] mm: add a new parameter `node` to `get/add/inc/dec_mm_counter` 2022-07-08 8:21 [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li @ 2022-07-08 8:21 ` Gang Li 2022-07-08 8:21 ` [PATCH v2 2/5] mm: add numa_count field for rss_stat Gang Li ` (4 subsequent siblings) 5 siblings, 0 replies; 14+ messages in thread From: Gang Li @ 2022-07-08 8:21 UTC (permalink / raw) To: mhocko, akpm, surenb, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Christian Borntraeger, Sven Schnelle, Eric Biederman, Kees Cook, Alexander Viro, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim Cc: rostedt, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users, Gang Li Add a new parameter `node` to mm_counter for counting per node rss. Use page_to_nid(page) to get node id from page. before: dec_mm_counter(vma->vm_mm, MM_ANONPAGES); after: dec_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page)); If a page is swapped out, it no longer exists on any numa node. (Or it is swapped from disk into a specific numa node.) Thus when we call *_mm_counter(MM_ANONPAGES), the `node` field should be `NUMA_NO_NODE`. For example: ``` swap_out(){ dec_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page)); inc_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE); } ``` Pages can be migrated between nodes. `remove_migration_pte` must call `add_mm_counter` now. Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> --- arch/s390/mm/pgtable.c | 4 +- fs/exec.c | 2 +- fs/proc/task_mmu.c | 14 +++--- include/linux/mm.h | 14 +++--- include/linux/mm_types_task.h | 10 ++++ kernel/events/uprobes.c | 6 +-- mm/huge_memory.c | 13 ++--- mm/khugepaged.c | 4 +- mm/ksm.c | 2 +- mm/madvise.c | 2 +- mm/memory.c | 94 +++++++++++++++++++++++------------ mm/migrate.c | 4 ++ mm/migrate_device.c | 2 +- mm/oom_kill.c | 16 +++--- mm/rmap.c | 19 ++++--- mm/swapfile.c | 6 +-- mm/userfaultfd.c | 2 +- 17 files changed, 132 insertions(+), 82 deletions(-) diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index 4909dcd762e8..b8306765cd63 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -703,11 +703,11 @@ void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep) static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) { if (!non_swap_entry(entry)) - dec_mm_counter(mm, MM_SWAPENTS); + dec_mm_counter(mm, MM_SWAPENTS, NUMA_NO_NODE); else if (is_migration_entry(entry)) { struct page *page = pfn_swap_entry_to_page(entry); - dec_mm_counter(mm, mm_counter(page)); + dec_mm_counter(mm, mm_counter(page), page_to_nid(page)); } free_swap_and_cache(entry); } diff --git a/fs/exec.c b/fs/exec.c index 5f0656e10b5d..99825c06d0c2 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -192,7 +192,7 @@ static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages) return; bprm->vma_pages = pages; - add_mm_counter(mm, MM_ANONPAGES, diff); + add_mm_counter(mm, MM_ANONPAGES, diff, NUMA_NO_NODE); } static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 34d292cec79a..24d33d1011d9 100644 --- a/fs/proc/task_mmu.c +++ 
b/fs/proc/task_mmu.c @@ -32,9 +32,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) unsigned long text, lib, swap, anon, file, shmem; unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss; - anon = get_mm_counter(mm, MM_ANONPAGES); - file = get_mm_counter(mm, MM_FILEPAGES); - shmem = get_mm_counter(mm, MM_SHMEMPAGES); + anon = get_mm_counter(mm, MM_ANONPAGES, NUMA_NO_NODE); + file = get_mm_counter(mm, MM_FILEPAGES, NUMA_NO_NODE); + shmem = get_mm_counter(mm, MM_SHMEMPAGES, NUMA_NO_NODE); /* * Note: to minimize their overhead, mm maintains hiwater_vm and @@ -55,7 +55,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) text = min(text, mm->exec_vm << PAGE_SHIFT); lib = (mm->exec_vm << PAGE_SHIFT) - text; - swap = get_mm_counter(mm, MM_SWAPENTS); + swap = get_mm_counter(mm, MM_SWAPENTS, NUMA_NO_NODE); SEQ_PUT_DEC("VmPeak:\t", hiwater_vm); SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm); SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm); @@ -88,12 +88,12 @@ unsigned long task_statm(struct mm_struct *mm, unsigned long *shared, unsigned long *text, unsigned long *data, unsigned long *resident) { - *shared = get_mm_counter(mm, MM_FILEPAGES) + - get_mm_counter(mm, MM_SHMEMPAGES); + *shared = get_mm_counter(mm, MM_FILEPAGES, NUMA_NO_NODE) + + get_mm_counter(mm, MM_SHMEMPAGES, NUMA_NO_NODE); *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> PAGE_SHIFT; *data = mm->data_vm + mm->stack_vm; - *resident = *shared + get_mm_counter(mm, MM_ANONPAGES); + *resident = *shared + get_mm_counter(mm, MM_ANONPAGES, NUMA_NO_NODE); return mm->total_vm; } diff --git a/include/linux/mm.h b/include/linux/mm.h index 794ad19b57f8..84ce6e1b1252 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2026,7 +2026,7 @@ static inline bool get_user_page_fast_only(unsigned long addr, /* * per-process(per-mm_struct) statistics. 
*/ -static inline unsigned long get_mm_counter(struct mm_struct *mm, int member) +static inline unsigned long get_mm_counter(struct mm_struct *mm, int member, int node) { long val = atomic_long_read(&mm->rss_stat.count[member]); @@ -2043,21 +2043,21 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member) void mm_trace_rss_stat(struct mm_struct *mm, int member, long count); -static inline void add_mm_counter(struct mm_struct *mm, int member, long value) +static inline void add_mm_counter(struct mm_struct *mm, int member, long value, int node) { long count = atomic_long_add_return(value, &mm->rss_stat.count[member]); mm_trace_rss_stat(mm, member, count); } -static inline void inc_mm_counter(struct mm_struct *mm, int member) +static inline void inc_mm_counter(struct mm_struct *mm, int member, int node) { long count = atomic_long_inc_return(&mm->rss_stat.count[member]); mm_trace_rss_stat(mm, member, count); } -static inline void dec_mm_counter(struct mm_struct *mm, int member) +static inline void dec_mm_counter(struct mm_struct *mm, int member, int node) { long count = atomic_long_dec_return(&mm->rss_stat.count[member]); @@ -2081,9 +2081,9 @@ static inline int mm_counter(struct page *page) static inline unsigned long get_mm_rss(struct mm_struct *mm) { - return get_mm_counter(mm, MM_FILEPAGES) + - get_mm_counter(mm, MM_ANONPAGES) + - get_mm_counter(mm, MM_SHMEMPAGES); + return get_mm_counter(mm, MM_FILEPAGES, NUMA_NO_NODE) + + get_mm_counter(mm, MM_ANONPAGES, NUMA_NO_NODE) + + get_mm_counter(mm, MM_SHMEMPAGES, NUMA_NO_NODE); } static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm) diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h index 0bb4b6da9993..32512af31721 100644 --- a/include/linux/mm_types_task.h +++ b/include/linux/mm_types_task.h @@ -36,6 +36,16 @@ enum { NR_MM_COUNTERS }; +/* + * This macro should only be used in committing local values, like sync_mm_rss, + * add_mm_rss_vec. It means don't count per-mm-type, only count per-node in + * mm_stat. + * + * `MM_NO_TYPE` must equals to `NR_MM_COUNTERS`, since we will use it in + * `TRACE_MM_PAGES`. 
+ */ +#define MM_NO_TYPE NR_MM_COUNTERS + #if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU) #define SPLIT_RSS_COUNTING /* per-thread cached information, */ diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 401bc2d24ce0..f5b0db3494a3 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -184,11 +184,11 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, lru_cache_add_inactive_or_unevictable(new_page, vma); } else /* no new page, just dec_mm_counter for old_page */ - dec_mm_counter(mm, MM_ANONPAGES); + dec_mm_counter(mm, MM_ANONPAGES, page_to_nid(old_page)); if (!PageAnon(old_page)) { - dec_mm_counter(mm, mm_counter_file(old_page)); - inc_mm_counter(mm, MM_ANONPAGES); + dec_mm_counter(mm, mm_counter_file(old_page), page_to_nid(old_page)); + inc_mm_counter(mm, MM_ANONPAGES, page_to_nid(new_page)); } flush_cache_page(vma, addr, pte_pfn(*pvmw.pte)); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index fd9d502aadc4..b7fd7df70e7c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -809,7 +809,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR, page_to_nid(page)); mm_inc_nr_ptes(vma->vm_mm); spin_unlock(vmf->ptl); count_vm_event(THP_FAULT_ALLOC); @@ -1220,7 +1220,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd = pmd_swp_mkuffd_wp(pmd); set_pmd_at(src_mm, addr, src_pmd, pmd); } - add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR, page_to_nid(pmd_page(*dst_pmd))); mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); if (!userfaultfd_wp(dst_vma)) @@ -1263,7 +1263,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, __split_huge_pmd(src_vma, src_pmd, addr, false, NULL); return -EAGAIN; } - add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR, page_to_nid(src_page)); out_zero_page: mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); @@ -1753,11 +1753,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, if (PageAnon(page)) { zap_deposited_table(tlb->mm, pmd); - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR, page_to_nid(page)); } else { if (arch_needs_pgtable_deposit()) zap_deposited_table(tlb->mm, pmd); - add_mm_counter(tlb->mm, mm_counter_file(page), -HPAGE_PMD_NR); + add_mm_counter(tlb->mm, mm_counter_file(page), -HPAGE_PMD_NR, + page_to_nid(page)); } spin_unlock(ptl); @@ -2143,7 +2144,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, page_remove_rmap(page, vma, true); put_page(page); } - add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR); + add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR, page_to_nid(page)); return; } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index cfe231c5958f..74d4c578a91c 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -687,7 +687,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page, if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { clear_user_highpage(page, address); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1, 
page_to_nid(page)); if (is_zero_pfn(pte_pfn(pteval))) { /* * ptl mostly unnecessary. @@ -1469,7 +1469,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) /* step 3: set proper refcount and mm_counters. */ if (count) { page_ref_sub(hpage, count); - add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count); + add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count, page_to_nid(hpage)); } /* step 4: collapse pmd */ diff --git a/mm/ksm.c b/mm/ksm.c index 63b4b9d71597..4dc4b78d6f9b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1180,7 +1180,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, * will get wrong values in /proc, and a BUG message in dmesg * when tearing down the mm. */ - dec_mm_counter(mm, MM_ANONPAGES); + dec_mm_counter(mm, MM_ANONPAGES, page_to_nid(page)); } flush_cache_page(vma, addr, pte_pfn(*ptep)); diff --git a/mm/madvise.c b/mm/madvise.c index 851fa4e134bc..46229b70cbbe 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -715,7 +715,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (current->mm == mm) sync_mm_rss(mm); - add_mm_counter(mm, MM_SWAPENTS, nr_swap); + add_mm_counter(mm, MM_SWAPENTS, nr_swap, NUMA_NO_NODE); } arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_pte, ptl); diff --git a/mm/memory.c b/mm/memory.c index 8917bea2f0bc..bb24da767f79 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -161,6 +161,8 @@ EXPORT_SYMBOL(zero_pfn); unsigned long highest_memmap_pfn __read_mostly; +static DEFINE_PER_CPU(int, percpu_numa_rss[MAX_NUMNODES]); + /* * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init() */ @@ -184,24 +186,24 @@ void sync_mm_rss(struct mm_struct *mm) for (i = 0; i < NR_MM_COUNTERS; i++) { if (current->rss_stat.count[i]) { - add_mm_counter(mm, i, current->rss_stat.count[i]); + add_mm_counter(mm, i, current->rss_stat.count[i], NUMA_NO_NODE); current->rss_stat.count[i] = 0; } } current->rss_stat.events = 0; } -static void add_mm_counter_fast(struct mm_struct *mm, int member, int val) +static void add_mm_counter_fast(struct mm_struct *mm, int member, int val, int node) { struct task_struct *task = current; if (likely(task->mm == mm)) task->rss_stat.count[member] += val; else - add_mm_counter(mm, member, val); + add_mm_counter(mm, member, val, node); } -#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1) -#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1) +#define inc_mm_counter_fast(mm, member, node) add_mm_counter_fast(mm, member, 1, node) +#define dec_mm_counter_fast(mm, member, node) add_mm_counter_fast(mm, member, -1, node) /* sync counter once per 64 page faults */ #define TASK_RSS_EVENTS_THRESH (64) @@ -214,8 +216,8 @@ static void check_sync_rss_stat(struct task_struct *task) } #else /* SPLIT_RSS_COUNTING */ -#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member) -#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member) +#define inc_mm_counter_fast(mm, member, node) inc_mm_counter(mm, member, node) +#define dec_mm_counter_fast(mm, member, node) dec_mm_counter(mm, member, node) static void check_sync_rss_stat(struct task_struct *task) { @@ -502,12 +504,13 @@ int __pte_alloc_kernel(pmd_t *pmd) return 0; } -static inline void init_rss_vec(int *rss) +static inline void init_rss_vec(int *rss, int *numa_rss) { memset(rss, 0, sizeof(int) * NR_MM_COUNTERS); + memset(numa_rss, 0, sizeof(int) * num_possible_nodes()); } -static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) +static inline void 
add_mm_rss_vec(struct mm_struct *mm, int *rss, int *numa_rss) { int i; @@ -515,7 +518,7 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) sync_mm_rss(mm); for (i = 0; i < NR_MM_COUNTERS; i++) if (rss[i]) - add_mm_counter(mm, i, rss[i]); + add_mm_counter(mm, i, rss[i], NUMA_NO_NODE); } /* @@ -792,7 +795,8 @@ try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma, static unsigned long copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma, - struct vm_area_struct *src_vma, unsigned long addr, int *rss) + struct vm_area_struct *src_vma, unsigned long addr, int *rss, + int *numa_rss) { unsigned long vm_flags = dst_vma->vm_flags; pte_t pte = *src_pte; @@ -817,10 +821,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, set_pte_at(src_mm, addr, src_pte, pte); } rss[MM_SWAPENTS]++; + numa_rss[page_to_nid(pte_page(*dst_pte))]++; } else if (is_migration_entry(entry)) { page = pfn_swap_entry_to_page(entry); rss[mm_counter(page)]++; + numa_rss[page_to_nid(page)]++; if (!is_readable_migration_entry(entry) && is_cow_mapping(vm_flags)) { @@ -852,6 +858,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, */ get_page(page); rss[mm_counter(page)]++; + numa_rss[page_to_nid(page)]++; + /* Cannot fail as these pages cannot get pinned. */ BUG_ON(page_try_dup_anon_rmap(page, false, src_vma)); @@ -912,7 +920,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, static inline int copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, - struct page **prealloc, struct page *page) + struct page **prealloc, struct page *page, int *numa_rss) { struct page *new_page; pte_t pte; @@ -931,6 +939,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma page_add_new_anon_rmap(new_page, dst_vma, addr); lru_cache_add_inactive_or_unevictable(new_page, dst_vma); rss[mm_counter(new_page)]++; + rss[page_to_nid(new_page)]++; /* All done, just insert the new page copy in the child */ pte = mk_pte(new_page, dst_vma->vm_page_prot); @@ -949,7 +958,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma static inline int copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, - struct page **prealloc) + struct page **prealloc, int *numa_rss) { struct mm_struct *src_mm = src_vma->vm_mm; unsigned long vm_flags = src_vma->vm_flags; @@ -969,13 +978,15 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, /* Page maybe pinned, we have to copy. 
*/ put_page(page); return copy_present_page(dst_vma, src_vma, dst_pte, src_pte, - addr, rss, prealloc, page); + addr, rss, prealloc, page, numa_rss); } rss[mm_counter(page)]++; + numa_rss[page_to_nid(page)]++; } else if (page) { get_page(page); page_dup_file_rmap(page, false); rss[mm_counter(page)]++; + numa_rss[page_to_nid(page)]++; } /* @@ -1034,12 +1045,16 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, spinlock_t *src_ptl, *dst_ptl; int progress, ret = 0; int rss[NR_MM_COUNTERS]; + int *numa_rss; swp_entry_t entry = (swp_entry_t){0}; struct page *prealloc = NULL; + numa_rss = kcalloc(num_possible_nodes(), sizeof(int), GFP_KERNEL); + if (unlikely(!numa_rss)) + numa_rss = (int *)get_cpu_ptr(&percpu_numa_rss); again: progress = 0; - init_rss_vec(rss); + init_rss_vec(rss, numa_rss); dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl); if (!dst_pte) { @@ -1072,7 +1087,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, ret = copy_nonpresent_pte(dst_mm, src_mm, dst_pte, src_pte, dst_vma, src_vma, - addr, rss); + addr, rss, numa_rss); if (ret == -EIO) { entry = pte_to_swp_entry(*src_pte); break; @@ -1091,7 +1106,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, } /* copy_present_pte() will clear `*prealloc' if consumed */ ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, - addr, rss, &prealloc); + addr, rss, &prealloc, numa_rss); /* * If we need a pre-allocated page for this pte, drop the * locks, allocate, and try again. @@ -1114,7 +1129,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, arch_leave_lazy_mmu_mode(); spin_unlock(src_ptl); pte_unmap(orig_src_pte); - add_mm_rss_vec(dst_mm, rss); + add_mm_rss_vec(dst_mm, rss, numa_rss); pte_unmap_unlock(orig_dst_pte, dst_ptl); cond_resched(); @@ -1143,6 +1158,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, out: if (unlikely(prealloc)) put_page(prealloc); + if (unlikely(numa_rss == (int *)raw_cpu_ptr(&percpu_numa_rss))) + put_cpu_ptr(numa_rss); + else + kfree(numa_rss); return ret; } @@ -1415,14 +1434,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, struct mm_struct *mm = tlb->mm; int force_flush = 0; int rss[NR_MM_COUNTERS]; + int *numa_rss; spinlock_t *ptl; pte_t *start_pte; pte_t *pte; swp_entry_t entry; + numa_rss = kcalloc(num_possible_nodes(), sizeof(int), GFP_KERNEL); + if (unlikely(!numa_rss)) + numa_rss = (int *)get_cpu_ptr(&percpu_numa_rss); tlb_change_page_size(tlb, PAGE_SIZE); again: - init_rss_vec(rss); + init_rss_vec(rss, numa_rss); start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); pte = start_pte; flush_tlb_batched_pending(mm); @@ -1459,6 +1482,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, mark_page_accessed(page); } rss[mm_counter(page)]--; + numa_rss[page_to_nid(page)]--; page_remove_rmap(page, vma, false); if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); @@ -1484,6 +1508,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, */ WARN_ON_ONCE(!vma_is_anonymous(vma)); rss[mm_counter(page)]--; + numa_rss[page_to_nid(page)]--; if (is_device_private_entry(entry)) page_remove_rmap(page, vma, false); put_page(page); @@ -1499,13 +1524,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (!should_zap_page(details, page)) continue; rss[mm_counter(page)]--; + numa_rss[page_to_nid(page)]--; } else if (pte_marker_entry_uffd_wp(entry)) { /* Only drop the uffd-wp marker 
if explicitly requested */ if (!zap_drop_file_uffd_wp(details)) continue; } else if (is_hwpoison_entry(entry) || is_swapin_error_entry(entry)) { - if (!should_zap_cows(details)) continue; } else { /* We should have covered all the swap entry types */ @@ -1515,7 +1540,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); } while (pte++, addr += PAGE_SIZE, addr != end); - add_mm_rss_vec(mm, rss); + add_mm_rss_vec(mm, rss, numa_rss); arch_leave_lazy_mmu_mode(); /* Do the actual TLB flush before dropping ptl */ @@ -1539,6 +1564,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, goto again; } + if (unlikely(numa_rss == (int *)raw_cpu_ptr(&percpu_numa_rss))) + put_cpu_ptr(numa_rss); + else + kfree(numa_rss); return addr; } @@ -1868,7 +1897,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, return -EBUSY; /* Ok, finally just insert the thing.. */ get_page(page); - inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); + inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page), page_to_nid(page)); page_add_file_rmap(page, vma, false); set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot)); return 0; @@ -3164,11 +3193,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) if (old_page) { if (!PageAnon(old_page)) { dec_mm_counter_fast(mm, - mm_counter_file(old_page)); - inc_mm_counter_fast(mm, MM_ANONPAGES); + mm_counter_file(old_page), page_to_nid(old_page)); + inc_mm_counter_fast(mm, MM_ANONPAGES, page_to_nid(new_page)); + } else { + dec_mm_counter_fast(mm, MM_ANONPAGES, page_to_nid(old_page)); + inc_mm_counter_fast(mm, MM_ANONPAGES, page_to_nid(new_page)); } } else { - inc_mm_counter_fast(mm, MM_ANONPAGES); + inc_mm_counter_fast(mm, MM_ANONPAGES, page_to_nid(new_page)); } flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); entry = mk_pte(new_page, vma->vm_page_prot); @@ -3955,8 +3987,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (should_try_to_free_swap(page, vma, vmf->flags)) try_to_free_swap(page); - inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); - dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS); + inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES, page_to_nid(page)); + dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE); pte = mk_pte(page, vma->vm_page_prot); /* @@ -4134,7 +4166,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } - inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); + inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES, page_to_nid(page)); page_add_new_anon_rmap(page, vma, vmf->address); lru_cache_add_inactive_or_unevictable(page, vma); setpte: @@ -4275,7 +4307,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) if (write) entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); - add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR); + add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR, page_to_nid(page)); page_add_file_rmap(page, vma, true); /* @@ -4324,11 +4356,11 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) entry = pte_mkuffd_wp(pte_wrprotect(entry)); /* copy-on-write page */ if (write && !(vma->vm_flags & VM_SHARED)) { - inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); + inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES, page_to_nid(page)); page_add_new_anon_rmap(page, vma, addr); lru_cache_add_inactive_or_unevictable(page, vma); } else { - inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); + inc_mm_counter_fast(vma->vm_mm, 
mm_counter_file(page), page_to_nid(page)); page_add_file_rmap(page, vma, false); } set_pte_at(vma->vm_mm, addr, vmf->pte, entry); diff --git a/mm/migrate.c b/mm/migrate.c index e01624fcab5b..1d7aac928e7e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -258,6 +258,10 @@ static bool remove_migration_pte(struct folio *folio, /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, pvmw.address, pvmw.pte); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, + -compound_nr(old), page_to_nid(old)); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, + compound_nr(&folio->page), page_to_nid(&folio->page)); } return true; diff --git a/mm/migrate_device.c b/mm/migrate_device.c index ad593d5754cf..e17c5fbc3d2a 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -631,7 +631,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, if (userfaultfd_missing(vma)) goto unlock_abort; - inc_mm_counter(mm, MM_ANONPAGES); + inc_mm_counter(mm, MM_ANONPAGES, page_to_nid(page)); page_add_new_anon_rmap(page, vma, addr); if (!is_zone_device_page(page)) lru_cache_add_inactive_or_unevictable(page, vma); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 35ec75cdfee2..e25c37e2e90d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -227,7 +227,7 @@ long oom_badness(struct task_struct *p, unsigned long totalpages) * The baseline for the badness score is the proportion of RAM that each * task's rss, pagetable and swap space use. */ - points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) + + points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) + mm_pgtables_bytes(p->mm) / PAGE_SIZE; task_unlock(p); @@ -403,7 +403,7 @@ static int dump_task(struct task_struct *p, void *arg) task->pid, from_kuid(&init_user_ns, task_uid(task)), task->tgid, task->mm->total_vm, get_mm_rss(task->mm), mm_pgtables_bytes(task->mm), - get_mm_counter(task->mm, MM_SWAPENTS), + get_mm_counter(task->mm, MM_SWAPENTS, NUMA_NO_NODE), task->signal->oom_score_adj, task->comm); task_unlock(task); @@ -594,9 +594,9 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(tsk), tsk->comm, - K(get_mm_counter(mm, MM_ANONPAGES)), - K(get_mm_counter(mm, MM_FILEPAGES)), - K(get_mm_counter(mm, MM_SHMEMPAGES))); + K(get_mm_counter(mm, MM_ANONPAGES, NUMA_NO_NODE)), + K(get_mm_counter(mm, MM_FILEPAGES, NUMA_NO_NODE)), + K(get_mm_counter(mm, MM_SHMEMPAGES, NUMA_NO_NODE))); out_finish: trace_finish_task_reaping(tsk->pid); out_unlock: @@ -948,9 +948,9 @@ static void __oom_kill_process(struct task_struct *victim, const char *message) mark_oom_victim(victim); pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n", message, task_pid_nr(victim), victim->comm, K(mm->total_vm), - K(get_mm_counter(mm, MM_ANONPAGES)), - K(get_mm_counter(mm, MM_FILEPAGES)), - K(get_mm_counter(mm, MM_SHMEMPAGES)), + K(get_mm_counter(mm, MM_ANONPAGES, NUMA_NO_NODE)), + K(get_mm_counter(mm, MM_FILEPAGES, NUMA_NO_NODE)), + K(get_mm_counter(mm, MM_SHMEMPAGES, NUMA_NO_NODE)), from_kuid(&init_user_ns, task_uid(victim)), mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj); task_unlock(victim); diff --git a/mm/rmap.c b/mm/rmap.c index edc06c52bc82..a6e8bb3d40cc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1620,7 +1620,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, 
hugetlb_count_sub(folio_nr_pages(folio), mm); set_huge_pte_at(mm, address, pvmw.pte, pteval); } else { - dec_mm_counter(mm, mm_counter(&folio->page)); + dec_mm_counter(mm, mm_counter(&folio->page), + page_to_nid(&folio->page)); set_pte_at(mm, address, pvmw.pte, pteval); } @@ -1635,7 +1636,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * migration) will not expect userfaults on already * copied pages. */ - dec_mm_counter(mm, mm_counter(&folio->page)); + dec_mm_counter(mm, mm_counter(&folio->page), page_to_nid(&folio->page)); /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); @@ -1686,7 +1687,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, /* Invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); - dec_mm_counter(mm, MM_ANONPAGES); + dec_mm_counter(mm, MM_ANONPAGES, page_to_nid(&folio->page)); goto discard; } @@ -1739,8 +1740,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, list_add(&mm->mmlist, &init_mm.mmlist); spin_unlock(&mmlist_lock); } - dec_mm_counter(mm, MM_ANONPAGES); - inc_mm_counter(mm, MM_SWAPENTS); + dec_mm_counter(mm, MM_ANONPAGES, page_to_nid(&folio->page)); + inc_mm_counter(mm, MM_SWAPENTS, NUMA_NO_NODE); swp_pte = swp_entry_to_pte(entry); if (anon_exclusive) swp_pte = pte_swp_mkexclusive(swp_pte); @@ -1764,7 +1765,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * * See Documentation/mm/mmu_notifier.rst */ - dec_mm_counter(mm, mm_counter_file(&folio->page)); + dec_mm_counter(mm, mm_counter_file(&folio->page), + page_to_nid(&folio->page)); } discard: /* @@ -2011,7 +2013,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, hugetlb_count_sub(folio_nr_pages(folio), mm); set_huge_pte_at(mm, address, pvmw.pte, pteval); } else { - dec_mm_counter(mm, mm_counter(&folio->page)); + dec_mm_counter(mm, mm_counter(&folio->page), + page_to_nid(&folio->page)); set_pte_at(mm, address, pvmw.pte, pteval); } @@ -2026,7 +2029,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, * migration) will not expect userfaults on already * copied pages. 
*/ - dec_mm_counter(mm, mm_counter(&folio->page)); + dec_mm_counter(mm, mm_counter(&folio->page), page_to_nid(&folio->page)); /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); diff --git a/mm/swapfile.c b/mm/swapfile.c index 5c8681a3f1d9..c0485bb54954 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1791,7 +1791,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, if (unlikely(!PageUptodate(page))) { pte_t pteval; - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); + dec_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE); pteval = swp_entry_to_pte(make_swapin_error_entry(page)); set_pte_at(vma->vm_mm, addr, pte, pteval); swap_free(entry); @@ -1803,8 +1803,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, BUG_ON(!PageAnon(page) && PageMappedToDisk(page)); BUG_ON(PageAnon(page) && PageAnonExclusive(page)); - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); + dec_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE); + inc_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page)); get_page(page); if (page == swapcache) { rmap_t rmap_flags = RMAP_NONE; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 07d3befc80e4..b6581867aad6 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -127,7 +127,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, * Must happen after rmap, as mm_counter() checks mapping (via * PageAnon()), which is set by __page_set_anon_rmap(). */ - inc_mm_counter(dst_mm, mm_counter(page)); + inc_mm_counter(dst_mm, mm_counter(page), page_to_nid(page)); set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); -- 2.20.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
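
[Illustrative note, not part of the posting.] To make the `remove_migration_pte`
remark in the commit message concrete: the mm/migrate.c hunk above amounts to moving
the anon rss accounting from the old page's node to the new folio's node. A condensed
sketch (the function name is made up; the two calls mirror the actual hunk):

```c
/*
 * Sketch of the accounting added in remove_migration_pte(): the per-type
 * MM_ANONPAGES total nets out to zero (-n then +n), only the per-node
 * share moves from the source node to the node the new folio lives on.
 */
static void sketch_migrate_accounting(struct mm_struct *mm,
                                      struct page *old, struct folio *new)
{
        add_mm_counter(mm, MM_ANONPAGES, -compound_nr(old), page_to_nid(old));
        add_mm_counter(mm, MM_ANONPAGES, compound_nr(&new->page),
                       page_to_nid(&new->page));
}
```
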
* [PATCH v2 2/5] mm: add numa_count field for rss_stat 2022-07-08 8:21 [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li 2022-07-08 8:21 ` [PATCH v2 1/5] mm: add a new parameter `node` to `get/add/inc/dec_mm_counter` Gang Li @ 2022-07-08 8:21 ` Gang Li 2022-07-08 8:21 ` [PATCH v2 3/5] mm: add numa fields for tracepoint rss_stat Gang Li ` (3 subsequent siblings) 5 siblings, 0 replies; 14+ messages in thread From: Gang Li @ 2022-07-08 8:21 UTC (permalink / raw) To: mhocko, akpm, surenb Cc: hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users, Gang Li This patch add new fields `numa_count` for mm_rss_stat and task_rss_stat. `numa_count` are in the size of `sizeof(long) * num_possible_numa()`. To reduce mem consumption, they only contain the sum of rss which is needed by `oom_badness` instead of recording different kinds of rss sepratly. Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> --- include/linux/mm_types_task.h | 6 +++ kernel/fork.c | 70 +++++++++++++++++++++++++++++++++-- 2 files changed, 73 insertions(+), 3 deletions(-) diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h index 32512af31721..9fd34ab484f4 100644 --- a/include/linux/mm_types_task.h +++ b/include/linux/mm_types_task.h @@ -52,11 +52,17 @@ enum { struct task_rss_stat { int events; /* for synchronization threshold */ int count[NR_MM_COUNTERS]; +#ifdef CONFIG_NUMA + int *numa_count; +#endif }; #endif /* USE_SPLIT_PTE_PTLOCKS */ struct mm_rss_stat { atomic_long_t count[NR_MM_COUNTERS]; +#ifdef CONFIG_NUMA + atomic_long_t *numa_count; +#endif }; struct page_frag { diff --git a/kernel/fork.c b/kernel/fork.c index 23f0ba3affe5..f4f93d6fecd5 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -140,6 +140,10 @@ DEFINE_PER_CPU(unsigned long, process_counts) = 0; __cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */ +#if (defined SPLIT_RSS_COUNTING) && (defined CONFIG_NUMA) +#define SPLIT_RSS_NUMA_COUNTING +#endif + #ifdef CONFIG_PROVE_RCU int lockdep_tasklist_lock_is_held(void) { @@ -757,6 +761,16 @@ static void check_mm(struct mm_struct *mm) mm, resident_page_types[i], x); } +#ifdef CONFIG_NUMA + for (i = 0; i < num_possible_nodes(); i++) { + long x = atomic_long_read(&mm->rss_stat.numa_count[i]); + + if (unlikely(x)) + pr_alert("BUG: Bad rss-counter state mm:%p node:%d val:%ld\n", + mm, i, x); + } +#endif + if (mm_pgtables_bytes(mm)) pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n", mm_pgtables_bytes(mm)); @@ -769,6 +783,29 @@ static void check_mm(struct mm_struct *mm) #define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL)) #define free_mm(mm) (kmem_cache_free(mm_cachep, (mm))) +#ifdef CONFIG_NUMA +static inline void mm_free_rss_stat(struct mm_struct *mm) +{ + kfree(mm->rss_stat.numa_count); +} + +static inline int mm_init_rss_stat(struct mm_struct *mm) +{ + memset(&mm->rss_stat.count, 0, sizeof(mm->rss_stat.count)); + mm->rss_stat.numa_count = kcalloc(num_possible_nodes(), sizeof(atomic_long_t), GFP_KERNEL); + if (unlikely(!mm->rss_stat.numa_count)) + return -ENOMEM; + return 0; +} 
+#else +static inline void mm_free_rss_stat(struct mm_struct *mm) {} +static inline int mm_init_rss_stat(struct mm_struct *mm) +{ + memset(&mm->rss_stat.count, 0, sizeof(mm->rss_stat.count)); + return 0; +} +#endif + /* * Called when the last reference to the mm * is dropped: either by a lazy thread or by @@ -783,6 +820,7 @@ void __mmdrop(struct mm_struct *mm) destroy_context(mm); mmu_notifier_subscriptions_destroy(mm); check_mm(mm); + mm_free_rss_stat(mm); put_user_ns(mm->user_ns); mm_pasid_drop(mm); free_mm(mm); @@ -824,12 +862,22 @@ static inline void put_signal_struct(struct signal_struct *sig) free_signal_struct(sig); } +#ifdef SPLIT_RSS_NUMA_COUNTING +void rss_stat_free(struct task_struct *p) +{ + kfree(p->rss_stat.numa_count); +} +#else +void rss_stat_free(struct task_struct *p) {} +#endif + void __put_task_struct(struct task_struct *tsk) { WARN_ON(!tsk->exit_state); WARN_ON(refcount_read(&tsk->usage)); WARN_ON(tsk == current); + rss_stat_free(tsk); io_uring_free(tsk); cgroup_free(tsk); task_numa_free(tsk, true); @@ -956,6 +1004,7 @@ void set_task_stack_end_magic(struct task_struct *tsk) static struct task_struct *dup_task_struct(struct task_struct *orig, int node) { struct task_struct *tsk; + int *numa_count __maybe_unused; int err; if (node == NUMA_NO_NODE) @@ -977,9 +1026,16 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) #endif account_kernel_stack(tsk, 1); +#ifdef SPLIT_RSS_NUMA_COUNTING + numa_count = kcalloc(num_possible_nodes(), sizeof(int), GFP_KERNEL); + if (!numa_count) + goto free_stack; + tsk->rss_stat.numa_count = numa_count; +#endif + err = scs_prepare(tsk, node); if (err) - goto free_stack; + goto free_rss_stat; #ifdef CONFIG_SECCOMP /* @@ -1045,6 +1101,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) return tsk; +free_rss_stat: +#ifdef SPLIT_RSS_NUMA_COUNTING + kfree(numa_count); +#endif free_stack: exit_task_stack_account(tsk); free_thread_stack(tsk); @@ -1114,7 +1174,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm->map_count = 0; mm->locked_vm = 0; atomic64_set(&mm->pinned_vm, 0); - memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); spin_lock_init(&mm->page_table_lock); spin_lock_init(&mm->arg_lock); mm_init_cpumask(mm); @@ -1141,6 +1200,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, if (mm_alloc_pgd(mm)) goto fail_nopgd; + if (mm_init_rss_stat(mm)) + goto fail_nocontext; + if (init_new_context(p, mm)) goto fail_nocontext; @@ -2142,7 +2204,9 @@ static __latent_entropy struct task_struct *copy_process( p->io_uring = NULL; #endif -#if defined(SPLIT_RSS_COUNTING) +#ifdef SPLIT_RSS_NUMA_COUNTING + memset(&p->rss_stat, 0, sizeof(p->rss_stat) - sizeof(p->rss_stat.numa_count)); +#else memset(&p->rss_stat, 0, sizeof(p->rss_stat)); #endif -- 2.20.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
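
[Illustrative note, not part of the posting.] A quick mental model of what this patch
adds, condensed from the hunks above: each mm carries one extra atomic counter per
possible node next to the existing per-type counters. The accessor below is only an
illustration of how the array is read; it is not part of the patch.

```c
struct mm_rss_stat {
        atomic_long_t count[NR_MM_COUNTERS];    /* per type, as before */
#ifdef CONFIG_NUMA
        atomic_long_t *numa_count;              /* kcalloc(num_possible_nodes(), ...)
                                                 * in mm_init_rss_stat(), freed from
                                                 * __mmdrop() via mm_free_rss_stat() */
#endif
};

/* Illustrative accessor: the per-node rss sum later used by oom_badness(). */
static inline unsigned long sketch_mm_node_rss(struct mm_struct *mm, int nid)
{
        return atomic_long_read(&mm->rss_stat.numa_count[nid]);
}
```
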
* [PATCH v2 3/5] mm: add numa fields for tracepoint rss_stat 2022-07-08 8:21 [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li 2022-07-08 8:21 ` [PATCH v2 1/5] mm: add a new parameter `node` to `get/add/inc/dec_mm_counter` Gang Li 2022-07-08 8:21 ` [PATCH v2 2/5] mm: add numa_count field for rss_stat Gang Li @ 2022-07-08 8:21 ` Gang Li 2022-07-08 17:31 ` Steven Rostedt 2022-07-08 8:21 ` [PATCH v2 4/5] mm: enable per numa node rss_stat count Gang Li ` (2 subsequent siblings) 5 siblings, 1 reply; 14+ messages in thread From: Gang Li @ 2022-07-08 8:21 UTC (permalink / raw) To: mhocko, akpm, surenb, Steven Rostedt, Ingo Molnar Cc: hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users, Gang Li Since we add numa_count for mm->rss_stat, the tracepoint should also be modified. Now the output looks like this: ``` sleep-660 [002] 918.524333: rss_stat: mm_id=1539334265 curr=0 type=MM_NO_TYPE type_size=0B node=2 node_size=32768B diff_size=-8192B sleep-660 [002] 918.524333: rss_stat: mm_id=1539334265 curr=0 type=MM_FILEPAGES type_size=4096B node=-1 node_size=0B diff_size=-4096B sleep-660 [002] 918.524333: rss_stat: mm_id=1539334265 curr=0 type=MM_NO_TYPE type_size=0B node=1 node_size=0B diff_size=-4096B ``` Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> --- include/linux/mm.h | 9 +++++---- include/trace/events/kmem.h | 27 ++++++++++++++++++++------- mm/memory.c | 5 +++-- 3 files changed, 28 insertions(+), 13 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 84ce6e1b1252..a7150ee7439c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2041,27 +2041,28 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member, int return (unsigned long)val; } -void mm_trace_rss_stat(struct mm_struct *mm, int member, long count); +void mm_trace_rss_stat(struct mm_struct *mm, int member, long member_count, int node, + long numa_count, long diff_count); static inline void add_mm_counter(struct mm_struct *mm, int member, long value, int node) { long count = atomic_long_add_return(value, &mm->rss_stat.count[member]); - mm_trace_rss_stat(mm, member, count); + mm_trace_rss_stat(mm, member, count, NUMA_NO_NODE, 0, value); } static inline void inc_mm_counter(struct mm_struct *mm, int member, int node) { long count = atomic_long_inc_return(&mm->rss_stat.count[member]); - mm_trace_rss_stat(mm, member, count); + mm_trace_rss_stat(mm, member, count, NUMA_NO_NODE, 0, 1); } static inline void dec_mm_counter(struct mm_struct *mm, int member, int node) { long count = atomic_long_dec_return(&mm->rss_stat.count[member]); - mm_trace_rss_stat(mm, member, count); + mm_trace_rss_stat(mm, member, count, NUMA_NO_NODE, 0, -1); } /* Optimized variant when page is already known not to be PageAnon */ diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index 4cb51ace600d..7c8ad4aeb7b1 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -363,7 +363,8 @@ static unsigned int __maybe_unused mm_ptr_to_hash(const void *ptr) EM(MM_FILEPAGES) \ EM(MM_ANONPAGES) \ EM(MM_SWAPENTS) \ 
- EMe(MM_SHMEMPAGES) + EM(MM_SHMEMPAGES) \ + EMe(MM_NO_TYPE) #undef EM #undef EMe @@ -383,29 +384,41 @@ TRACE_EVENT(rss_stat, TP_PROTO(struct mm_struct *mm, int member, - long count), + long member_count, + int node, + long node_count, + long diff_count), - TP_ARGS(mm, member, count), + TP_ARGS(mm, member, member_count, node, node_count, diff_count), TP_STRUCT__entry( __field(unsigned int, mm_id) __field(unsigned int, curr) __field(int, member) - __field(long, size) + __field(long, member_size) + __field(int, node) + __field(long, node_size) + __field(long, diff_size) ), TP_fast_assign( __entry->mm_id = mm_ptr_to_hash(mm); __entry->curr = !!(current->mm == mm); __entry->member = member; - __entry->size = (count << PAGE_SHIFT); + __entry->member_size = (member_count << PAGE_SHIFT); + __entry->node = node; + __entry->node_size = (node_count << PAGE_SHIFT); + __entry->diff_size = (diff_count << PAGE_SHIFT); ), - TP_printk("mm_id=%u curr=%d type=%s size=%ldB", + TP_printk("mm_id=%u curr=%d type=%s type_size=%ldB node=%d node_size=%ldB diff_size=%ldB", __entry->mm_id, __entry->curr, __print_symbolic(__entry->member, TRACE_MM_PAGES), - __entry->size) + __entry->member_size, + __entry->node, + __entry->node_size, + __entry->diff_size) ); #endif /* _TRACE_KMEM_H */ diff --git a/mm/memory.c b/mm/memory.c index bb24da767f79..b085f368ae11 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -173,9 +173,10 @@ static int __init init_zero_pfn(void) } early_initcall(init_zero_pfn); -void mm_trace_rss_stat(struct mm_struct *mm, int member, long count) +void mm_trace_rss_stat(struct mm_struct *mm, int member, long member_count, int node, + long numa_count, long diff_count) { - trace_rss_stat(mm, member, count); + trace_rss_stat(mm, member, member_count, node, numa_count, diff_count); } #if defined(SPLIT_RSS_COUNTING) -- 2.20.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/5] mm: add numa fields for tracepoint rss_stat 2022-07-08 8:21 ` [PATCH v2 3/5] mm: add numa fields for tracepoint rss_stat Gang Li @ 2022-07-08 17:31 ` Steven Rostedt 0 siblings, 0 replies; 14+ messages in thread From: Steven Rostedt @ 2022-07-08 17:31 UTC (permalink / raw) To: Gang Li Cc: mhocko, akpm, surenb, Ingo Molnar, hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users On Fri, 8 Jul 2022 16:21:27 +0800 Gang Li <ligang.bdlg@bytedance.com> wrote: > --- a/include/trace/events/kmem.h > +++ b/include/trace/events/kmem.h > @@ -363,7 +363,8 @@ static unsigned int __maybe_unused mm_ptr_to_hash(const void *ptr) > EM(MM_FILEPAGES) \ > EM(MM_ANONPAGES) \ > EM(MM_SWAPENTS) \ > - EMe(MM_SHMEMPAGES) > + EM(MM_SHMEMPAGES) \ > + EMe(MM_NO_TYPE) > > #undef EM > #undef EMe > @@ -383,29 +384,41 @@ TRACE_EVENT(rss_stat, > > TP_PROTO(struct mm_struct *mm, > int member, > - long count), > + long member_count, > + int node, > + long node_count, > + long diff_count), > > - TP_ARGS(mm, member, count), > + TP_ARGS(mm, member, member_count, node, node_count, diff_count), > > TP_STRUCT__entry( > __field(unsigned int, mm_id) > __field(unsigned int, curr) > __field(int, member) > - __field(long, size) > + __field(long, member_size) > + __field(int, node) Please swap the node and member_size fields. I just noticed that size should have been before member in the original as well. Because it will leave a 4 byte "hole" in the event due to alignment on 64 bit machines. -- Steve > + __field(long, node_size) > + __field(long, diff_size) > ), > > TP_fast_assign( > __entry->mm_id = mm_ptr_to_hash(mm); > __entry->curr = !!(current->mm == mm); > __entry->member = member; > - __entry->size = (count << PAGE_SHIFT); > + __entry->member_size = (member_count << PAGE_SHIFT); > + __entry->node = node; > + __entry->node_size = (node_count << PAGE_SHIFT); > + __entry->diff_size = (diff_count << PAGE_SHIFT); > ), > > - TP_printk("mm_id=%u curr=%d type=%s size=%ldB", > + TP_printk("mm_id=%u curr=%d type=%s type_size=%ldB node=%d node_size=%ldB diff_size=%ldB", > __entry->mm_id, > __entry->curr, > __print_symbolic(__entry->member, TRACE_MM_PAGES), > - __entry->size) > + __entry->member_size, > + __entry->node, > + __entry->node_size, > + __entry->diff_size) > ); > #endif /* _TRACE_KMEM_H */ > ^ permalink raw reply [flat|nested] 14+ messages in thread
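
[Illustrative note, not part of the thread.] For reference, an ordering along the
lines Steve suggests, swapping `node` and `member_size`, keeps the two 32-bit fields
adjacent so the following longs stay naturally aligned on 64-bit and no 4-byte hole
is introduced. The exact layout is of course up to the follow-up revision:

```c
	TP_STRUCT__entry(
		__field(unsigned int,	mm_id)
		__field(unsigned int,	curr)
		__field(int,		member)
		__field(int,		node)		/* int next to int: no padding hole */
		__field(long,		member_size)
		__field(long,		node_size)
		__field(long,		diff_size)
	),
```
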
* [PATCH v2 4/5] mm: enable per numa node rss_stat count 2022-07-08 8:21 [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li ` (2 preceding siblings ...) 2022-07-08 8:21 ` [PATCH v2 3/5] mm: add numa fields for tracepoint rss_stat Gang Li @ 2022-07-08 8:21 ` Gang Li 2022-07-08 8:21 ` [PATCH v2 5/5] mm, oom: enable per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li 2022-07-08 8:54 ` [PATCH v2 0/5] mm, oom: Introduce " Michal Hocko 5 siblings, 0 replies; 14+ messages in thread From: Gang Li @ 2022-07-08 8:21 UTC (permalink / raw) To: mhocko, akpm, surenb Cc: hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users, Gang Li Now we have all the infrastructure ready. Modify `*_mm_counter`, `sync_mm_rss`, `add_mm_counter_fast` and `add_mm_rss_vec` to enable per numa node rss_stat count. Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> --- include/linux/mm.h | 42 +++++++++++++++++++++++++++++++++++------- mm/memory.c | 20 ++++++++++++++++++-- 2 files changed, 53 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index a7150ee7439c..4a8e10ebc729 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2028,8 +2028,18 @@ static inline bool get_user_page_fast_only(unsigned long addr, */ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member, int node) { - long val = atomic_long_read(&mm->rss_stat.count[member]); + long val; + WARN_ON(node == NUMA_NO_NODE && member == MM_NO_TYPE); + + if (node == NUMA_NO_NODE) + val = atomic_long_read(&mm->rss_stat.count[member]); + else +#ifdef CONFIG_NUMA + val = atomic_long_read(&mm->rss_stat.numa_count[node]); +#else + val = 0; +#endif #ifdef SPLIT_RSS_COUNTING /* * counter is updated in asynchronous manner and may go to minus. 
@@ -2046,23 +2056,41 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member, long member_count, int static inline void add_mm_counter(struct mm_struct *mm, int member, long value, int node) { - long count = atomic_long_add_return(value, &mm->rss_stat.count[member]); + long member_count = 0, numa_count = 0; - mm_trace_rss_stat(mm, member, count, NUMA_NO_NODE, 0, value); + if (member != MM_NO_TYPE) + member_count = atomic_long_add_return(value, &mm->rss_stat.count[member]); +#ifdef CONFIG_NUMA + if (node != NUMA_NO_NODE) + numa_count = atomic_long_add_return(value, &mm->rss_stat.numa_count[node]); +#endif + mm_trace_rss_stat(mm, member, member_count, node, numa_count, value); } static inline void inc_mm_counter(struct mm_struct *mm, int member, int node) { - long count = atomic_long_inc_return(&mm->rss_stat.count[member]); + long member_count = 0, numa_count = 0; - mm_trace_rss_stat(mm, member, count, NUMA_NO_NODE, 0, 1); + if (member != MM_NO_TYPE) + member_count = atomic_long_inc_return(&mm->rss_stat.count[member]); +#ifdef CONFIG_NUMA + if (node != NUMA_NO_NODE) + numa_count = atomic_long_inc_return(&mm->rss_stat.numa_count[node]); +#endif + mm_trace_rss_stat(mm, member, member_count, node, numa_count, 1); } static inline void dec_mm_counter(struct mm_struct *mm, int member, int node) { - long count = atomic_long_dec_return(&mm->rss_stat.count[member]); + long member_count = 0, numa_count = 0; - mm_trace_rss_stat(mm, member, count, NUMA_NO_NODE, 0, -1); + if (member != MM_NO_TYPE) + member_count = atomic_long_dec_return(&mm->rss_stat.count[member]); +#ifdef CONFIG_NUMA + if (node != NUMA_NO_NODE) + numa_count = atomic_long_dec_return(&mm->rss_stat.numa_count[node]); +#endif + mm_trace_rss_stat(mm, member, member_count, node, numa_count, -1); } /* Optimized variant when page is already known not to be PageAnon */ diff --git a/mm/memory.c b/mm/memory.c index b085f368ae11..66c8d10d36cc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -191,6 +191,14 @@ void sync_mm_rss(struct mm_struct *mm) current->rss_stat.count[i] = 0; } } +#ifdef CONFIG_NUMA + for_each_node(i) { + if (current->rss_stat.numa_count[i]) { + add_mm_counter(mm, MM_NO_TYPE, current->rss_stat.numa_count[i], i); + current->rss_stat.numa_count[i] = 0; + } + } +#endif current->rss_stat.events = 0; } @@ -198,9 +206,12 @@ static void add_mm_counter_fast(struct mm_struct *mm, int member, int val, int n { struct task_struct *task = current; - if (likely(task->mm == mm)) + if (likely(task->mm == mm)) { task->rss_stat.count[member] += val; - else +#ifdef CONFIG_NUMA + task->rss_stat.numa_count[node] += val; +#endif + } else add_mm_counter(mm, member, val, node); } #define inc_mm_counter_fast(mm, member, node) add_mm_counter_fast(mm, member, 1, node) @@ -520,6 +531,11 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss, int *numa_rss) for (i = 0; i < NR_MM_COUNTERS; i++) if (rss[i]) add_mm_counter(mm, i, rss[i], NUMA_NO_NODE); +#ifdef CONFIG_NUMA + for_each_node(i) + if (numa_rss[i] != 0) + add_mm_counter(mm, MM_NO_TYPE, numa_rss[i], i); +#endif } /* -- 2.20.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
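
[Illustrative note, not part of the posting.] With the per-node counters wired up,
the same accessor the OOM killer uses in the next patch can also be used elsewhere,
for example to dump a process's approximate per-node rss for debugging. The helper
below is hypothetical and only shows the intended read side:

```c
/* Illustrative only: print an mm's approximate rss per NUMA node, in kB. */
static void sketch_dump_numa_rss(struct mm_struct *mm)
{
        int nid;

        for_each_node(nid)
                pr_info("node %d: anon+file+shmem rss: %lu kB\n", nid,
                        get_mm_counter(mm, MM_NO_TYPE, nid) << (PAGE_SHIFT - 10));
}
```
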
* [PATCH v2 5/5] mm, oom: enable per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} 2022-07-08 8:21 [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li ` (3 preceding siblings ...) 2022-07-08 8:21 ` [PATCH v2 4/5] mm: enable per numa node rss_stat count Gang Li @ 2022-07-08 8:21 ` Gang Li 2022-07-08 8:54 ` [PATCH v2 0/5] mm, oom: Introduce " Michal Hocko 5 siblings, 0 replies; 14+ messages in thread From: Gang Li @ 2022-07-08 8:21 UTC (permalink / raw) To: mhocko, akpm, surenb Cc: hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users, Gang Li Page allocator will only alloc pages on node indicated by mempolicy or cpuset. But oom will still select bad process by total rss usage. This patch let oom only calculate rss on the given node when oc->constraint equals to CONSTRAINT_{MEMORY_POLICY,CPUSET}. With those constraint, the process with the highest memory consumption on the specific node will be killed. oom_kill dmesg now have a new column `(%d)nrss`. It looks like this: ``` [ 1471.436027] Tasks state (memory values in pages): [ 1471.438518] [ pid ] uid tgid total_vm rss (01)nrss pgtables_bytes swapents oom_score_adj name [ 1471.554703] [ 1011] 0 1011 220005 8589 1872 823296 0 0 node [ 1471.707912] [ 12399] 0 12399 1311306 1311056 262170 10534912 0 0 a.out [ 1471.712429] [ 13135] 0 13135 787018 674666 674300 5439488 0 0 a.out [ 1471.721506] [ 13295] 0 13295 597 188 0 24576 0 0 sh [ 1471.734600] oom-kill:constraint=CONSTRAINT_MEMORY_POLICY,nodemask=1,cpuset=/,mems_allowed=0-2,global_oom,task_memcg=/user.slice/user-0.slice/session-3.scope,task=a.out,pid=13135,uid=0 [ 1471.742583] Out of memory: Killed process 13135 (a.out) total-vm:3148072kB, anon-rss:2697304kB, file-rss:1360kB, shmem-rss:0kB, UID:0 pgtables:5312kB oom_score_adj:0 [ 1471.849615] oom_reaper: reaped process 13135 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB ``` Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> --- fs/proc/base.c | 6 ++++- include/linux/oom.h | 2 +- mm/oom_kill.c | 55 ++++++++++++++++++++++++++++++++++++++------- 3 files changed, 53 insertions(+), 10 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 617816168748..92075e9dca06 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -552,8 +552,12 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns, unsigned long totalpages = totalram_pages() + total_swap_pages; unsigned long points = 0; long badness; + struct oom_control oc = { + .totalpages = totalpages, + .gfp_mask = 0, + }; - badness = oom_badness(task, totalpages); + badness = oom_badness(task, &oc); /* * Special case OOM_SCORE_ADJ_MIN for all others scale the * badness value into [0, 2000] range which we have been diff --git a/include/linux/oom.h b/include/linux/oom.h index 7d0c9c48a0c5..19eaa447ac57 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -98,7 +98,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm) } long oom_badness(struct task_struct *p, - unsigned long totalpages); + struct oom_control *oc); extern bool 
out_of_memory(struct oom_control *oc); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index e25c37e2e90d..921539e29ae9 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -189,6 +189,18 @@ static bool should_dump_unreclaim_slab(void) return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru); } +static inline int get_nid_from_oom_control(struct oom_control *oc) +{ + nodemask_t *nodemask; + struct zoneref *zoneref; + + nodemask = oc->constraint == CONSTRAINT_MEMORY_POLICY + ? oc->nodemask : &cpuset_current_mems_allowed; + + zoneref = first_zones_zonelist(oc->zonelist, gfp_zone(oc->gfp_mask), nodemask); + return zone_to_nid(zoneref->zone); +} + /** * oom_badness - heuristic function to determine which candidate task to kill * @p: task struct of which task we should calculate @@ -198,7 +210,7 @@ static bool should_dump_unreclaim_slab(void) * predictable as possible. The goal is to return the highest value for the * task consuming the most memory to avoid subsequent oom failures. */ -long oom_badness(struct task_struct *p, unsigned long totalpages) +long oom_badness(struct task_struct *p, struct oom_control *oc) { long points; long adj; @@ -227,12 +239,21 @@ long oom_badness(struct task_struct *p, unsigned long totalpages) * The baseline for the badness score is the proportion of RAM that each * task's rss, pagetable and swap space use. */ - points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) + - mm_pgtables_bytes(p->mm) / PAGE_SIZE; + if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY || + oc->constraint == CONSTRAINT_CPUSET)) { + int nid_to_find_victim = get_nid_from_oom_control(oc); + + points = get_mm_counter(p->mm, -1, nid_to_find_victim) + + get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) + + mm_pgtables_bytes(p->mm) / PAGE_SIZE; + } else { + points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) + + mm_pgtables_bytes(p->mm) / PAGE_SIZE; + } task_unlock(p); /* Normalize to oom_score_adj units */ - adj *= totalpages / 1000; + adj *= oc->totalpages / 1000; points += adj; return points; @@ -338,7 +359,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) goto select; } - points = oom_badness(task, oc->totalpages); + points = oom_badness(task, oc); if (points == LONG_MIN || points < oc->chosen_points) goto next; @@ -382,6 +403,7 @@ static int dump_task(struct task_struct *p, void *arg) { struct oom_control *oc = arg; struct task_struct *task; + unsigned long node_mm_rss; if (oom_unkillable_task(p)) return 0; @@ -399,9 +421,17 @@ static int dump_task(struct task_struct *p, void *arg) return 0; } - pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n", + if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY || + oc->constraint == CONSTRAINT_CPUSET)) { + int nid_to_find_victim = get_nid_from_oom_control(oc); + + node_mm_rss = get_mm_counter(p->mm, -1, nid_to_find_victim); + } else { + node_mm_rss = 0; + } + pr_info("[%7d] %5d %5d %8lu %8lu %8lu %8ld %8lu %5hd %s\n", task->pid, from_kuid(&init_user_ns, task_uid(task)), - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), node_mm_rss, mm_pgtables_bytes(task->mm), get_mm_counter(task->mm, MM_SWAPENTS, NUMA_NO_NODE), task->signal->oom_score_adj, task->comm); @@ -422,8 +452,17 @@ static int dump_task(struct task_struct *p, void *arg) */ static void dump_tasks(struct oom_control *oc) { + int nid_to_find_victim; + + if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY || + oc->constraint == CONSTRAINT_CPUSET)) 
{ + nid_to_find_victim = get_nid_from_oom_control(oc); + } else { + nid_to_find_victim = -1; + } pr_info("Tasks state (memory values in pages):\n"); - pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n"); + pr_info("[ pid ] uid tgid total_vm rss (%02d)nrss pgtables_bytes swapents" + " oom_score_adj name\n", nid_to_find_victim); if (is_memcg_oom(oc)) mem_cgroup_scan_tasks(oc->memcg, dump_task, oc); -- 2.20.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
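The selection change in this patch comes down to swapping one term of the badness score: under CONSTRAINT_MEMORY_POLICY or CONSTRAINT_CPUSET the rss contribution is taken from the per-node counter of the node being allocated from, rather than the task's total rss. A rough stand-alone model of that decision, with made-up page counts and simplified types instead of the kernel's oom_badness(), is:

```
/* Simplified model of constrained vs. unconstrained victim scoring. */
#include <stdbool.h>
#include <stdio.h>

struct task_model {
	const char *comm;
	long rss_total;     /* pages on all nodes                */
	long rss_node[2];   /* pages per node (two-node example) */
	long swapents;
	long pgtable_pages;
};

static long badness(const struct task_model *t, bool constrained, int nid)
{
	long rss = constrained ? t->rss_node[nid] : t->rss_total;

	/* same shape as the kernel heuristic: rss + swap + page tables */
	return rss + t->swapents + t->pgtable_pages;
}

int main(void)
{
	const struct task_model tasks[] = {
		{ "a.out", 786000, { 785970,     30 }, 0, 200 },
		{ "c.out", 575000, {    400, 574600 }, 0, 150 },
	};
	int nid = 1;	/* the failing allocation was bound to node 1 */

	for (int i = 0; i < 2; i++)
		printf("%-6s global=%-7ld node%d=%ld\n", tasks[i].comm,
		       badness(&tasks[i], false, nid), nid,
		       badness(&tasks[i], true, nid));
	return 0;
}
```

With the global score a.out looks like the worst offender, but scored against node 1 only, c.out wins, which is the shift shown in the cover letter's before/after tables. The oom_score_adj normalization is left out of the sketch.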
* Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} 2022-07-08 8:21 [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li ` (4 preceding siblings ...) 2022-07-08 8:21 ` [PATCH v2 5/5] mm, oom: enable per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li @ 2022-07-08 8:54 ` Michal Hocko 2022-07-08 9:25 ` Gang Li 2022-07-12 11:12 ` Abel Wu 5 siblings, 2 replies; 14+ messages in thread From: Michal Hocko @ 2022-07-08 8:54 UTC (permalink / raw) To: Gang Li Cc: akpm, surenb, hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users On Fri 08-07-22 16:21:24, Gang Li wrote: > TLDR > ---- > If a mempolicy or cpuset is in effect, out_of_memory() will select victim > on specific node to kill. So that kernel can avoid accidental killing on > NUMA system. We have discussed this in your previous posting and an alternative proposal was to use cpusets to partition NUMA aware workloads and enhance the oom killer to be cpuset aware instead which should be a much easier solution. > Problem > ------- > Before this patch series, oom will only kill the process with the highest > memory usage by selecting process with the highest oom_badness on the > entire system. > > This works fine on UMA system, but may have some accidental killing on NUMA > system. > > As shown below, if process c.out is bind to Node1 and keep allocating pages > from Node1, a.out will be killed first. But killing a.out did't free any > mem on Node1, so c.out will be killed then. > > A lot of AMD machines have 8 numa nodes. In these systems, there is a > greater chance of triggering this problem. Please be more specific about existing usecases which suffer from the current OOM handling limitations. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 14+ messages in thread
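For context on the alternative being suggested here: partitioning NUMA-bound workloads with cpusets is an administrative step on top of cgroup v2 rather than a kernel change. A small sketch of such a partition, written in C to match the rest of the thread, follows; the group name "lc-node1", the node and CPU numbers and the pid are invented for the example, and it assumes cgroup2 is mounted at /sys/fs/cgroup with the cpuset controller available and root privileges.

```
/* Sketch: pin a latency-critical group to NUMA node 1 with the cgroup v2
 * cpuset controller.  Group name, node, cpu range and pid are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	/* allow child groups to use the cpuset controller */
	write_file("/sys/fs/cgroup/cgroup.subtree_control", "+cpuset");

	mkdir("/sys/fs/cgroup/lc-node1", 0755);
	write_file("/sys/fs/cgroup/lc-node1/cpuset.mems", "1");
	write_file("/sys/fs/cgroup/lc-node1/cpuset.cpus", "8-15");

	/* move the latency-critical service (pid 1234 here) into the group */
	write_file("/sys/fs/cgroup/lc-node1/cgroup.procs", "1234");
	return 0;
}
```

The open question in this thread is the second half: when an allocation constrained to such a group fails, victim selection today still ranks tasks by their global rss, so it can pick a task whose memory lives almost entirely on other nodes.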
* Re: Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} 2022-07-08 8:54 ` [PATCH v2 0/5] mm, oom: Introduce " Michal Hocko @ 2022-07-08 9:25 ` Gang Li 2022-07-08 9:37 ` Michal Hocko 0 siblings, 1 reply; 14+ messages in thread From: Gang Li @ 2022-07-08 9:25 UTC (permalink / raw) To: Michal Hocko Cc: akpm, surenb, hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users Oh, apologies. I just realized what you mean. I should try a "cpuset cgroup oom killer" that selects its victim from a specific cpuset cgroup. On 2022/7/8 16:54, Michal Hocko wrote: > On Fri 08-07-22 16:21:24, Gang Li wrote: > > We have discussed this in your previous posting and an alternative > proposal was to use cpusets to partition NUMA aware workloads and > enhance the oom killer to be cpuset aware instead which should be a much > easier solution. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} 2022-07-08 9:25 ` Gang Li @ 2022-07-08 9:37 ` Michal Hocko 0 siblings, 0 replies; 14+ messages in thread From: Michal Hocko @ 2022-07-08 9:37 UTC (permalink / raw) To: Gang Li Cc: akpm, surenb, hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users On Fri 08-07-22 17:25:31, Gang Li wrote: > Oh, apologies. I just realized what you mean. > > I should try a "cpuset cgroup oom killer" that selects its victim from a > specific cpuset cgroup. yes, that was the idea. Many workloads which really do care about partitioning the NUMA system tend to use cpusets. In those cases you have reasonably defined boundaries and the current OOM killer implementation is not really aware of that. The oom selection process could be enhanced/fixed to select victims from those cpusets similar to how memcg oom killer victim selection is done. There is no additional accounting required for this approach because the workload is partitioned on the cgroup level already. Maybe this is not really the best fit for all workloads but it should be reasonably simple to implement without intrusive or runtime visible changes. I am not saying per-numa accounting is wrong or a bad idea. I would just like to see a stronger justification for that and also some arguments why a simpler approach via cpusets is not viable. Does this make sense to you? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} 2022-07-08 8:54 ` [PATCH v2 0/5] mm, oom: Introduce " Michal Hocko 2022-07-08 9:25 ` Gang Li @ 2022-07-12 11:12 ` Abel Wu 2022-07-12 13:35 ` Michal Hocko 1 sibling, 1 reply; 14+ messages in thread From: Abel Wu @ 2022-07-12 11:12 UTC (permalink / raw) To: Michal Hocko, Gang Li Cc: akpm, surenb, hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users, hezhongkun.hzk Hi Michal, On 7/8/22 4:54 PM, Michal Hocko Wrote: > On Fri 08-07-22 16:21:24, Gang Li wrote: >> TLDR >> ---- >> If a mempolicy or cpuset is in effect, out_of_memory() will select victim >> on specific node to kill. So that kernel can avoid accidental killing on >> NUMA system. > > We have discussed this in your previous posting and an alternative > proposal was to use cpusets to partition NUMA aware workloads and > enhance the oom killer to be cpuset aware instead which should be a much > easier solution. > >> Problem >> ------- >> Before this patch series, oom will only kill the process with the highest >> memory usage by selecting process with the highest oom_badness on the >> entire system. >> >> This works fine on UMA system, but may have some accidental killing on NUMA >> system. >> >> As shown below, if process c.out is bind to Node1 and keep allocating pages >> from Node1, a.out will be killed first. But killing a.out did't free any >> mem on Node1, so c.out will be killed then. >> >> A lot of AMD machines have 8 numa nodes. In these systems, there is a >> greater chance of triggering this problem. > > Please be more specific about existing usecases which suffer from the > current OOM handling limitations. I was just going through the mail list and happen to see this. There is another usecase for us about per-numa memory usage. Say we have several important latency-critical services sitting inside different NUMA nodes without intersection. The need for memory of these LC services varies, so the free memory of each node is also different. Then we launch several background containers without cpuset constrains to eat the left resources. Now the problem is that there doesn't seem like a proper memory policy available to balance the usage between the nodes, which could lead to memory-heavy LC services suffer from high memory pressure and fails to meet the SLOs. It's quite appreciated if you can shed some light on this! Thanks & BR, Abel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} 2022-07-12 11:12 ` Abel Wu @ 2022-07-12 13:35 ` Michal Hocko 2022-07-12 15:00 ` Abel Wu 0 siblings, 1 reply; 14+ messages in thread From: Michal Hocko @ 2022-07-12 13:35 UTC (permalink / raw) To: Abel Wu Cc: Gang Li, akpm, surenb, hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users, hezhongkun.hzk On Tue 12-07-22 19:12:18, Abel Wu wrote: [...] > I was just going through the mail list and happen to see this. There > is another usecase for us about per-numa memory usage. > > Say we have several important latency-critical services sitting inside > different NUMA nodes without intersection. The need for memory of these > LC services varies, so the free memory of each node is also different. > Then we launch several background containers without cpuset constrains > to eat the left resources. Now the problem is that there doesn't seem > like a proper memory policy available to balance the usage between the > nodes, which could lead to memory-heavy LC services suffer from high > memory pressure and fails to meet the SLOs. I do agree that cpusets would be rather clumsy if usable at all in a scenario when you are trying to mix NUMA bound workloads with those that do not have any NUMA preferences. Could you be more specific about requirements here though? Let's say you run those latency critical services with "simple" memory policies and mix them with the other workload without any policies in place so they compete over memory. It is not really clear to me how you can achieve any reasonable QoS in such an environment. Your latency critical services will be more constrained than the non-critical ones yet they are more demanding AFAIU. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} 2022-07-12 13:35 ` Michal Hocko @ 2022-07-12 15:00 ` Abel Wu 2022-07-18 12:11 ` Michal Hocko 0 siblings, 1 reply; 14+ messages in thread From: Abel Wu @ 2022-07-12 15:00 UTC (permalink / raw) To: Michal Hocko Cc: Gang Li, akpm, surenb, hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users, hezhongkun.hzk On 7/12/22 9:35 PM, Michal Hocko Wrote: > On Tue 12-07-22 19:12:18, Abel Wu wrote: > [...] >> I was just going through the mail list and happen to see this. There >> is another usecase for us about per-numa memory usage. >> >> Say we have several important latency-critical services sitting inside >> different NUMA nodes without intersection. The need for memory of these >> LC services varies, so the free memory of each node is also different. >> Then we launch several background containers without cpuset constrains >> to eat the left resources. Now the problem is that there doesn't seem >> like a proper memory policy available to balance the usage between the >> nodes, which could lead to memory-heavy LC services suffer from high >> memory pressure and fails to meet the SLOs. > > I do agree that cpusets would be rather clumsy if usable at all in a > scenario when you are trying to mix NUMA bound workloads with those > that do not have any NUMA proferences. Could you be more specific about > requirements here though? Yes, these LC services are highly sensitive to memory access latency and bandwidth, so they are provisioned by NUMA node granule to meet their performance requirements. While on the other hand, they usually do not make full use of cpu/mem resources which increases the TCO of our IDCs, so we have to co-locate them with background tasks. Some of these LC services are memory-bound but leave much of cpu's capacity unused. In this case we hope the co-located background tasks to consume some leftover without introducing obvious mm overhead to the LC services. > > Let's say you run those latency critical services with "simple" memory > policies and mix them with the other workload without any policies in > place so they compete over memory. It is not really clear to me how can > you achieve any reasonable QoS in such an environment. Your latency > critical servises will be more constrained than the non-critical ones > yet they are more demanding AFAIU. Yes, the QoS over memory is the biggest block in the way (the other resources are relatively easier). For now, we hacked a new mpol to achieve weighted-interleave behavior to balance the memory usage across NUMA nodes, and only set memcg protections to the LC services. If the memory pressure is still high, the background tasks will be killed. Ideas? Thanks! Abel ^ permalink raw reply [flat|nested] 14+ messages in thread
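The weighted-interleave policy mentioned above is an out-of-tree hack, so its exact behaviour is not visible in this thread. The sketch below only illustrates the general idea under an assumed 3:1 weighting: spread allocations across nodes in proportion to per-node weights instead of strictly round-robin, so the node with more headroom absorbs more of the background load.

```
/* Toy model of weighted interleave: node i receives weight[i] consecutive
 * allocations per round.  Weights are made up; not the in-kernel policy. */
#include <stdio.h>

#define NR_NODES 2

static const unsigned int weight[NR_NODES] = { 3, 1 }; /* assumed 3:1 split */

static int next_node(void)
{
	static unsigned int node, used;

	if (used >= weight[node]) {
		used = 0;
		node = (node + 1) % NR_NODES;
	}
	used++;
	return node;
}

int main(void)
{
	int hits[NR_NODES] = { 0 };

	for (int i = 0; i < 1000; i++)
		hits[next_node()]++;
	printf("node0: %d, node1: %d\n", hits[0], hits[1]); /* ~750 / ~250 */
	return 0;
}
```

A real policy would also need to derive the weights from something like per-node free memory and fall back when a node runs dry, which is where it meets the OOM questions discussed in this thread.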
* Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} 2022-07-12 15:00 ` Abel Wu @ 2022-07-18 12:11 ` Michal Hocko 0 siblings, 0 replies; 14+ messages in thread From: Michal Hocko @ 2022-07-18 12:11 UTC (permalink / raw) To: Abel Wu Cc: Gang Li, akpm, surenb, hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook, rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin, jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29, brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16, Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra, xianting.tian, willy, ccross, vbabka, sujiaxun, sfr, vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy, fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm, linux-perf-users, hezhongkun.hzk On Tue 12-07-22 23:00:55, Abel Wu wrote: > > On 7/12/22 9:35 PM, Michal Hocko Wrote: > > On Tue 12-07-22 19:12:18, Abel Wu wrote: > > [...] > > > I was just going through the mail list and happen to see this. There > > > is another usecase for us about per-numa memory usage. > > > > > > Say we have several important latency-critical services sitting inside > > > different NUMA nodes without intersection. The need for memory of these > > > LC services varies, so the free memory of each node is also different. > > > Then we launch several background containers without cpuset constrains > > > to eat the left resources. Now the problem is that there doesn't seem > > > like a proper memory policy available to balance the usage between the > > > nodes, which could lead to memory-heavy LC services suffer from high > > > memory pressure and fails to meet the SLOs. > > > > I do agree that cpusets would be rather clumsy if usable at all in a > > scenario when you are trying to mix NUMA bound workloads with those > > that do not have any NUMA proferences. Could you be more specific about > > requirements here though? > > Yes, these LC services are highly sensitive to memory access latency > and bandwidth, so they are provisioned by NUMA node granule to meet > their performance requirements. While on the other hand, they usually > do not make full use of cpu/mem resources which increases the TCO of > our IDCs, so we have to co-locate them with background tasks. > > Some of these LC services are memory-bound but leave much of cpu's > capacity unused. In this case we hope the co-located background tasks > to consume some leftover without introducing obvious mm overhead to > the LC services. This are some tough requirements and I am afraid far from any typical usage. So I believe that you need a careful tunning much more than a policy which I really have hard time to imagine wrt semantic TBH. > > Let's say you run those latency critical services with "simple" memory > > policies and mix them with the other workload without any policies in > > place so they compete over memory. It is not really clear to me how can > > you achieve any reasonable QoS in such an environment. Your latency > > critical servises will be more constrained than the non-critical ones > > yet they are more demanding AFAIU. > > Yes, the QoS over memory is the biggest block in the way (the other > resources are relatively easier). For now, we hacked a new mpol to > achieve weighted-interleave behavior to balance the memory usage across > NUMA nodes, and only set memcg protections to the LC services. If the > memory pressure is still high, the background tasks will be killed. > Ideas? Thanks! 
It is not really clear from your description what the new memory policy does and what its semantics are. Memory protection (via memcg) of your sensitive workload makes sense but it would require proper settings for the background jobs as well. As soon as you hit global direct reclaim, the memory protection won't save your sensitive workload. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 14+ messages in thread
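Michal's caveat, that memcg protection only helps if the background jobs are constrained as well, maps onto a pair of cgroup v2 knobs. As a companion to the cpuset sketch earlier in the thread, and with invented group names and sizes:

```
/* Sketch: protect the LC group and cap the background group (cgroup v2).
 * Assumes cgroup2 at /sys/fs/cgroup with the memory controller enabled,
 * the two groups already created, and root privileges. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	/* best-effort reclaim protection for the latency-critical service */
	write_file("/sys/fs/cgroup/lc-node1/memory.low", "8G");

	/* throttle point and hard ceiling for the background jobs, so they
	 * hit their own limits before pushing the whole machine into
	 * global direct reclaim */
	write_file("/sys/fs/cgroup/background/memory.high", "3G");
	write_file("/sys/fs/cgroup/background/memory.max", "4G");
	return 0;
}
```

Capping the background group with memory.high/memory.max throttles it before the machine ends up in global direct reclaim, which is the situation Michal warns about above.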
Thread overview: 14+ messages (end of thread; newest: 2022-07-18 12:11 UTC)
2022-07-08 8:21 [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li
2022-07-08 8:21 ` [PATCH v2 1/5] mm: add a new parameter `node` to `get/add/inc/dec_mm_counter` Gang Li
2022-07-08 8:21 ` [PATCH v2 2/5] mm: add numa_count field for rss_stat Gang Li
2022-07-08 8:21 ` [PATCH v2 3/5] mm: add numa fields for tracepoint rss_stat Gang Li
2022-07-08 17:31 ` Steven Rostedt
2022-07-08 8:21 ` [PATCH v2 4/5] mm: enable per numa node rss_stat count Gang Li
2022-07-08 8:21 ` [PATCH v2 5/5] mm, oom: enable per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li
2022-07-08 8:54 ` [PATCH v2 0/5] mm, oom: Introduce " Michal Hocko
2022-07-08 9:25 ` Gang Li
2022-07-08 9:37 ` Michal Hocko
2022-07-12 11:12 ` Abel Wu
2022-07-12 13:35 ` Michal Hocko
2022-07-12 15:00 ` Abel Wu
2022-07-18 12:11 ` Michal Hocko