From: Gang Li <ligang.bdlg@bytedance.com>
To: Michal Hocko <mhocko@suse.com>
Cc: akpm@linux-foundation.org, songmuchun@bytedance.com,
hca@linux.ibm.com, gor@linux.ibm.com, agordeev@linux.ibm.com,
borntraeger@linux.ibm.com, svens@linux.ibm.com,
ebiederm@xmission.com, keescook@chromium.org,
viro@zeniv.linux.org.uk, rostedt@goodmis.org, mingo@redhat.com,
peterz@infradead.org, acme@kernel.org, mark.rutland@arm.com,
alexander.shishkin@linux.intel.com, jolsa@kernel.org,
namhyung@kernel.org, david@redhat.com, imbrenda@linux.ibm.com,
apopple@nvidia.com, adobriyan@gmail.com,
stephen.s.brennan@oracle.com, ohoono.kwon@samsung.com,
haolee.swjtu@gmail.com, kaleshsingh@google.com,
zhengqi.arch@bytedance.com, peterx@redhat.com,
shy828301@gmail.com, surenb@google.com, ccross@google.com,
vincent.whitchurch@axis.com, tglx@linutronix.de,
bigeasy@linutronix.de, fenghua.yu@intel.com,
linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-perf-users@vger.kernel.org
Subject: Re: Re: [PATCH 0/5 v1] mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
Date: Wed, 15 Jun 2022 18:13:12 +0800
Message-ID: <0e27425e-1fb6-bc7c-9845-71dc805897c3@bytedance.com>
In-Reply-To: <YoJ/ioXwGTdCywUE@dhcp22.suse.cz>
Hi, I've done some benchmarking in the last few days.
On 2022/5/17 00:44, Michal Hocko wrote:
> Sorry, I have only now found this email thread. The limitation of the
> NUMA constrained oom is well known and long standing. Basically the
> whole thing is a best effort as we are lacking per numa node memory
> stats. I can see that you are trying to fill up that gap but this is
> not really free. Have you measured the runtime overhead? Accounting is
> done in very performance sensitive paths and it would be rather
> unfortunate to make everybody pay the overhead while binding to a
> specific node or sets of nodes is not the most common usecase.
## CPU consumption
According to the Unixbench results, there is less than one percent
performance loss in most cases.
Tested on a 40c512g (40-core, 512 GB) machine.
40 parallel copies of tests:
+----------+----------+-----+----------+---------+---------+---------+
| numastat | FileCopy | ... | Pipe | Fork | syscall | total |
+----------+----------+-----+----------+---------+---------+---------+
| off | 2920.24 | ... | 35926.58 | 6980.14 | 2617.18 | 8484.52 |
| on | 2919.15 | ... | 36066.07 | 6835.01 | 2724.82 | 8461.24 |
| overhead | 0.04% | ... | -0.39% | 2.12% | -3.95% | 0.28% |
+----------+----------+-----+----------+---------+---------+---------+
1 parallel copy of tests:
+----------+----------+-----+---------+--------+---------+---------+
| numastat | FileCopy | ... | Pipe | Fork | syscall | total |
+----------+----------+-----+---------+--------+---------+---------+
| off | 1515.37 | ... | 1473.97 | 546.88 | 1152.37 | 1671.2 |
| on | 1508.09 | ... | 1473.75 | 532.61 | 1148.83 | 1662.72 |
| overhead | 0.48% | ... | 0.01% | 2.68% | 0.31% | 0.51% |
+----------+----------+-----+---------+--------+---------+---------+
## MEM consumption
per task_struct:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes
per mm_struct:
sizeof(atomic_long_t) * num_possible_nodes() + sizeof(atomic_long_t*)
typically 8 * 2 + 8 bytes
zap_pte_range:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes
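For reference, a rough sketch of the data layout these numbers describe
(field placement is my shorthand; only the numa_count name comes from the
series' mm->rss_stat.numa_count):
```
/* Illustrative only: per-node counters hung off the existing rss_stat
 * structures, matching the sizes listed above. */
struct task_rss_stat {
        int events;
        int count[NR_MM_COUNTERS];
        int *numa_count;                /* sizeof(int) * num_possible_nodes() */
};

struct mm_rss_stat {
        atomic_long_t count[NR_MM_COUNTERS];
        atomic_long_t *numa_count;      /* sizeof(atomic_long_t) * num_possible_nodes() */
};
```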
> Also have you tried to have a look at cpusets? Those should be easier to
> make a proper selection as it should be possible to iterate over tasks
> belonging to a specific cpuset much more easily - essentially something
> similar to memcg oom killer. We do not do that right now and by a very
> brief look at the CONSTRAINT_CPUSET it seems that this code is not
> really doing much these days. Maybe that would be a more appropriate way
> to deal with more precise node aware oom killing?
It looks like both CONSTRAINT_MEMORY_POLICY and CONSTRAINT_CPUSET can
be used for node-aware oom killing.
I think we can calculate badness this way:
If constraint == CONSTRAINT_MEMORY_POLICY, sum the per-node counters over `nodemask`.
If constraint == CONSTRAINT_CPUSET, sum the per-node counters over `mems_allowed`.
Example code:
```
long oom_badness(struct task_struct *p, struct oom_control *oc)
{
        long points = 0;
        int nid;
        ...
        if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY)) {
                for_each_node_mask(nid, *oc->nodemask)
                        points += get_mm_counter(p->mm, -1, nid);
        } else if (unlikely(oc->constraint == CONSTRAINT_CPUSET)) {
                for_each_node_mask(nid, cpuset_current_mems_allowed)
                        points += get_mm_counter(p->mm, -1, nid);
        } else {
                points = get_mm_rss(p->mm);
        }
        points += get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) +
                  mm_pgtables_bytes(p->mm) / PAGE_SIZE;
        ...
}
```
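The -1 member above presumably stands for "no particular type" (the
MM_NO_TYPE placeholder discussed further down), i.e. take the per-node
total. A minimal sketch of how a node-aware get_mm_counter() could serve
both call sites, assuming numa_count[] keeps one total per node; this is
my illustration, not code from the series:
```
/*
 * Minimal sketch, not the actual implementation: a negative member
 * (MM_NO_TYPE) selects the per-node total, a non-negative member keeps
 * the existing node-agnostic per-type semantics.
 */
static inline unsigned long get_mm_counter(struct mm_struct *mm,
                                           int member, int node)
{
        long val = 0;

        if (member >= 0)
                val = atomic_long_read(&mm->rss_stat.count[member]);
        else if (node != NUMA_NO_NODE)
                val = atomic_long_read(&mm->rss_stat.numa_count[node]);

        return val < 0 ? 0 : (unsigned long)val;
}
```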
>
> [...]
>> 21 files changed, 317 insertions(+), 111 deletions(-)
>
> The code footprint is not free either. And more importantly does this
> even work much more reliably? I can see quite some NUMA_NO_NODE
> accounting (e.g. copy_pte_range!). Is this somehow fixable?
> Also how do those numbers add up. Let's say you increase the counter as
> NUMA_NO_NODE but later on during the clean up you decrease based on the
> page node?
> Last but not least I am really not following MM_NO_TYPE concept. I can
> only see add_mm_counter users without any decrements. What is going on
> there?
There are two usage scenarios of NUMA_NO_NODE in this patch.
1. As a placeholder when pages are swapped out to the swapfile and back in:
```
// mem to swapfile
dec_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page));
inc_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE);
// swapfile to mem
inc_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page));
dec_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE);
```
In *_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE),
NUMA_NO_NODE is a placeholder: it means the page no longer resides on
any node.
2. As a placeholder in `add_mm_rss_vec` and `sync_mm_rss` for per-process
mm counter synchronization when SPLIT_RSS_COUNTING is enabled.
MM_NO_TYPE is also a placeholder in `*_mm_counter`, `add_mm_rss_vec` and
`sync_mm_rss`.
These placeholders are rather awkward. Maybe I should introduce a helper
function for the mm->rss_stat.numa_count accounting instead of using
placeholders.
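For instance, something along these lines might read better than passing
NUMA_NO_NODE/MM_NO_TYPE placeholders around (helper names are invented
here for illustration, not taken from the patches):
```
/* Hypothetical helpers for mm->rss_stat.numa_count accounting. */
static inline void add_mm_numa_count(struct mm_struct *mm, int node, long value)
{
        atomic_long_add(value, &mm->rss_stat.numa_count[node]);
}

static inline void inc_mm_numa_count(struct mm_struct *mm, int node)
{
        add_mm_numa_count(mm, node, 1);
}

static inline void dec_mm_numa_count(struct mm_struct *mm, int node)
{
        add_mm_numa_count(mm, node, -1);
}
```
Swap-out would then be dec_mm_counter(mm, MM_ANONPAGES) +
dec_mm_numa_count(mm, page_to_nid(page)) + inc_mm_counter(mm, MM_SWAPENTS),
leaving the existing two-argument *_mm_counter interface untouched.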