From: Oleksandr Natalenko <oleksandr@natalenko.name>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: cgel.zte@gmail.com, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, corbet@lwn.net,
xu xin <xu.xin16@zte.com.cn>, Yang Yang <yang.yang29@zte.com.cn>,
Ran Xiaokai <ran.xiaokai@zte.com.cn>,
wangyong <wang.yong12@zte.com.cn>,
Yunkai Zhang <zhang.yunkai@zte.com.cn>,
Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH v6] mm/ksm: introduce ksm_force for each process
Date: Fri, 13 May 2022 22:53:16 +0200 [thread overview]
Message-ID: <1817008.tdWV9SEqCh@natalenko.name> (raw)
In-Reply-To: <20220513133210.9dd0a4216bd8baaa1047562c@linux-foundation.org>
Hello.
On pátek 13. května 2022 22:32:10 CEST Andrew Morton wrote:
> On Fri, 13 May 2022 11:51:53 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> > On pátek 13. května 2022 0:37:53 CEST Andrew Morton wrote:
> > > On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> > >
> > > > > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> > > > > of this mm to be involved in KSM scanning without explicitly calling
> > > > > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> > > > > the klob of /sys/kernel/mm/ksm/run is set as 1.
> > > > >
> > > > > If ksm_force is set to 0, cancel the feature of ksm_force of this
> > > > > process (fallback to the default state) and unmerge those merged pages
> > > > > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> > > > > but still leave MADV_MERGEABLE areas merged.
> > > >
> > > > To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still).
> > > >
> > > > So, what changed since that discussion?
> > > >
> > > > P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach.
> > >
> > > Why are you patching the kernel with a new syscall rather than using
> > > process_madvise()?
> >
> > Because I'm not sure how to use `process_madvise()` to achieve $subj properly.
> >
> > The objective is to mark all the eligible VMAs of the target task for KSM to consider them for merging.
> >
> > For that, all the eligible VMAs have to be traversed.
> >
> > Given `process_madvise()` has got an iovec API, this means the process that will call `process_madvise()` has to know the list of VMAs of the target process. In order to traverse them in a race-free manner the target task has to be SIGSTOP'ped or frozen, then the list of VMAs has to be obtained, then `process_madvise()` has to be called, and the the target task can continue. This is:
> >
> > a) superfluous (the kernel already knows the list of VMAs of the target tasks, why proxy it through the userspace then?); and
> > b) may induce more latency than needed because the target task has to be stopped to avoid races.
>
> OK. And what happens to new vmas that the target process creates after
> the process_madvise()?
Call `process_madvise()` on them again. And do that again. Regularly, with some intervals. Use some daemon for that (like [1]).
[1] https://gitlab.com/post-factum/uksmd
> > OTOH, IIUC, even if `MADV_MERGEABLE` is allowed for `process_madvise()`,
>
> Is it not?
It is not:
```
1158 static bool
1159 process_madvise_behavior_valid(int behavior)
1160 {
1161 switch (behavior) {
1162 case MADV_COLD:
1163 case MADV_PAGEOUT:
1164 case MADV_WILLNEED:
1165 return true;
1166 default:
1167 return false;
1168 }
1169 }
```
Initially, when `process_madvise()` stuff was being prepared for merging, I tried to enabled it but failed [2], and it was decided [3] to move forward without it.
[2] https://lore.kernel.org/linux-api/34f812b8-df54-eaad-5cf0-335f07da55c6@suse.cz/
[3] https://lore.kernel.org/lkml/20200623085944.cvob63vrv54fo7cs@butterfly.localdomain/
> > I cannot just call it like this:
> >
> > ```
> > iovec.iov_base = 0;
> > iovec.iov_len = ~0ULL;
> > process_madvise(pidfd, &iovec, 1, MADV_MERGEABLE, 0);
> > ```
> >
> > to cover the whole address space because iovec expects total size to be under ssize_t.
> >
> > Or maybe there's no need to cover the whole address space, only the lower half of it?
>
> Call process_madvise() twice, once for each half?
And still get `-ENOMEM`?
```
1191 /*
1192 * If the interval [start,end) covers some unmapped address
1193 * ranges, just ignore them, but return -ENOMEM at the end.
1194 * - different from the way of handling in mlock etc.
1195 */
```
I mean, it probably will work, and probably having the error returned is fine, but generally speaking an error value should hint that something is not being done right.
> > Or maybe there's another way of doing things, and I just look stupid and do not understand how this is supposed to work?..
> >
> > I'm more than happy to read your comments on this.
> >
>
> I see the problem. I do like the simplicity of the ksm_force concept.
> Are there alternative ideas?
I do like it too. I wonder what to do with older concerns [4] [5] regarding presenting such an API.
[4] https://lore.kernel.org/lkml/20190516172452.GA2106@avx2/
[5] https://lore.kernel.org/lkml/20190515145151.GG16651@dhcp22.suse.cz/
--
Oleksandr Natalenko (post-factum)
next prev parent reply other threads:[~2022-05-13 20:53 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-05-10 12:22 [PATCH v6] mm/ksm: introduce ksm_force for each process cgel.zte
2022-05-10 13:30 ` Oleksandr Natalenko
2022-05-11 3:12 ` CGEL
2022-05-12 22:37 ` Andrew Morton
2022-05-13 9:51 ` Oleksandr Natalenko
2022-05-13 20:32 ` Andrew Morton
2022-05-13 20:53 ` Oleksandr Natalenko [this message]
2022-05-12 21:07 ` Matthew Wilcox
2022-05-13 6:50 ` CGEL
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1817008.tdWV9SEqCh@natalenko.name \
--to=oleksandr@natalenko.name \
--cc=akpm@linux-foundation.org \
--cc=cgel.zte@gmail.com \
--cc=corbet@lwn.net \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ran.xiaokai@zte.com.cn \
--cc=wang.yong12@zte.com.cn \
--cc=willy@infradead.org \
--cc=xu.xin16@zte.com.cn \
--cc=yang.yang29@zte.com.cn \
--cc=zhang.yunkai@zte.com.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.