From: Michal Hocko <mhocko@suse.com>
To: Suren Baghdasaryan <surenb@google.com>
Cc: linux-api@vger.kernel.org, linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Rientjes <rientjes@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <guro@fb.com>, Rik van Riel <riel@surriel.com>,
	Minchan Kim <minchan@kernel.org>,
	Christian Brauner <christian@brauner.io>,
	Oleg Nesterov <oleg@redhat.com>,
	Tim Murray <timmurray@google.com>,
	kernel-team <kernel-team@android.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC]: userspace memory reaping
Date: Wed, 14 Oct 2020 14:09:37 +0200
Message-ID: <20201014120937.GC4440@dhcp22.suse.cz>
In-Reply-To: <CAJuCfpGjuUz5FPpR5iQ7oURJAhnP1ffBAnERuTUp9uPxQCRhDg@mail.gmail.com>

[Sorry for the late reply]

On Mon 14-09-20 17:45:44, Suren Baghdasaryan wrote:
> + linux-kernel@vger.kernel.org
> 
> On Mon, Sep 14, 2020 at 5:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Last year I sent an RFC about using oom-reaper while killing a
> > process: https://patchwork.kernel.org/cover/10894999. During LSFMM2019
> > discussion https://lwn.net/Articles/787217 a couple of alternative
> > options were discussed with the most promising one (outlined in the
> > last paragraph of https://lwn.net/Articles/787217) suggesting to use a
> > remote version of madvise(MADV_DONTNEED) operation to force memory
> > reclaim of a killed process. With process_madvise() making its way
> > through reviews (https://patchwork.kernel.org/patch/11747133/), I
> > would like to revive this discussion and get feedback on several
> > possible options, their pros and cons.

Thanks for reviving this!

> > The need is similar to why oom-reaper was introduced - when a process
> > is being killed to free memory we want to make sure memory is freed
> > even if the victim is in uninterruptible sleep or is busy and reaction
> > to SIGKILL is delayed by an unpredictable amount of time. I
> > experimented with enabling process_madvise(MADV_DONTNEED) operation
> > and using it to force memory reclaim of the target process after
> > sending SIGKILL. Unfortunately this approach requires the caller to
> > read /proc/pid/maps to extract the list of VMAs to pass as an input to
> > process_madvise().

Well, I would argue that this is not really necessary. You can simply
call process_madvise with the full address range and let the kernel
operate only on ranges which are safe to tear down asynchronously.
Sure, that would require some changes to the existing code so that it
does not fail when the range contains incompatible vmas, but that
should be possible. If we are worried about backward compatibility then
a dedicated flag could opt in to the new behavior.
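
A rough userspace sketch of such a single-range call, for illustration
only. It assumes a kernel that accepts MADV_DONTNEED in process_madvise()
and skips incompatible vmas instead of failing, which is the change being
discussed here and not existing behavior. ASSUMED_USER_RANGE_END (a 47-bit
user address space), both helper names, and the fallback syscall number
for older headers are assumptions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_process_madvise
#define SYS_process_madvise 440 /* assumption: headers predate the syscall */
#endif

/* Assumption: 47-bit user address space (typical x86-64 layout). */
#define ASSUMED_USER_RANGE_END ((uint64_t)1 << 47)

/* Hypothetical helper: one iovec covering the whole user address range. */
static struct iovec full_range_iovec(void)
{
    struct iovec iov;

    /* The 47-bit length only fits in a 64-bit size_t. */
    assert(sizeof(size_t) >= 8);

    iov.iov_base = NULL;                          /* start at address 0 */
    iov.iov_len = (size_t)ASSUMED_USER_RANGE_END; /* up to the assumed end */
    return iov;
}

/*
 * Hypothetical helper: ask the kernel to drop whatever it safely can from
 * the target. Returns the raw syscall result: the number of bytes advised,
 * or -1 with errno set (expected on kernels without the proposed change).
 */
static long reap_full_range(int pidfd)
{
    struct iovec iov = full_range_iovec();

    return syscall(SYS_process_madvise, (long)pidfd, &iov, 1L,
                   (long)MADV_DONTNEED, 0L);
}
```

The caller would send SIGKILL first and then call reap_full_range() on a
pidfd of the victim. No /proc/pid/maps parsing is involved.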

[...]

> > While the objective is to guarantee forward progress even when the
> > victim cannot terminate, we still want this mechanism to be efficient
> > because we perform these operations to relieve memory pressure before
> > it affects user experience.
> >
> > Alternative options I would like your feedback on are:
> > 1. Introduce a dedicated process_madvise(MADV_DONTNEED_MM)
> > specifically for this case to indicate that the whole mm can be freed.

This shouldn't be any different from madvise on the full address range,
right?

> > 2. A new syscall to efficiently obtain a vector of VMAs (start,
> > length, flags) of the process instead of reading /proc/pid/maps. The
> > size of the vector is still limited by UIO_MAXIOV (1024), so several
> > calls might be needed to query a larger number of VMAs, however it will
> > still be an order of magnitude more efficient than reading
> > /proc/pid/maps file in 4K or smaller chunks.

While this might be interesting for other use cases - userspace memory
management in general - I do not think it is directly related to this
particular feature.
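
For context, the per-line work that reading /proc/pid/maps implies today
can be sketched as below. This is a minimal illustration, not code from
the RFC: struct vma_info and parse_maps_line() are hypothetical names, and
only the address range and permission fields of a maps line are parsed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record mirroring the (start, length, flags) tuple. */
struct vma_info {
    unsigned long start;
    unsigned long length;
    char perms[5]; /* e.g. "r-xp" plus the terminating NUL */
};

/*
 * Parse one line of /proc/<pid>/maps, which begins with
 * "start-end perms ...". Returns 0 on success, -1 on malformed input.
 */
static int parse_maps_line(const char *line, struct vma_info *out)
{
    unsigned long start, end;
    char perms[5] = "";

    assert(line != NULL && out != NULL);

    if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
        return -1;
    if (end < start)
        return -1;

    out->start = start;
    out->length = end - start;
    memcpy(out->perms, perms, sizeof(perms));
    return 0;
}
```

A reaper would run this over every line of the file and build the iovec
array from the results, which is the overhead option 2 tries to avoid.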

> > 3. Use process_madvise() flags parameter to indicate a bulk operation
> > which ignores input vectors. Sample usage: process_madvise(pidfd,
> > MADV_DONTNEED, vector=NULL, vlen=0, flags=PMADV_FLAG_FILE |
> > PMADV_FLAG_ANON);

Similar to above.

> > 4. madvise()/process_madvise() handle gaps between VMAs, so we could
> > provide one vector element spanning the entire address space. There
> > are technical issues with this approach (process_madvise return value
> > can't handle such a large number of bytes and there is MAX_RW_COUNT
> > limit on max number of bytes one process_madvise call can handle) but
> > I would still like to hear opinions about it. If this option is
> > preferable maybe we can deal with these limitations.

To be really honest, the more I think about remote MADV_DONTNEED the
less I like it. Sure, we can limit this functionality to killed tasks,
but there is still a need for something like MMF_UNSTABLE, which the
current oom reaper sets to prevent memory corruption while the victim is
still running in the kernel. A userspace memory reaper would need
something similar.
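
To give a sense of scale for the MAX_RW_COUNT limit raised in option 4,
here is a small back-of-the-envelope calculation. The 4 KiB page size and
the 47-bit user address space are assumptions, and the helper names are
hypothetical. MAX_RW_COUNT is INT_MAX & PAGE_MASK in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumptions: 4 KiB pages and a 47-bit user address space. */
#define ASSUMED_PAGE_SIZE 4096ULL
#define ASSUMED_USER_RANGE ((uint64_t)1 << 47)

/* MAX_RW_COUNT (INT_MAX & PAGE_MASK) for the assumed page size. */
static uint64_t assumed_max_rw_count(void)
{
    return (uint64_t)INT_MAX & ~(ASSUMED_PAGE_SIZE - 1);
}

/* Number of calls needed to cover `total` bytes at `per_call` bytes each. */
static uint64_t calls_needed(uint64_t total, uint64_t per_call)
{
    assert(per_call > 0);
    return total / per_call + (total % per_call != 0);
}
```

Under these assumptions each call covers at most 0x7ffff000 bytes, so
calls_needed(ASSUMED_USER_RANGE, assumed_max_rw_count()) comes out at
65537 calls to walk the whole range with the limit unchanged.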

I do have a vague recollection that we have discussed a kill(2) based
approach as well in the past. Essentially a SIG_KILL_SYNC which would
not only send the signal but would also start a teardown of resources
owned by the task - at least those we can remove safely. The interface
would be much simpler and less tricky to use. You just make your
userspace oom killer, or potentially other users, send SIG_KILL_SYNC.
That will be more expensive, but you would at least know that as many
resources have been freed as the kernel can afford at the moment.
-- 
Michal Hocko
SUSE Labs
