From: Michal Hocko <mhocko@kernel.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
Date: Wed, 26 Jul 2017 07:45:33 +0200 [thread overview]
Message-ID: <20170726054533.GA960@dhcp22.suse.cz> (raw)
In-Reply-To: <20170725182619.GQ29716@redhat.com>
On Tue 25-07-17 20:26:19, Andrea Arcangeli wrote:
> On Tue, Jul 25, 2017 at 05:45:14PM +0200, Michal Hocko wrote:
> > That problem is real though as reported by David.
>
> I'm not against fixing it, I just think it's not a major concern, and
> the solution doesn't seem optimal as measured by Kirill.
>
> I'm just skeptical it's the best to solve that tiny race, 99.9% of the
> time such down_write is unnecessary.
>
> > it is not only about exit_mmap. __mmput calls into exit_aio and that can
> > wait for completion and there is no way to guarantee this will finish in
> > finite time.
>
> exit_aio blocking is actually the only good point for wanting this
> concurrency where exit_mmap->unmap_vmas and
> oom_reap_task->unmap_page_range have to run concurrently on the same
> mm.
Yes, exit_aio is the only blocking call I know of currently. But I would
like this to be as robust as possible and so I do not want to rely on
the current implementation. This can change in future and I can
guarantee that nobody will think about the oom path when adding
something to the final __mmput path.
> exit_mmap would have no issue, if there was enough time in the
> lifetime CPU to allocate the memory, sure the memory will also be
> freed in finite amount of time by exit_mmap.
I am not sure I understand. Say that any call prior to unmap_vmas blocks
on a lock which is held by another call path which cannot proceed with
the allocation...
> In fact you mentioned multiple OOM in the NUMA case, exit_mmap may not
> solve that, so depending on the runtime it may have been better not to
> wait all memory of the process to be freed before moving to the next
> task, but only a couple of seconds before the OOM reaper moves to a
> new candidate. Again this is only a tradeoff between solving the OOM
> faster vs risk of false positives OOM.
I really do not want to rely on any timing. This just too fragile. Once
we have killed a task then we shouldn't pick another victim until it
passed exit_mmap or the oom_reaper did its job. Otherwise we just risk
false positives while we have already disrupted the workload.
> If it wasn't because of exit_aio (which may have to wait I/O
> completion), changing the OOM reaper to return "false" if
> mmget_not_zero returns zero and MMF_OOM_SKIP is not set yet, would
> have been enough (and depending on the runtime it may have solved OOM
> faster in NUMA) and there would be absolutely no need to run OOM
> reaper and exit_mmap concurrently on the same mm. However there's such
> exit_aio..
>
> Raw I/O mempools never require memory allocations, although aio if it
> involves a filesystem to complete may run into filesystem or buffering
> locks which are known to loop forever or depend on other tasks stuck
> in kernel allocations, so I didn't go down that chain too long.
Exactly. We simply cannot assume anything here because veryfying this
basically impossible.
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..615133762b99 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
> /* Use -1 here to ensure all VMAs in the mm are unmapped */
> unmap_vmas(&tlb, vma, 0, -1);
>
> + if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> + /* wait the OOM reaper to stop working on this mm */
> + down_write(&mm->mmap_sem);
> + up_write(&mm->mmap_sem);
> + }
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> tlb_finish_mmu(&tlb, 0, -1);
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..2a7000995784 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -471,6 +471,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> struct mmu_gather tlb;
> struct vm_area_struct *vma;
> bool ret = true;
> + bool mmgot = true;
>
> /*
> * We have to make sure to not race with the victim exit path
> @@ -500,9 +501,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> * and delayed __mmput doesn't matter that much
> */
> if (!mmget_not_zero(mm)) {
> - up_read(&mm->mmap_sem);
> trace_skip_task_reaping(tsk->pid);
> - goto unlock_oom;
> + /*
> + * MMF_OOM_SKIP is set by exit_mmap when the OOM
> + * reaper can't work on the mm anymore.
> + */
> + if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> + up_read(&mm->mmap_sem);
> + goto unlock_oom;
> + }
> + mmgot = false;
> }
This will work more or less the same to what we have currently.
[victim] [oom reaper] [oom killer]
do_exit __oom_reap_task_mm
mmput
__mmput
mmget_not_zero
test_and_set_bit(MMF_OOM_SKIP)
oom_evaluate_task
# select next victim
# reap the mm
unmap_vmas
so we can select a next victim while the current one is still not
completely torn down.
> > > 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
> > > mmap_sem and then the arch code will release the mmap_sem despite
> > > it was already released by handle_mm_fault? Anonymous memory faults
> > > aren't common to return VM_FAULT_RETRY but an userfault
> > > can. Shouldn't there be a block that prevents overwriting if
> > > VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> > >
> > > if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > ret = VM_FAULT_SIGBUS;
> >
> > I am not sure I understand what you mean and how this is related to the
> > patch?
>
> It's not related to the patch but it involves the OOM reaper as it
> only happens when MMF_UNSTABLE is set which is set only by the OOM
> reaper. I was simply reading the OOM reaper code and following up what
> MMF_UNSTABLE does and it ringed a bell.
I hope 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom
reaped memory") will clarify this code. If not please start a new thread
so that we do not conflate different things together.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-07-26 5:45 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-07-24 7:23 [PATCH] mm, oom: allow oom reaper to race with exit_mmap Michal Hocko
2017-07-24 14:00 ` Kirill A. Shutemov
2017-07-24 14:15 ` Michal Hocko
2017-07-24 14:51 ` Kirill A. Shutemov
2017-07-24 16:11 ` Michal Hocko
2017-07-25 14:17 ` Kirill A. Shutemov
2017-07-25 14:26 ` Michal Hocko
2017-07-25 15:07 ` Kirill A. Shutemov
2017-07-25 15:15 ` Michal Hocko
2017-07-25 14:26 ` Michal Hocko
2017-07-25 15:17 ` Kirill A. Shutemov
2017-07-25 15:23 ` Michal Hocko
2017-07-25 15:31 ` Kirill A. Shutemov
2017-07-25 16:04 ` Michal Hocko
2017-07-25 19:19 ` Andrea Arcangeli
2017-07-26 5:45 ` Michal Hocko
2017-07-26 16:29 ` Andrea Arcangeli
2017-07-26 16:43 ` Andrea Arcangeli
2017-07-27 6:50 ` Michal Hocko
2017-07-27 14:55 ` Andrea Arcangeli
2017-07-28 6:23 ` Michal Hocko
2017-07-28 1:58 ` [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run kbuild test robot
2017-08-15 0:20 ` [PATCH] mm, oom: allow oom reaper to race with exit_mmap David Rientjes
2017-07-24 15:27 ` Michal Hocko
2017-07-24 16:42 ` kbuild test robot
2017-07-24 18:12 ` Michal Hocko
2017-07-25 15:26 ` Andrea Arcangeli
2017-07-25 15:45 ` Michal Hocko
2017-07-25 18:26 ` Andrea Arcangeli
2017-07-26 5:45 ` Michal Hocko [this message]
2017-07-26 16:39 ` Andrea Arcangeli
2017-07-27 6:32 ` Michal Hocko
-- strict thread matches above, loose matches on Subject: below --
2017-08-10 8:16 Michal Hocko
2017-08-10 18:05 ` Andrea Arcangeli
2017-08-10 18:51 ` Michal Hocko
2017-08-10 20:36 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170726054533.GA960@dhcp22.suse.cz \
--to=mhocko@kernel.org \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=hughd@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=oleg@redhat.com \
--cc=penguin-kernel@I-love.SAKURA.ne.jp \
--cc=rientjes@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).