From: Andrea Arcangeli <aarcange@redhat.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
Date: Wed, 26 Jul 2017 18:29:12 +0200 [thread overview]
Message-ID: <20170726162912.GA29716@redhat.com> (raw)
In-Reply-To: <20170726054557.GB960@dhcp22.suse.cz>
On Wed, Jul 26, 2017 at 07:45:57AM +0200, Michal Hocko wrote:
> On Tue 25-07-17 21:19:52, Andrea Arcangeli wrote:
> > On Tue, Jul 25, 2017 at 06:04:00PM +0200, Michal Hocko wrote:
> > > - down_write(&mm->mmap_sem);
> > > + if (tsk_is_oom_victim(current))
> > > + down_write(&mm->mmap_sem);
> > > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > tlb_finish_mmu(&tlb, 0, -1);
> > >
> > > @@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
> > > }
> > > mm->mmap = NULL;
> > > vm_unacct_memory(nr_accounted);
> > > - up_write(&mm->mmap_sem);
> > > + if (tsk_is_oom_victim(current))
> > > + up_write(&mm->mmap_sem);
> >
> > How is this possibly safe? mark_oom_victim can run while exit_mmap is
> > running.
>
> I believe it cannot. We always call mark_oom_victim (on !current) with
> task_lock held and check task->mm != NULL and we call do_exit->mmput after
> mm is set to NULL under the same lock.
Holding the mmap_sem for writing and setting mm->mmap to NULL to
filter which tasks already released the mmap_sem for writing post
free_pgtables still look unnecessary to solve this.
Using MMF_OOM_SKIP as flag had side effects of oom_badness() skipping
it, but we can use the same tsk_is_oom_victim instead and relay on the
locking in mark_oom_victim you pointed out above instead of the
test_and_set_bit of my patch, because current->mm is already NULL at
that point.
A race at the light of the above now is, because current->mm is NULL by the
time mmput is called, how can you start the oom_reap_task on a process
with current->mm NULL that called the last mmput and is blocked
in exit_aio? It looks like no false positive can get fixed until this
is solved first because
Isn't this enough? If this is enough it avoids other modification to
the exit_mmap runtime that looks unnecessary: mm->mmap = NULL replaced
by MMF_OOM_SKIP that has to be set anyway by __mmput later and one
unnecessary branch to call the up_write.
WARNING: multiple messages have this Message-ID (diff)
From: Andrea Arcangeli <aarcange@redhat.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
Date: Wed, 26 Jul 2017 18:29:12 +0200 [thread overview]
Message-ID: <20170726162912.GA29716@redhat.com> (raw)
In-Reply-To: <20170726054557.GB960@dhcp22.suse.cz>
On Wed, Jul 26, 2017 at 07:45:57AM +0200, Michal Hocko wrote:
> On Tue 25-07-17 21:19:52, Andrea Arcangeli wrote:
> > On Tue, Jul 25, 2017 at 06:04:00PM +0200, Michal Hocko wrote:
> > > - down_write(&mm->mmap_sem);
> > > + if (tsk_is_oom_victim(current))
> > > + down_write(&mm->mmap_sem);
> > > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > tlb_finish_mmu(&tlb, 0, -1);
> > >
> > > @@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
> > > }
> > > mm->mmap = NULL;
> > > vm_unacct_memory(nr_accounted);
> > > - up_write(&mm->mmap_sem);
> > > + if (tsk_is_oom_victim(current))
> > > + up_write(&mm->mmap_sem);
> >
> > How is this possibly safe? mark_oom_victim can run while exit_mmap is
> > running.
>
> I believe it cannot. We always call mark_oom_victim (on !current) with
> task_lock held and check task->mm != NULL and we call do_exit->mmput after
> mm is set to NULL under the same lock.
Holding the mmap_sem for writing and setting mm->mmap to NULL to
filter which tasks already released the mmap_sem for writing post
free_pgtables still look unnecessary to solve this.
Using MMF_OOM_SKIP as flag had side effects of oom_badness() skipping
it, but we can use the same tsk_is_oom_victim instead and relay on the
locking in mark_oom_victim you pointed out above instead of the
test_and_set_bit of my patch, because current->mm is already NULL at
that point.
A race at the light of the above now is, because current->mm is NULL by the
time mmput is called, how can you start the oom_reap_task on a process
with current->mm NULL that called the last mmput and is blocked
in exit_aio? It looks like no false positive can get fixed until this
is solved first because
Isn't this enough? If this is enough it avoids other modification to
the exit_mmap runtime that looks unnecessary: mm->mmap = NULL replaced
by MMF_OOM_SKIP that has to be set anyway by __mmput later and one
unnecessary branch to call the up_write.
>From 3d9001490ee1a71f39c7bfaf19e96821f9d3ff16 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Tue, 25 Jul 2017 20:02:27 +0200
Subject: [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run
concurrently
This is purely required because exit_aio() may block and exit_mmap() may
never start, if the oom_reap_task cannot start running on a mm with
mm_users == 0.
At the same time if the OOM reaper doesn't wait at all for the memory
of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it
would generate a spurious OOM kill.
If it wasn't because of the exit_aio or similar blocking functions in
the last mmput, it would be enough to change the oom_reap_task() in
the case it finds mm_users == 0, to wait for a timeout or to wait for
__mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the
problem here so the concurrency of exit_mmap and oom_reap_task is
apparently warranted.
It's a non standard runtime, exit_mmap() runs without mmap_sem, and
oom_reap_task runs with the mmap_sem for reading as usual (kind of
MADV_DONTNEED).
The race between the two is solved with a combination of
tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
(serialized by a dummy down_write/up_write cycle on the same lines of
the ksm_exit method).
If the oom_reap_task() may be running concurrently during exit_mmap,
exit_mmap will wait it to finish in down_write (before taking down mm
structures that would make the oom_reap_task fail with use after
free).
If exit_mmap comes first, oom_reap_task() will skip the mm if
MMF_OOM_SKIP is already set and in turn all memory is already freed
and furthermore the mm data structures may already have been taken
down by free_pgtables.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
kernel/fork.c | 1 -
mm/mmap.c | 17 +++++++++++++++++
mm/oom_kill.c | 15 +++++----------
3 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 9ec98b0c4675..ed412d85a596 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -910,7 +910,6 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
- set_bit(MMF_OOM_SKIP, &mm->flags);
mmdrop(mm);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index f19efcf75418..bdab595ce25c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2993,6 +2993,23 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
unmap_vmas(&tlb, vma, 0, -1);
+ set_bit(MMF_OOM_SKIP, &mm->flags);
+ if (tsk_is_oom_victim(current)) {
+ /*
+ * Wait for oom_reap_task() to stop working on this
+ * mm. Because MMF_OOM_SKIP is already set before
+ * calling down_read(), oom_reap_task() will not run
+ * on this "mm" post up_write().
+ *
+ * tsk_is_oom_victim() cannot be set from under us
+ * either because current->mm is already set to NULL
+ * under task_lock before calling mmput and oom_mm is
+ * set not NULL by the OOM killer only if current->mm
+ * is found not NULL while holding the task_lock.
+ */
+ down_write(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
+ }
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..242a1f0d579b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -495,11 +495,12 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
}
/*
- * increase mm_users only after we know we will reap something so
- * that the mmput_async is called only when we have reaped something
- * and delayed __mmput doesn't matter that much
+ * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
+ * work on the mm anymore. The check for MMF_OOM_SKIP must run
+ * under mmap_sem for reading because it serializes against the
+ * down_write();up_write() cycle in exit_mmap().
*/
- if (!mmget_not_zero(mm)) {
+ if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
up_read(&mm->mmap_sem);
trace_skip_task_reaping(tsk->pid);
goto unlock_oom;
@@ -542,12 +543,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
K(get_mm_counter(mm, MM_SHMEMPAGES)));
up_read(&mm->mmap_sem);
- /*
- * Drop our reference but make sure the mmput slow path is called from a
- * different context because we shouldn't risk we get stuck there and
- * put the oom_reaper out of the way.
- */
- mmput_async(mm);
trace_finish_task_reaping(tsk->pid);
unlock_oom:
mutex_unlock(&oom_lock);
next prev parent reply other threads:[~2017-07-26 16:29 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-07-24 7:23 [PATCH] mm, oom: allow oom reaper to race with exit_mmap Michal Hocko
2017-07-24 7:23 ` Michal Hocko
2017-07-24 14:00 ` Kirill A. Shutemov
2017-07-24 14:00 ` Kirill A. Shutemov
2017-07-24 14:15 ` Michal Hocko
2017-07-24 14:15 ` Michal Hocko
2017-07-24 14:51 ` Kirill A. Shutemov
2017-07-24 14:51 ` Kirill A. Shutemov
2017-07-24 16:11 ` Michal Hocko
2017-07-24 16:11 ` Michal Hocko
2017-07-25 14:17 ` Kirill A. Shutemov
2017-07-25 14:17 ` Kirill A. Shutemov
2017-07-25 14:26 ` Michal Hocko
2017-07-25 14:26 ` Michal Hocko
2017-07-25 15:07 ` Kirill A. Shutemov
2017-07-25 15:07 ` Kirill A. Shutemov
2017-07-25 15:15 ` Michal Hocko
2017-07-25 15:15 ` Michal Hocko
2017-07-25 14:26 ` Michal Hocko
2017-07-25 15:17 ` Kirill A. Shutemov
2017-07-25 15:17 ` Kirill A. Shutemov
2017-07-25 15:23 ` Michal Hocko
2017-07-25 15:23 ` Michal Hocko
2017-07-25 15:31 ` Kirill A. Shutemov
2017-07-25 15:31 ` Kirill A. Shutemov
2017-07-25 16:04 ` Michal Hocko
2017-07-25 16:04 ` Michal Hocko
2017-07-25 19:19 ` Andrea Arcangeli
2017-07-25 19:19 ` Andrea Arcangeli
2017-07-26 5:45 ` Michal Hocko
2017-07-26 5:45 ` Michal Hocko
2017-07-26 16:29 ` Andrea Arcangeli [this message]
2017-07-26 16:29 ` Andrea Arcangeli
2017-07-26 16:43 ` Andrea Arcangeli
2017-07-26 16:43 ` Andrea Arcangeli
2017-07-27 6:50 ` Michal Hocko
2017-07-27 6:50 ` Michal Hocko
2017-07-27 14:55 ` Andrea Arcangeli
2017-07-27 14:55 ` Andrea Arcangeli
2017-07-28 6:23 ` Michal Hocko
2017-07-28 6:23 ` Michal Hocko
2017-07-28 1:58 ` [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run kbuild test robot
2017-08-15 0:20 ` [PATCH] mm, oom: allow oom reaper to race with exit_mmap David Rientjes
2017-08-15 0:20 ` David Rientjes
2017-07-24 15:27 ` Michal Hocko
2017-07-24 15:27 ` Michal Hocko
2017-07-24 16:42 ` kbuild test robot
2017-07-24 18:12 ` Michal Hocko
2017-07-24 18:12 ` Michal Hocko
2017-07-25 15:26 ` Andrea Arcangeli
2017-07-25 15:26 ` Andrea Arcangeli
2017-07-25 15:45 ` Michal Hocko
2017-07-25 15:45 ` Michal Hocko
2017-07-25 18:26 ` Andrea Arcangeli
2017-07-25 18:26 ` Andrea Arcangeli
2017-07-26 5:45 ` Michal Hocko
2017-07-26 5:45 ` Michal Hocko
2017-07-26 16:39 ` Andrea Arcangeli
2017-07-26 16:39 ` Andrea Arcangeli
2017-07-27 6:32 ` Michal Hocko
2017-07-27 6:32 ` Michal Hocko
-- strict thread matches above, loose matches on Subject: below --
2017-07-27 8:29 Manish Jaggi
2017-07-27 9:24 ` Michal Hocko
2017-08-10 8:16 Michal Hocko
2017-08-10 8:16 ` Michal Hocko
2017-08-10 18:05 ` Andrea Arcangeli
2017-08-10 18:05 ` Andrea Arcangeli
2017-08-10 18:51 ` Michal Hocko
2017-08-10 18:51 ` Michal Hocko
2017-08-10 20:36 ` Michal Hocko
2017-08-10 20:36 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170726162912.GA29716@redhat.com \
--to=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=hughd@google.com \
--cc=kirill@shutemov.name \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=oleg@redhat.com \
--cc=penguin-kernel@I-love.SAKURA.ne.jp \
--cc=rientjes@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.