linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@kernel.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Rientjes <rientjes@google.com>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
Date: Wed, 26 Jul 2017 07:45:33 +0200	[thread overview]
Message-ID: <20170726054533.GA960@dhcp22.suse.cz> (raw)
In-Reply-To: <20170725182619.GQ29716@redhat.com>

On Tue 25-07-17 20:26:19, Andrea Arcangeli wrote:
> On Tue, Jul 25, 2017 at 05:45:14PM +0200, Michal Hocko wrote:
> > That problem is real though as reported by David.
> 
> I'm not against fixing it, I just think it's not a major concern, and
> the solution doesn't seem optimal as measured by Kirill.
> 
> I'm just skeptical it's the best to solve that tiny race, 99.9% of the
> time such down_write is unnecessary.
> 
> > it is not only about exit_mmap. __mmput calls into exit_aio and that can
> > wait for completion and there is no way to guarantee this will finish in
> > finite time.
> 
> exit_aio blocking is actually the only good point for wanting this
> concurrency where exit_mmap->unmap_vmas and
> oom_reap_task->unmap_page_range have to run concurrently on the same
> mm.

Yes, exit_aio is the only blocking call I know of currently. But I would
like this to be as robust as possible and so I do not want to rely on
the current implementation. This can change in future and I can
guarantee that nobody will think about the oom path when adding
something to the final __mmput path.

> exit_mmap would have no issue, if there was enough time in the
> lifetime CPU to allocate the memory, sure the memory will also be
> freed in finite amount of time by exit_mmap.

I am not sure I understand. Say that any call prior to unmap_vmas blocks
on a lock which is held by another call path which cannot proceed with
the allocation...
 
> In fact you mentioned multiple OOM in the NUMA case, exit_mmap may not
> solve that, so depending on the runtime it may have been better not to
> wait all memory of the process to be freed before moving to the next
> task, but only a couple of seconds before the OOM reaper moves to a
> new candidate. Again this is only a tradeoff between solving the OOM
> faster vs risk of false positives OOM.

I really do not want to rely on any timing. This just too fragile. Once
we have killed a task then we shouldn't pick another victim until it
passed exit_mmap or the oom_reaper did its job. Otherwise we just risk
false positives while we have already disrupted the workload.
 
> If it wasn't because of exit_aio (which may have to wait I/O
> completion), changing the OOM reaper to return "false" if
> mmget_not_zero returns zero and MMF_OOM_SKIP is not set yet, would
> have been enough (and depending on the runtime it may have solved OOM
> faster in NUMA) and there would be absolutely no need to run OOM
> reaper and exit_mmap concurrently on the same mm. However there's such
> exit_aio..
> 
> Raw I/O mempools never require memory allocations, although aio if it
> involves a filesystem to complete may run into filesystem or buffering
> locks which are known to loop forever or depend on other tasks stuck
> in kernel allocations, so I didn't go down that chain too long.

Exactly. We simply cannot assume anything here because veryfying this
basically impossible.
 
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..615133762b99 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> +	if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +		/* wait the OOM reaper to stop working on this mm */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..2a7000995784 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -471,6 +471,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> +	bool mmgot = true;
>  
>  	/*
>  	 * We have to make sure to not race with the victim exit path
> @@ -500,9 +501,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	 * and delayed __mmput doesn't matter that much
>  	 */
>  	if (!mmget_not_zero(mm)) {
> -		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
> -		goto unlock_oom;
> +		/*
> +		 * MMF_OOM_SKIP is set by exit_mmap when the OOM
> +		 * reaper can't work on the mm anymore.
> +		 */
> +		if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +			up_read(&mm->mmap_sem);
> +			goto unlock_oom;
> +		}
> +		mmgot = false;
>  	}

This will work more or less the same to what we have currently.

[victim]		[oom reaper]				[oom killer]
do_exit			__oom_reap_task_mm
  mmput
    __mmput
			  mmget_not_zero
			    test_and_set_bit(MMF_OOM_SKIP)
			    					oom_evaluate_task
								   # select next victim 
			  # reap the mm
      unmap_vmas

so we can select a next victim while the current one is still not
completely torn down.

> > > 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
> > >    mmap_sem and then the arch code will release the mmap_sem despite
> > >    it was already released by handle_mm_fault? Anonymous memory faults
> > >    aren't common to return VM_FAULT_RETRY but an userfault
> > >    can. Shouldn't there be a block that prevents overwriting if
> > >    VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> > > 
> > > 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > 		ret = VM_FAULT_SIGBUS;
> > 
> > I am not sure I understand what you mean and how this is related to the
> > patch?
> 
> It's not related to the patch but it involves the OOM reaper as it
> only happens when MMF_UNSTABLE is set which is set only by the OOM
> reaper. I was simply reading the OOM reaper code and following up what
> MMF_UNSTABLE does and it ringed a bell.

I hope 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom
reaped memory") will clarify this code. If not please start a new thread
so that we do not conflate different things together.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-07-26  5:45 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-24  7:23 [PATCH] mm, oom: allow oom reaper to race with exit_mmap Michal Hocko
2017-07-24 14:00 ` Kirill A. Shutemov
2017-07-24 14:15   ` Michal Hocko
2017-07-24 14:51     ` Kirill A. Shutemov
2017-07-24 16:11       ` Michal Hocko
2017-07-25 14:17         ` Kirill A. Shutemov
2017-07-25 14:26           ` Michal Hocko
2017-07-25 15:07             ` Kirill A. Shutemov
2017-07-25 15:15               ` Michal Hocko
2017-07-25 14:26         ` Michal Hocko
2017-07-25 15:17           ` Kirill A. Shutemov
2017-07-25 15:23             ` Michal Hocko
2017-07-25 15:31               ` Kirill A. Shutemov
2017-07-25 16:04                 ` Michal Hocko
2017-07-25 19:19                   ` Andrea Arcangeli
2017-07-26  5:45                     ` Michal Hocko
2017-07-26 16:29                       ` Andrea Arcangeli
2017-07-26 16:43                         ` Andrea Arcangeli
2017-07-27  6:50                         ` Michal Hocko
2017-07-27 14:55                           ` Andrea Arcangeli
2017-07-28  6:23                             ` Michal Hocko
2017-07-28  1:58                         ` [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run kbuild test robot
2017-08-15  0:20                         ` [PATCH] mm, oom: allow oom reaper to race with exit_mmap David Rientjes
2017-07-24 15:27 ` Michal Hocko
2017-07-24 16:42 ` kbuild test robot
2017-07-24 18:12   ` Michal Hocko
2017-07-25 15:26 ` Andrea Arcangeli
2017-07-25 15:45   ` Michal Hocko
2017-07-25 18:26     ` Andrea Arcangeli
2017-07-26  5:45       ` Michal Hocko [this message]
2017-07-26 16:39         ` Andrea Arcangeli
2017-07-27  6:32           ` Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2017-08-10  8:16 Michal Hocko
2017-08-10 18:05 ` Andrea Arcangeli
2017-08-10 18:51   ` Michal Hocko
2017-08-10 20:36     ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170726054533.GA960@dhcp22.suse.cz \
    --to=mhocko@kernel.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=oleg@redhat.com \
    --cc=penguin-kernel@I-love.SAKURA.ne.jp \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).