linux-mm.kvack.org archive mirror
* why do we do ALLOC_WMARK_HIGH before going out_of_memory
@ 2016-01-28 16:38 Michal Hocko
  2016-01-28 19:02 ` Andrea Arcangeli
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2016-01-28 16:38 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Mel Gorman, Johannes Weiner, David Rientjes,
	Andrew Morton

Hi,
In __alloc_pages_may_oom, just after it manages to take the oom_lock,
we try to allocate once more with the ALLOC_WMARK_HIGH target. I have
always wondered why we are willing to actually kill something even
though we are above the min wmark. This doesn't make much sense to me.
I understand that this is racy because __alloc_pages_may_oom is called
after we have failed to fulfill the WMARK_MIN target, but that means
WMARK_HIGH is highly unlikely to succeed as well. So we should either
use ALLOC_WMARK_MIN or get rid of this check altogether.
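
For reference, the check in question sits in __alloc_pages_may_oom and
looks roughly like this (a simplified sketch from memory, with several
special cases elided; not the verbatim upstream code):

static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
        const struct alloc_context *ac, unsigned long *did_some_progress)
{
        struct oom_control oc = {
                .zonelist = ac->zonelist,
                .nodemask = ac->nodemask,
                .gfp_mask = gfp_mask,
                .order = order,
        };
        struct page *page;

        *did_some_progress = 0;

        /* only one task at a time is allowed to run the OOM killer */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

        /*
         * Last attempt before killing: retry the allocation against the
         * *high* watermark to catch a parallel OOM kill that may have
         * freed memory while we were stuck in reclaim.
         */
        page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
                        ALLOC_WMARK_HIGH | ALLOC_CPUSET, ac);
        if (!page && out_of_memory(&oc))
                *did_some_progress = 1;

        mutex_unlock(&oom_lock);
        return page;
}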

The code has been added before git era by
https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc2/2.6.11-rc2-mm2/broken-out/mm-fix-several-oom-killer-bugs.patch

and it doesn't explain this particular decision. It seems to me that
whatever the reason was back then, it doesn't hold anymore.

What do you think?
-- 
Michal Hocko
SUSE Labs 


* Re: why do we do ALLOC_WMARK_HIGH before going out_of_memory
  2016-01-28 16:38 why do we do ALLOC_WMARK_HIGH before going out_of_memory Michal Hocko
@ 2016-01-28 19:02 ` Andrea Arcangeli
  2016-01-28 20:11   ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2016-01-28 19:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Mel Gorman, Johannes Weiner, David Rientjes,
	Andrew Morton

Hello Michal,

On Thu, Jan 28, 2016 at 05:38:03PM +0100, Michal Hocko wrote:
> Hi,
> In __alloc_pages_may_oom, just after it manages to take the oom_lock,
> we try to allocate once more with the ALLOC_WMARK_HIGH target. I have
> always wondered why we are willing to actually kill something even
> though we are above the min wmark. This doesn't make much sense to me.
> I understand that this is racy because __alloc_pages_may_oom is called
> after we have failed to fulfill the WMARK_MIN target, but that means
> WMARK_HIGH is highly unlikely to succeed as well. So we should either
> use ALLOC_WMARK_MIN or get rid of this check altogether.
> 
> The code has been added before git era by
> https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc2/2.6.11-rc2-mm2/broken-out/mm-fix-several-oom-killer-bugs.patch

I assume you refer to this:

+		/*
+		 * Go through the zonelist yet one more time, keep
+		 * very high watermark here, this is only to catch
+		 * a parallel oom killing, we must fail if we're still
+		 * under heavy pressure.
+		 */
+		for (i = 0; (z = zones[i]) != NULL; i++) {
+			if (!zone_watermark_ok(z, order, z->pages_high,
			   			  	 ^^^^^^^^^^^^^

> and it doesn't explain this particular decision. It seems to me that

It's not explained explicitly in the commit header, but see the
comment added just above the z->pages_high check; it at least tries
to explain it.

Although the implementation changed and now it's ALLOC_WMARK_HIGH
instead of z->pages_high, the old comment is still in the current
upstream:

	/*
	 * Go through the zonelist yet one more time, keep very high watermark
	 * here, this is only to catch a parallel oom killing, we must fail if
	 * we're still under heavy pressure.
	 */

> whatever the reason was back then, it doesn't hold anymore.
> 
> What do you think?

To elaborate on the comment: the reason for the high wmark is to
reduce the likelihood of livelocks and to be sure the OOM killer is
invoked if we're still under pressure and reclaim has just failed. The
high wmark is used to make sure the failure of reclaim isn't going to
be ignored. If we used the min wmark as you propose, there would be a
risk of livelock, or at least of a delayed OOM killer invocation.

The reason for doing one last wmark check (regardless of the wmark
used) before invoking the oom killer was just to be sure another OOM
killer invocation hadn't already freed a ton of memory while we were
stuck in reclaim. A lot of free memory generated by the OOM killer
won't make a parallel reclaim more likely to succeed; it just creates
free memory, but reclaim only succeeds when it finds "freeable" memory
and makes progress converting it to free memory. So for the purpose of
this last check the high wmark works fine, as lots of free memory
would have been generated in such a case.
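
To make this concrete with made-up numbers (a user-space toy, not
kernel code; all values are invented for illustration): a task
thrashing just above the min wmark should still go ahead and kill,
while a task that suddenly sees memory above the high wmark almost
certainly benefited from a parallel kill:

#include <stdbool.h>
#include <stdio.h>

struct zone_state {
        unsigned long free_pages;
        unsigned long wmark_min;
        unsigned long wmark_high;
};

/* the last-moment recheck: skip the kill only above the chosen wmark */
static bool skip_oom_kill(const struct zone_state *z, unsigned long wmark)
{
        return z->free_pages > wmark;
}

int main(void)
{
        /* hovering just above min: the last few pages keep being
         * recycled, the workload is effectively OOM */
        struct zone_state thrashing = {
                .free_pages = 1100, .wmark_min = 1024, .wmark_high = 1536 };
        /* right after a parallel OOM kill freed a lot of memory */
        struct zone_state after_kill = {
                .free_pages = 50000, .wmark_min = 1024, .wmark_high = 1536 };

        /* min would skip the kill while thrashing (livelock/delay risk);
         * high only skips it when a kill really freed memory */
        printf("thrashing:  skip(min)=%d skip(high)=%d\n",
                skip_oom_kill(&thrashing, thrashing.wmark_min),
                skip_oom_kill(&thrashing, thrashing.wmark_high));
        printf("after kill: skip(min)=%d skip(high)=%d\n",
                skip_oom_kill(&after_kill, after_kill.wmark_min),
                skip_oom_kill(&after_kill, after_kill.wmark_high));
        return 0;
}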

It's not immediately apparent whether there is new OOM killer logic
upstream that would prevent the risk of a second OOM killer invocation
after another OOM kill already happened while we were stuck in
reclaim. In the absence of that, the high wmark check is still needed.

Thanks,
Andrea


* Re: why do we do ALLOC_WMARK_HIGH before going out_of_memory
  2016-01-28 19:02 ` Andrea Arcangeli
@ 2016-01-28 20:11   ` Michal Hocko
  2016-01-28 21:12     ` Johannes Weiner
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2016-01-28 20:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Johannes Weiner, David Rientjes,
	Andrew Morton

On Thu 28-01-16 20:02:04, Andrea Arcangeli wrote:
> Hello Michal,
> 
> On Thu, Jan 28, 2016 at 05:38:03PM +0100, Michal Hocko wrote:
> > Hi,
> > In __alloc_pages_may_oom, just after it manages to take the oom_lock,
> > we try to allocate once more with the ALLOC_WMARK_HIGH target. I have
> > always wondered why we are willing to actually kill something even
> > though we are above the min wmark. This doesn't make much sense to me.
> > I understand that this is racy because __alloc_pages_may_oom is called
> > after we have failed to fulfill the WMARK_MIN target, but that means
> > WMARK_HIGH is highly unlikely to succeed as well. So we should either
> > use ALLOC_WMARK_MIN or get rid of this check altogether.
> > 
> > The code has been added before git era by
> > https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc2/2.6.11-rc2-mm2/broken-out/mm-fix-several-oom-killer-bugs.patch
> 
> I assume you refer to this:
> 
> +		/*
> +		 * Go through the zonelist yet one more time, keep
> +		 * very high watermark here, this is only to catch
> +		 * a parallel oom killing, we must fail if we're still
> +		 * under heavy pressure.
> +		 */
> +		for (i = 0; (z = zones[i]) != NULL; i++) {
> +			if (!zone_watermark_ok(z, order, z->pages_high,
> 			   			  	 ^^^^^^^^^^^^^

yes

> > and it doesn't explain this particular decision. It seems to me that
> 
> It's not explained explicitly in the commit header, but see the
> comment added just above the z->pages_high check; it at least tries
> to explain it.
> 
> Although the implementation changed and now it's ALLOC_WMARK_HIGH
> instead of z->pages_high, the old comment is still in the current
> upstream:
> 
> 	/*
> 	 * Go through the zonelist yet one more time, keep very high watermark
> 	 * here, this is only to catch a parallel oom killing, we must fail if
> 	 * we're still under heavy pressure.
> 	 */

Yes I have read the comment but it doesn't make any sense to me, to be
honest.

> > whatever the reason was back then, it doesn't hold anymore.
> > 
> > What do you think?
> 
> To elaborate on the comment: the reason for the high wmark is to
> reduce the likelihood of livelocks and to be sure the OOM killer is
> invoked if we're still under pressure and reclaim has just failed. The
> high wmark is used to make sure the failure of reclaim isn't going to
> be ignored. If we used the min wmark as you propose, there would be a
> risk of livelock, or at least of a delayed OOM killer invocation.

By livelock you mean thrashing, when the last few pages are recycled
very quickly and the OOM killer should be invoked instead?

> The reason for doing one last wmark check (regardless of the wmark
> used) before invoking the oom killer was just to be sure another OOM
> killer invocation hadn't already freed a ton of memory while we were
> stuck in reclaim. A lot of free memory generated by the OOM killer
> won't make a parallel reclaim more likely to succeed; it just creates
> free memory, but reclaim only succeeds when it finds "freeable" memory
> and makes progress converting it to free memory. So for the purpose of
> this last check the high wmark works fine, as lots of free memory
> would have been generated in such a case.

OK, I see. It is true that we retry the allocation only if the direct
reclaim made some progress, and that check is not aware of memory
freed by the oom killer.

>
> It's not immediately apparent whether there is new OOM killer logic
> upstream that would prevent the risk of a second OOM killer invocation
> after another OOM kill already happened while we were stuck in
> reclaim. In the absence of that, the high wmark check is still needed.

Well, my oom detection rework [1] strives to make the OOM detection more
robust and the retry logic performs the watermark check. So I think the
last attempt is no longer needed after that patch. I will then remove
it.

Thanks for the clarification
---
[1] http://lkml.kernel.org/r/1450203586-10959-1-git-send-email-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs


* Re: why do we do ALLOC_WMARK_HIGH before going out_of_memory
  2016-01-28 20:11   ` Michal Hocko
@ 2016-01-28 21:12     ` Johannes Weiner
  2016-01-28 21:55       ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2016-01-28 21:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, linux-mm, Mel Gorman, David Rientjes,
	Andrew Morton

On Thu, Jan 28, 2016 at 09:11:23PM +0100, Michal Hocko wrote:
> On Thu 28-01-16 20:02:04, Andrea Arcangeli wrote:
> > It's not immediately apparent whether there is new OOM killer logic
> > upstream that would prevent the risk of a second OOM killer invocation
> > after another OOM kill already happened while we were stuck in
> > reclaim. In the absence of that, the high wmark check is still needed.
> 
> Well, my oom detection rework [1] strives to make the OOM detection more
> robust and the retry logic performs the watermark check. So I think the
> last attempt is no longer needed after that patch. I will then remove
> it.

Hm? I don't have the same conclusion from what Andrea said.

When you have many allocations racing at the same time, they can all
enter __alloc_pages_may_oom() in quick succession. We don't want a
cavalcade of OOM kills when one could be enough, so we have to make
sure that in between should_alloc_retry() giving up and acquiring the
OOM lock nobody else already issued a kill and released enough memory.

It's a race window that gets yanked wide open when hundreds of threads
race in __alloc_pages_may_oom(). Your patches don't fix that, AFAICS.


* Re: why do we do ALLOC_WMARK_HIGH before going out_of_memory
  2016-01-28 21:12     ` Johannes Weiner
@ 2016-01-28 21:55       ` Michal Hocko
  2016-01-28 23:40         ` Johannes Weiner
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2016-01-28 21:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrea Arcangeli, linux-mm, Mel Gorman, David Rientjes,
	Andrew Morton

On Thu 28-01-16 16:12:40, Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 09:11:23PM +0100, Michal Hocko wrote:
> > On Thu 28-01-16 20:02:04, Andrea Arcangeli wrote:
> > > It's not immediately apparent whether there is new OOM killer logic
> > > upstream that would prevent the risk of a second OOM killer invocation
> > > after another OOM kill already happened while we were stuck in
> > > reclaim. In the absence of that, the high wmark check is still needed.
> > 
> > Well, my oom detection rework [1] strives to make the OOM detection more
> > robust and the retry logic performs the watermark check. So I think the
> > last attempt is no longer needed after that patch. I will then remove
> > it.
> 
> Hm? I don't have the same conclusion from what Andrea said.
> 
> When you have many allocations racing at the same time, they can all
> enter __alloc_pages_may_oom() in quick succession. We don't want a
> cavalcade of OOM kills when one could be enough, so we have to make
> sure that in between should_alloc_retry() giving up and acquiring the
> OOM lock nobody else already issued a kill and released enough memory.
> 
> It's a race window that gets yanked wide open when hundreds of threads
> race in __alloc_pages_may_oom(). Your patches don't fix that, AFAICS.

Only one task would be allowed to go out_of_memory and all the rest will
simply fail on oom_lock trylock and return with NULL. Or am I missing
your point?

-- 
Michal Hocko
SUSE Labs


* Re: why do we do ALLOC_WMARK_HIGH before going out_of_memory
  2016-01-28 21:55       ` Michal Hocko
@ 2016-01-28 23:40         ` Johannes Weiner
  2016-01-29 14:38           ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2016-01-28 23:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, linux-mm, Mel Gorman, David Rientjes,
	Andrew Morton

On Thu, Jan 28, 2016 at 10:55:15PM +0100, Michal Hocko wrote:
> On Thu 28-01-16 16:12:40, Johannes Weiner wrote:
> > On Thu, Jan 28, 2016 at 09:11:23PM +0100, Michal Hocko wrote:
> > > On Thu 28-01-16 20:02:04, Andrea Arcangeli wrote:
> > > > It's not immediately apparent whether there is new OOM killer logic
> > > > upstream that would prevent the risk of a second OOM killer invocation
> > > > after another OOM kill already happened while we were stuck in
> > > > reclaim. In the absence of that, the high wmark check is still needed.
> > > 
> > > Well, my oom detection rework [1] strives to make the OOM detection more
> > > robust and the retry logic performs the watermark check. So I think the
> > > last attempt is no longer needed after that patch. I will then remove
> > > it.
> > 
> > Hm? I don't have the same conclusion from what Andrea said.
> > 
> > When you have many allocations racing at the same time, they can all
> > enter __alloc_pages_may_oom() in quick succession. We don't want a
> > cavalcade of OOM kills when one could be enough, so we have to make
> > sure that in between should_alloc_retry() giving up and acquiring the
> > OOM lock nobody else already issued a kill and released enough memory.
> > 
> > It's a race window that gets yanked wide open when hundreds of threads
> > race in __alloc_pages_may_oom(). Your patches don't fix that, AFAICS.
> 
> Only one task would be allowed to go out_of_memory and all the rest will
> simply fail on oom_lock trylock and return with NULL. Or am I missing
> your point?

Just picture it with mutex_lock() instead of mutex_trylock() and it
becomes obvious why you have to do a locked check before the kill.

The race window is much smaller with the trylock of course, but given
enough threads it's possible that one of the other contenders would
acquire the trylock right after the first task drops it:

first task:                     204th task:
!reclaim                        !reclaim
!should_alloc_retry             !should_alloc_retry
oom_trylock
out_of_memory
oom_unlock
                                oom_trylock
                                out_of_memory // likely unnecessary
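
To make the ordering above concrete, here is a self-contained
user-space sketch (illustrative only, nothing here is kernel code):
without the locked re-check, every thread that wins the trylock after
the first kill would kill again:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 200

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static long free_pages;                 /* shared "zone" state */
static const long high_wmark = 1000;
static int kills;

static void oom_kill_something(void)
{
        free_pages += 100000;           /* a kill frees a lot of memory */
        kills++;
}

static void *allocator(void *arg)
{
        (void)arg;
        /* ... reclaim failed, should_alloc_retry gave up ... */
        if (pthread_mutex_trylock(&oom_lock) == 0) {
                /*
                 * The locked re-check: did a parallel kill already push
                 * us back above the (high) watermark?  Drop this check
                 * and every thread that wins the trylock kills again.
                 */
                if (free_pages <= high_wmark)
                        oom_kill_something();
                pthread_mutex_unlock(&oom_lock);
        }
        /* trylock losers just give up and retry their allocation */
        return NULL;
}

int main(void)
{
        pthread_t t[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, allocator, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);
        printf("kills: %d (want 1)\n", kills);
        return 0;
}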


* Re: why do we do ALLOC_WMARK_HIGH before going out_of_memory
  2016-01-28 23:40         ` Johannes Weiner
@ 2016-01-29 14:38           ` Michal Hocko
  2016-01-29 15:56             ` Andrea Arcangeli
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2016-01-29 14:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrea Arcangeli, linux-mm, Mel Gorman, David Rientjes,
	Andrew Morton

On Thu 28-01-16 18:40:18, Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 10:55:15PM +0100, Michal Hocko wrote:
> > On Thu 28-01-16 16:12:40, Johannes Weiner wrote:
> > > On Thu, Jan 28, 2016 at 09:11:23PM +0100, Michal Hocko wrote:
> > > > On Thu 28-01-16 20:02:04, Andrea Arcangeli wrote:
> > > > > It's not immediately apparent whether there is new OOM killer logic
> > > > > upstream that would prevent the risk of a second OOM killer invocation
> > > > > after another OOM kill already happened while we were stuck in
> > > > > reclaim. In the absence of that, the high wmark check is still needed.
> > > > 
> > > > Well, my oom detection rework [1] strives to make the OOM detection more
> > > > robust and the retry logic performs the watermark check. So I think the
> > > > last attempt is no longer needed after that patch. I will then remove
> > > > it.
> > > 
> > > Hm? I don't have the same conclusion from what Andrea said.
> > > 
> > > When you have many allocations racing at the same time, they can all
> > > enter __alloc_pages_may_oom() in quick succession. We don't want a
> > > cavalcade of OOM kills when one could be enough, so we have to make
> > > sure that in between should_alloc_retry() giving up and acquiring the
> > > OOM lock nobody else already issued a kill and released enough memory.
> > > 
> > > It's a race window that gets yanked wide open when hundreds of threads
> > > race in __alloc_pages_may_oom(). Your patches don't fix that, AFAICS.
> > 
> > Only one task would be allowed to go out_of_memory and all the rest will
> > simply fail on oom_lock trylock and return with NULL. Or am I missing
> > your point?
> 
> Just picture it with mutex_lock() instead of mutex_trylock() and it
> becomes obvious why you have to do a locked check before the kill.
> 
> The race window is much smaller with the trylock of course, but given
> enough threads it's possible that one of the other contenders would
> acquire the trylock right after the first task drops it:
> 
> first task:                     204th task:
> !reclaim                        !reclaim
> !should_alloc_retry             !should_alloc_retry
> oom_trylock
> out_of_memory
> oom_unlock
>                                 oom_trylock
>                                 out_of_memory // likely unnecessary

That would require the oom victim to release the memory and drop
TIF_MEMDIE before we go out_of_memory again. And that might happen
anytime whether we are holding oom_trylock or not because it doesn't
synchronize the exit path. So we are basically talking about:

should_alloc_retry
[1]
get_page_from_freelist(ALLOC_WMARK_HIGH)
[2]
out_of_memory

and the race window for 1 is much smaller than 2 because [2] is quite
costly operation. I wonder if this last moment request ever succeeds. I
have run my usual oom flood tests and it hasn't shown up a single time.

That being said I do not care that much. I just find this confusing and
basically pointless because the whole thing is racy by definition and we
are trying to cover a smaller window. I would understand if we did such
a last attempt right before we are going to kill a selected victim. This
would cover a much larger race window.
-- 
Michal Hocko
SUSE Labs


* Re: why do we do ALLOC_WMARK_HIGH before going out_of_memory
  2016-01-29 14:38           ` Michal Hocko
@ 2016-01-29 15:56             ` Andrea Arcangeli
  2016-01-29 16:12               ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2016-01-29 15:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, Mel Gorman, David Rientjes,
	Andrew Morton

On Fri, Jan 29, 2016 at 03:38:06PM +0100, Michal Hocko wrote:
> That would require the oom victim to release the memory and drop
> TIF_MEMDIE before we go out_of_memory again. And that might happen
> anytime whether we are holding oom_trylock or not because it doesn't
> synchronize the exit path. So we are basically talking about:
> 
> should_alloc_retry
> [1]
> get_page_from_freelist(ALLOC_WMARK_HIGH)
> [2]
> out_of_memory
> 
> and the race window for 1 is much smaller than 2 because [2] is quite

[1] is before should_alloc_retry is set. It covers the entire reclaim.

> costly operation. I wonder if this last moment request ever succeeds. I
> have run my usual oom flood tests and it hasn't shown up a single time.

For this check to make a difference, you need a lot of small programs
all hitting OOM at the same time. Perhaps the trylock on the oom_lock
tends to hide the race like you suggested earlier but it doesn't sound
accurate if we proceed to oom kill without checking the high wmark at all
before killing another task after a random reclaim failure.

Also note there's no CPU to save here, this is a very slow path;
anything that can increase accuracy and avoid OOM kill false
positives (at practically zero CPU cost like here) sounds worth it.

> That being said I do not care that much. I just find this confusing and
> basically pointless because the whole thing is racy by definition and we
> are trying to cover a smaller window. I would understand if we did such
> a last attempt right before we are going to kill a selected victim. This
> would cover a much larger race window.

The high wmark itself is still an arbitrary value so yes, it isn't
perfect, but the whole OOM killing is a heuristic, and a tiny race
window sounds better to me than a huge one.

Moving this check down inside out_of_memory to reduce the window even
further is quite a different proposition than removing the check.

Currently we're doing this check while holding the oom_lock; back in
2.6.x it was much more racy, and now, thanks to the oom_lock, it's way
more reliable. If you want to increase reliability further I sure
agree, but removing the check would drop reliability instead, so I
don't see how it could be preferable.

We could increase reliability further if we moved this high wmark
check to after select_bad_process() has returned a task (and not
-1UL), to be sure all TIF_MEMDIE tasks have already been flushed out
before we check the high wmark. It would just complicate the code,
and that's probably why it wasn't done.
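
Roughly what I mean, as a hand-written sketch against the current
out_of_memory() (the wmark-recheck helper is made up and details are
elided; this is not a real patch):

        p = select_bad_process(oc, &points, totalpages);
        if (p && p != (struct task_struct *)-1UL) {
                /*
                 * All TIF_MEMDIE victims have been flushed out by now,
                 * so re-checking the high wmark here would also cover
                 * the whole victim selection window before we kill.
                 */
                if (last_moment_alloc_above_high_wmark(oc)) /* made up */
                        goto out;                           /* don't kill */
                oom_kill_process(oc, p, points, totalpages, NULL,
                                "Out of memory");
        }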

Thanks,
Andrea


* Re: why do we do ALLOC_WMARK_HIGH before going out_of_memory
  2016-01-29 15:56             ` Andrea Arcangeli
@ 2016-01-29 16:12               ` Michal Hocko
  2016-01-29 16:29                 ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2016-01-29 16:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Johannes Weiner, linux-mm, Mel Gorman, David Rientjes,
	Andrew Morton

On Fri 29-01-16 16:56:45, Andrea Arcangeli wrote:
> On Fri, Jan 29, 2016 at 03:38:06PM +0100, Michal Hocko wrote:
> > That would require the oom victim to release the memory and drop
> > TIF_MEMDIE before we go out_of_memory again. And that might happen
> > anytime whether we are holding oom_trylock or not because it doesn't
> > synchronize the exit path. So we are basically talking about:
> > 
> > should_alloc_retry
> > [1]
> > get_page_from_freelist(ALLOC_WMARK_HIGH)
> > [2]
> > out_of_memory
> > 
> > and the race window for 1 is much smaller than 2 because [2] is quite
> 
> [1] is before should_alloc_retry is set. It covers the entire reclaim.
> 
> > costly operation. I wonder if this last moment request ever succeeds. I
> > have run my usual oom flood tests and it hasn't shown up a single time.
> 
> For this check to make a difference, you need a lot of small programs
> all hitting OOM at the same time.

That is essentially what my oom flood testing program is doing:
spawning hundreds of parallel anon mem eaters.

> Perhaps the trylock on the oom_lock
> tends to hide the race like you suggested earlier but it doesn't sound
> accurate if we proceed to oom kill without checking the high wmark at all
> before killing another task after a random reclaim failure.

The thing is that the reclaim would have to reclaim consistently after
the rework.

> Also note there's no CPU to save here, this is a very slow path;
> anything that can increase accuracy and avoid OOM kill false
> positives (at practically zero CPU cost like here) sounds worth it.

Sure, my idea wasn't to save CPU. The code was just a head scratcher
and this was an attempt at a clean up. We have more places where we
keep some heuristics just because we have had them forever, and it is
hard to judge exactly what effect they have.

> > That being said I do not care that much. I just find this confusing and
> > basically pointless because the whole thing is racy by definition and we
> > are trying to cover a smaller window. I would understand if we did such
> > a last attempt right before we are going to kill a selected victim. This
> > would cover a much larger race window.
> 
> The high wmark itself is still an arbitrary value so yes, it isn't
> perfect, but the whole OOM killing is a heuristic, and a tiny race
> window sounds better to me than a huge one.
> 
> Moving this check down inside out_of_memory to reduce the window even
> further is quite a different proposition than removing the check.
> 
> Currently we're doing this check while holding the oom_lock; back in
> 2.6.x it was much more racy, and now, thanks to the oom_lock, it's way
> more reliable. If you want to increase reliability further I sure
> agree, but removing the check would drop reliability instead, so I
> don't see how it could be preferable.
> 
> We could increase reliability further if we moved this high wmark
> check to after select_bad_process() has returned a task (and not
> -1UL), to be sure all TIF_MEMDIE tasks have already been flushed out
> before we check the high wmark. It would just complicate the code,
> and that's probably why it wasn't done.

This would be mixing two layers, which is not nice. Johannes was
proposing a different approach [1], which sounds much better to me.

[1] http://lkml.kernel.org/r/20160128235110.GA5805@cmpxchg.org
-- 
Michal Hocko
SUSE Labs


* Re: why do we do ALLOC_WMARK_HIGH before going out_of_memory
  2016-01-29 16:12               ` Michal Hocko
@ 2016-01-29 16:29                 ` Michal Hocko
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-01-29 16:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Johannes Weiner, linux-mm, Mel Gorman, David Rientjes,
	Andrew Morton

On Fri 29-01-16 17:12:57, Michal Hocko wrote:
> On Fri 29-01-16 16:56:45, Andrea Arcangeli wrote:
> > On Fri, Jan 29, 2016 at 03:38:06PM +0100, Michal Hocko wrote:
> > > That would require the oom victim to release the memory and drop
> > > TIF_MEMDIE before we go out_of_memory again. And that might happen
> > > anytime whether we are holding oom_trylock or not because it doesn't
> > > synchronize the exit path. So we are basically talking about:
> > > 
> > > should_alloc_retry
> > > [1]
> > > get_page_from_freelist(ALLOC_WMARK_HIGH)
> > > [2]
> > > out_of_memory
> > > 
> > > and the race window for 1 is much smaller than 2 because [2] is quite
> > 
> > [1] is before should_alloc_retry is set. It covers the entire reclaim.
> > 
> > > costly operation. I wonder if this last moment request ever succeeds. I
> > > have run my usual oom flood tests and it hasn't shown up a single time.
> > 
> > For this check to make a difference, you need a lot of small programs
> > all hitting OOM at the same time.
> 
> That is essentially what my oom flood testing program is doing:
> spawning hundreds of parallel anon mem eaters.
> 
> > Perhaps the trylock on the oom_lock
> > tends to hide the race like you suggested earlier but it doesn't sound
> > accurate if we proceed to oom kill without checking the high wmark at all
> > before killing another task after a random reclaim failure.
> 
> The thing is that the reclaim would have to reclaim consistently after
> the rework.

Ble. It should read: the reclaim would have to fail consistently, or
have only a very small chance to reclaim enough to fulfill the
allocation request (even if we reclaimed all the reclaimable memory
combined with the free memory). So the chances to succeed after
should_alloc_retry are quite small. The race is there of course, and
something might have just freed a large amount of memory, but my
primary point is that [1] is way too small to make a difference. It
would make a difference if we slept on the lock, of course, but that
is not happening.

Anyway, I have refrained from pursuing the patch to remove this
last-minute check. It is definitely not harmful.
-- 
Michal Hocko
SUSE Labs
