oom killer rewrite

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* oom killer rewrite
@ 2010-05-19 22:14 David Rientjes
  2010-05-20  0:27 ` KAMEZAWA Hiroyuki
  2010-05-24  1:09 ` KOSAKI Motohiro
  0 siblings, 2 replies; 18+ messages in thread
From: David Rientjes @ 2010-05-19 22:14 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm

KOSAKI,

I've been notified that my entire oom killer rewrite has been dropped from 
-mm based solely on your feedback.  The problem is that I have absolutely 
no idea what issues you have with the changes that haven't already been 
addressed (nobody else does, either, it seems).

The last work I've done on the patches are to ask those involved in the 
review (including you) and linux-mm whether there were any outstanding 
issues that anyone has, and I've asked that twice.  I've received no 
response either time.

Please respond with a list of your objections to the rewrite (which is 
available at 
http://www.kernel.org/pub/linux/kernel/people/rientjes/oom-killer-rewrite
so we can move forward.

Thank you.

			David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-19 22:14 oom killer rewrite David Rientjes
@ 2010-05-20  0:27 ` KAMEZAWA Hiroyuki
  2010-05-25  9:42   ` David Rientjes
  2010-05-24  1:09 ` KOSAKI Motohiro
  1 sibling, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-20  0:27 UTC (permalink / raw)
  To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm

On Wed, 19 May 2010 15:14:42 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> KOSAKI,
> 
> I've been notified that my entire oom killer rewrite has been dropped from 
> -mm based solely on your feedback.  The problem is that I have absolutely 
> no idea what issues you have with the changes that haven't already been 
> addressed (nobody else does, either, it seems).
> 

I've pointed out that "normalized" parameter doesn't seem to work well in some
situaion (in cluster). I hope you'll have an extra interface as

	echo 3G > /proc/<pid>/oom_indemification

to allow users have "absolute value" setting.
(If the admin know usual memory usage of an application, we can only
 add badness to extra memory usage.)

To be honest, I can't fully understand why we need _normalized_ parameter. Why
oom_adj _which is now used_ is not enough for setting "relative importance" ?

Does google guys controls importance of processes in very small step ?

And, IIRC, Nick pointed out that "don't remove _used_ interfaces just because
you hate it or it seems not clean". So, I recommend you to drop sysctl changes.

I think the whole concept of your patch series is good and I like it.
But changes in interfaces seem not very sensible. 

Don't take my word very serious but I don't like changes in interface.

Cheers,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-19 22:14 oom killer rewrite David Rientjes
  2010-05-20  0:27 ` KAMEZAWA Hiroyuki
@ 2010-05-24  1:09 ` KOSAKI Motohiro
  2010-05-24  7:07   ` Nick Piggin
  2010-05-25  9:55   ` David Rientjes
  1 sibling, 2 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-05-24  1:09 UTC (permalink / raw)
  To: David Rientjes; +Cc: kosaki.motohiro, linux-mm, KAMEZAWA Hiroyuki

Hi

> KOSAKI,
> 
> I've been notified that my entire oom killer rewrite has been dropped from 
> -mm based solely on your feedback.  The problem is that I have absolutely 
> no idea what issues you have with the changes that haven't already been 
> addressed (nobody else does, either, it seems).

That's simple. A regression and an incompatibility are absolutely
unacceptable. They should be removed. Your patches have some funny parts,
but, afaik, nobody said funny requirement itself is wrong. They only said
your requirement don't have to cause any pain to other users.

Zero risk patches are always acceptable.

> 
> The last work I've done on the patches are to ask those involved in the 
> review (including you) and linux-mm whether there were any outstanding 
> issues that anyone has, and I've asked that twice.  I've received no 
> response either time.
> 
> Please respond with a list of your objections to the rewrite (which is 
> available at 
> http://www.kernel.org/pub/linux/kernel/people/rientjes/oom-killer-rewrite
> so we can move forward.

I've reviewed all of your patches. the result is here.

> oom-filter-tasks-not-sharing-the-same-cpuset.patch
	ok, no objection.
	I'm still afraid this patch reinstanciate old bug. but at that time,
	we can drop it solely. this patch is enough bisectable.

> oom-sacrifice-child-with-highest-badness-score-for-parent.patch
	ok, no objection.
	It's good patch.

> oom-select-task-from-tasklist-for-mempolicy-ooms.patch
	ok, no objection.

> oom-remove-special-handling-for-pagefault-ooms.patch
	ok, no objection.

> oom-badness-heuristic-rewrite.patch
	No. All of rewrite is bad idea. Please make separate some
	individual patches.
	All rewrite thing break bisectability. Perhaps it can steal
	a lot of time from MM developers.
	This patch have following parts.
	1) Add oom_score_adj
	2) OOM score normalization
	3) forkbomb detector
	4) oom_forkbomb_thres new knob
 	5) Root user get 3% bonus instead 400%

	all except (2) seems ok. but I'll review them again after separation.
	but you can't insert your copyright. 

> oom-deprecate-oom_adj-tunable.patch
	NAK. you can't change userland use-case at all. This patch
	only makes bug report flood and streal our time.

> oom-replace-sysctls-with-quick-mode.patch
	NAK. To change sysctl makes confusion to userland.
	You have to prove such deprecated sysctl was alread unused.
	But the fact is, there is users. I have hear some times such
	use case and recent bug reporter said that's used.

	https://bugzilla.kernel.org/show_bug.cgi?id=15058

> oom-avoid-oom-killer-for-lowmem-allocations.patch
	I don't like this one. 64bit arch have big (e.g. 2/4G)
	DMA_ZONE/DMA32_ZONE. So, if we create small guest kernel
	on KVM (or Xen), Killing processes may help. IOW, this
	one is conceptually good. but this check way is brutal.

	but even though it's ok. Let's go merge it. this patch is
	enough small.
	If any problem is occur, we can revert this one easily.


> oom-remove-unnecessary-code-and-cleanup.patch
	ok, no objection.

> oom-default-to-killing-current-for-pagefault-ooms.patch
	NAK.
	1) this patch break panic_on_oom
	2) At this merge window, Nick change almost all architecture's
	   page hault handler. now almost all arch use
	   pagefault_out_of_memory. your description has been a bit obsoleted.

> oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
	no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
	So, Don't we need push his patch instead?

> oom-hold-tasklist_lock-when-dumping-tasks.patch
	ok, no objection.

> oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch
	ok, no objection.

> oom-avoid-sending-exiting-tasks-a-sigkill.patch
	ok, no objection

> oom-cleanup-oom_kill_task.patch
	ok, no objection

> oom-cleanup-oom_badness.patch
	ok, no objection

The above "no objection" mean you can feel free to use my reviewed-by tag.


Thanks.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-24  1:09 ` KOSAKI Motohiro
@ 2010-05-24  7:07   ` Nick Piggin
  2010-05-25  9:46     ` David Rientjes
  2010-05-25  9:55   ` David Rientjes
  1 sibling, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2010-05-24  7:07 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: David Rientjes, linux-mm, KAMEZAWA Hiroyuki

On Mon, May 24, 2010 at 10:09:34AM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > KOSAKI,
> > 
> > I've been notified that my entire oom killer rewrite has been dropped from 
> > -mm based solely on your feedback.  The problem is that I have absolutely 
> > no idea what issues you have with the changes that haven't already been 
> > addressed (nobody else does, either, it seems).

I had exactly the same issues with the userland kernel API changes and
the pagefault OOM regression it introduced, which I told you months ago.
You ignored me, it seems.

> 
> That's simple. A regression and an incompatibility are absolutely
> unacceptable. They should be removed. Your patches have some funny parts,
> but, afaik, nobody said funny requirement itself is wrong. They only said
> your requirement don't have to cause any pain to other users.
> 
> Zero risk patches are always acceptable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-20  0:27 ` KAMEZAWA Hiroyuki
@ 2010-05-25  9:42   ` David Rientjes
  2010-05-26  0:17     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-25  9:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: KOSAKI Motohiro, linux-mm

On Thu, 20 May 2010, KAMEZAWA Hiroyuki wrote:

> I've pointed out that "normalized" parameter doesn't seem to work well in some
> situaion (in cluster). I hope you'll have an extra interface as
> 
> 	echo 3G > /proc/<pid>/oom_indemification
> 
> to allow users have "absolute value" setting.
> (If the admin know usual memory usage of an application, we can only
>  add badness to extra memory usage.)
> 
> To be honest, I can't fully understand why we need _normalized_ parameter. Why
> oom_adj _which is now used_ is not enough for setting "relative importance" ?
> 

The only sane badness heuristic will be one that effectively compares all 
eligible tasks for oom kill in a way that are relative to one another; I'm 
concerned that a tunable that is based on a pure memory quantity requires 
specific knowledge of the system (or memcg, cpuset, etc) capacity before 
it is meaningful.  In other words, I opted to use a relative proportion so 
that when tasks are constrained to cpusets or memcgs or mempolicies they 
become part of a "virtualized system" where the proportion is then used in 
calculation of the total amount of system RAM, memcg limit, cpuset mems 
capacities, etc, without knowledge of what that value actually is.  So 
"echo 3G" may be valid in your example when not constrained to any cgroup 
or mempolicy but becomes invalid if I attach it to a cpuset with a single 
node of 1G capacity.  When oom_score_adj, we can specify the proportion 
"of the resources that the application has access to" in comparison to 
other applications that share those resources to determine oom killing 
priority.  I think that's a very powerful interface and your suggestion 
could easily be implemented in userspace with a simple divide, thus we 
don't need kernel support for it.

> And, IIRC, Nick pointed out that "don't remove _used_ interfaces just because
> you hate it or it seems not clean". So, I recommend you to drop sysctl changes.
> 

I addressed this in 
oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch so that the 
end result was that only the oom_dump_tasks sysctl was removed because it 
was now enabled by default since it provides useful information about the 
state of the VM at the time of allocation failure.  So there's no longer 
any removal of "used" interfaces (the only previous use of oom_dump_tasks 
was to enable it, which is now the default).  Are you disagreeing with the 
deprecation of the sysctl since it was then folded into the new 
oom_dump_quick sysctl?

Regardless, I dropped all of these cleanups from my latest patch series 
which I'll post shortly, but keep in mind that what existed in -mm before 
it was dropped did not break any userspace API, it simply deprecated them 
for removal in two years.

> I think the whole concept of your patch series is good and I like it.

Thanks, Kame, for your continued interest in this work, it's encouraging.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-24  7:07   ` Nick Piggin
@ 2010-05-25  9:46     ` David Rientjes
  2010-05-25 10:05       ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-25  9:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: KOSAKI Motohiro, linux-mm, KAMEZAWA Hiroyuki

On Mon, 24 May 2010, Nick Piggin wrote:

> > > I've been notified that my entire oom killer rewrite has been dropped from 
> > > -mm based solely on your feedback.  The problem is that I have absolutely 
> > > no idea what issues you have with the changes that haven't already been 
> > > addressed (nobody else does, either, it seems).
> 
> I had exactly the same issues with the userland kernel API changes and
> the pagefault OOM regression it introduced, which I told you months ago.
> You ignored me, it seems.
> 

No, I didn't ignore you, your comments were specifically addressed with 
oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch which only 
deprecated the API change and wasn't even scheduled for removal until of 
the end of 2011.  So there were no kernel API changes that went 
unaddressed, perhaps you just didn't see that patch (I cc'd it to you on 
April 27, though).

The pagefault oom behavior can now be changed back since you've converted 
all existing architectures to call into the oom killer and not simply kill 
current (thanks for that work!).  Previously, there was an inconsistency 
amongst architectures in panic_on_oom behavior that we can now unify into 
semantics that work across the board.

I've made that change in my latest patch series which I'll be posting 
shortly.

Thanks for the feedback!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-24  1:09 ` KOSAKI Motohiro
  2010-05-24  7:07   ` Nick Piggin
@ 2010-05-25  9:55   ` David Rientjes
  2010-05-26  0:02     ` David Rientjes
  2010-05-28  5:25     ` KOSAKI Motohiro
  1 sibling, 2 replies; 18+ messages in thread
From: David Rientjes @ 2010-05-25  9:55 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, KAMEZAWA Hiroyuki

On Mon, 24 May 2010, KOSAKI Motohiro wrote:

> > I've been notified that my entire oom killer rewrite has been dropped from 
> > -mm based solely on your feedback.  The problem is that I have absolutely 
> > no idea what issues you have with the changes that haven't already been 
> > addressed (nobody else does, either, it seems).
> 
> That's simple. A regression and an incompatibility are absolutely
> unacceptable. They should be removed. Your patches have some funny parts,
> but, afaik, nobody said funny requirement itself is wrong. They only said
> your requirement don't have to cause any pain to other users.
> 
> Zero risk patches are always acceptable.
> 

When you see these "funny parts," please let me know what they are.  The 
were was no incompatibility issue after 
oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch was merged, 
the interface was simply deprecated.  Arguing against the deprecation is 
understandable and quite frankly something I'd like to avoid since it's 
apparently hanging up the larger importance of the work, so I've dropped 
the consolidation (and subsequent deprecation of oom_kill_allocating_task) 
of the sysctls from my latest patch series.

> I've reviewed all of your patches. the result is here.
> 
> > oom-filter-tasks-not-sharing-the-same-cpuset.patch
> 	ok, no objection.
> 	I'm still afraid this patch reinstanciate old bug. but at that time,
> 	we can drop it solely. this patch is enough bisectable.
> 
> > oom-sacrifice-child-with-highest-badness-score-for-parent.patch
> 	ok, no objection.
> 	It's good patch.
> 
> > oom-select-task-from-tasklist-for-mempolicy-ooms.patch
> 	ok, no objection.
> 
> > oom-remove-special-handling-for-pagefault-ooms.patch
> 	ok, no objection.
> 
> > oom-badness-heuristic-rewrite.patch
> 	No. All of rewrite is bad idea. Please make separate some
> 	individual patches.
> 	All rewrite thing break bisectability. Perhaps it can steal
> 	a lot of time from MM developers.

We've talked about that before, and I remember specifically addressing why 
it couldn't be broken apart with any coherent understanding of what was 
happening.  I think the patchset itself was fairly well divided, but this 
specific patch touches many different areas and function signatures but 
are mainly localized to the oom killer.

> 	This patch have following parts.
> 	1) Add oom_score_adj

A patch that only adds oom_score_adj but doesn't do anything else?  It 
can't be used with the current badness function, it requires the rewrite 
of oom_badness().

> 	2) OOM score normalization

I prefer to do that with the addition of oom_score_adj since that tunable 
is meaninless until the score uses it.

> 	3) forkbomb detector

Ok, I can seperate that out but that's only a small part of the overall 
code.  Are there specific issues you'd like to address with that now 
instead of later?

> 	4) oom_forkbomb_thres new knob

I'd prefer to keep the introduction of the sysctl, again, with the 
addition of the functional code that uses it.

>  	5) Root user get 3% bonus instead 400%
> 

I don't understand this.

> 	all except (2) seems ok. but I'll review them again after separation.
> 	but you can't insert your copyright. 
> 

I can't add a copyright under the GPL for the new heuristic?  Why?

> > oom-deprecate-oom_adj-tunable.patch
> 	NAK. you can't change userland use-case at all. This patch
> 	only makes bug report flood and streal our time.
> 

It was Andrew's idea to deprecate this since the tunable works on a much 
higher granularity than oom_score_adj.  Andrew?

> > oom-replace-sysctls-with-quick-mode.patch
> 	NAK. To change sysctl makes confusion to userland.
> 	You have to prove such deprecated sysctl was alread unused.
> 	But the fact is, there is users. I have hear some times such
> 	use case and recent bug reporter said that's used.
> 
> 	https://bugzilla.kernel.org/show_bug.cgi?id=15058
> 

Already dropped.

> > oom-avoid-oom-killer-for-lowmem-allocations.patch
> 	I don't like this one. 64bit arch have big (e.g. 2/4G)
> 	DMA_ZONE/DMA32_ZONE. So, if we create small guest kernel
> 	on KVM (or Xen), Killing processes may help. IOW, this
> 	one is conceptually good. but this check way is brutal.
> 

It "may" help but has a significant probability of unnecessarily killing a 
task that won't free any lowmem, so how would you suggest we modify the 
oom killer to handle the allocation failure for GFP_DMA without negatively 
impacting the system?

> 	but even though it's ok. Let's go merge it. this patch is
> 	enough small.
> 	If any problem is occur, we can revert this one easily.
> 

Ok.

> > oom-remove-unnecessary-code-and-cleanup.patch
> 	ok, no objection.
> 
> > oom-default-to-killing-current-for-pagefault-ooms.patch
> 	NAK.
> 	1) this patch break panic_on_oom
> 	2) At this merge window, Nick change almost all architecture's
> 	   page hault handler. now almost all arch use
> 	   pagefault_out_of_memory. your description has been a bit obsoleted.
> 

Already changed, as previously mentioned in earlier posts.

> > oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> 	no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
> 	So, Don't we need push his patch instead?
> 

I think it all depends on the order in which this work is merged.

> > oom-hold-tasklist_lock-when-dumping-tasks.patch
> 	ok, no objection.
> 
> > oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch
> 	ok, no objection.
> 
> > oom-avoid-sending-exiting-tasks-a-sigkill.patch
> 	ok, no objection
> 
> > oom-cleanup-oom_kill_task.patch
> 	ok, no objection
> 
> > oom-cleanup-oom_badness.patch
> 	ok, no objection
> 
> The above "no objection" mean you can feel free to use my reviewed-by tag.
> 

Thanks for the detailed review, I look forward to your feedback when I 
post the updated series.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-25  9:46     ` David Rientjes
@ 2010-05-25 10:05       ` Nick Piggin
  2010-05-25 10:23         ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2010-05-25 10:05 UTC (permalink / raw)
  To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm, KAMEZAWA Hiroyuki

On Tue, May 25, 2010 at 02:46:06AM -0700, David Rientjes wrote:
> On Mon, 24 May 2010, Nick Piggin wrote:
> 
> > > > I've been notified that my entire oom killer rewrite has been dropped from 
> > > > -mm based solely on your feedback.  The problem is that I have absolutely 
> > > > no idea what issues you have with the changes that haven't already been 
> > > > addressed (nobody else does, either, it seems).
> > 
> > I had exactly the same issues with the userland kernel API changes and
> > the pagefault OOM regression it introduced, which I told you months ago.
> > You ignored me, it seems.
> > 
> 
> No, I didn't ignore you, your comments were specifically addressed with 
> oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch which only 
> deprecated the API change and wasn't even scheduled for removal until of 
> the end of 2011.  So there were no kernel API changes that went 

OK, you still never justified why that change is needed, or why it
is even a cleanup at all. You need actually a *good* reason to change
the user kernel API. A slight difference in opinion of what the sysctls
should be, or a slight change in implementation in the kernel, is not
a good reason in the slightest.

Look at something like /proc/sys/fs/inode-state and dentry-state or
the old syscalls we accumulate.

The point about not many people using the parameters I don't think is
a good one. 2.6.32 is being used in the next enterprise kernels so they
are going to be in production for 5 or more years. How many people will
have written scripts by the time they upgrade?

> unaddressed, perhaps you just didn't see that patch (I cc'd it to you on 
> April 27, though).
> 
> The pagefault oom behavior can now be changed back since you've converted 
> all existing architectures to call into the oom killer and not simply kill 
> current (thanks for that work!).  Previously, there was an inconsistency 
> amongst architectures in panic_on_oom behavior that we can now unify into 
> semantics that work across the board.

Thanks that would be good. I'll do another pass shortly to make sure
all archs are converted in this window if possible.

> I've made that change in my latest patch series which I'll be posting 
> shortly.

The other thing is that it makes perfect sense to put controversial
changes on hold even if you still think you can make a case for them.
We could have already gotten *most* (and the most useful to you) of
it merged by now. Then if there is a single controversial patch rather
than a big series, Andrew or Linus say is much more likely to take a
look and weigh in.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-25 10:05       ` Nick Piggin
@ 2010-05-25 10:23         ` David Rientjes
  2010-05-25 10:31           ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-25 10:23 UTC (permalink / raw)
  To: Nick Piggin; +Cc: KOSAKI Motohiro, linux-mm, KAMEZAWA Hiroyuki

On Tue, 25 May 2010, Nick Piggin wrote:

> OK, you still never justified why that change is needed, or why it
> is even a cleanup at all. You need actually a *good* reason to change
> the user kernel API. A slight difference in opinion of what the sysctls
> should be, or a slight change in implementation in the kernel, is not
> a good reason in the slightest.
> 

My personal opinion (and it's no longer what I'm advocating since I'm 
hoping that the much larger importance of this patchset can now be 
realized without this sideshow) is that if we have the opportunity to 
consolidate two virtually unused sysctls into one to cleanup procfs and 
make extending the oom killer easier in the future without the need to add 
additional sysctls for systems that cannot afford a tasklist scan that we 
should do it.  I didn't think anyone was using them (nobody can be cited 
as using them) and I felt the two year deprecation period was enough to be 
able to cleanup the ever-growing list of VM sysctls as well as making it 
easier to extend in the future by adding a sysctl that had semantics 
specifically directed to its target audience.

> The point about not many people using the parameters I don't think is
> a good one. 2.6.32 is being used in the next enterprise kernels so they
> are going to be in production for 5 or more years. How many people will
> have written scripts by the time they upgrade?
> 

If nothing else, this solidifies in my mind the notion that once a VM 
sysctl is added, it can never be removed.  I regret adding 
oom_kill_allocating_task at the mere suggestion of SGI when I made cpuset 
ooms do a tasklist scan without developing a more extendable interface for 
those large systems that we could piggyback on later for the same reasons 
(we'd now need to disable oom_dump_tasks once it's the default for those 
systems).  I felt the deprecation and removal would eventually be 
sufficient, but I'm no longer going to push it because it's sidetracking a 
much larger and seperate effort.

> > I've made that change in my latest patch series which I'll be posting 
> > shortly.
> 
> The other thing is that it makes perfect sense to put controversial
> changes on hold even if you still think you can make a case for them.

I personally like consistency even amongst architectures when it comes to 
the semantics of sysctls, so the pagefault oom handling changes were 
merely for compatibility until all architectures were converted to using 
the oom killer.  You've done that work, so I no longer have a problem with 
panicking when the pagefault handler is called.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-25 10:23         ` David Rientjes
@ 2010-05-25 10:31           ` Nick Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2010-05-25 10:31 UTC (permalink / raw)
  To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm, KAMEZAWA Hiroyuki

On Tue, May 25, 2010 at 03:23:01AM -0700, David Rientjes wrote:
> On Tue, 25 May 2010, Nick Piggin wrote:
> 
> > OK, you still never justified why that change is needed, or why it
> > is even a cleanup at all. You need actually a *good* reason to change
> > the user kernel API. A slight difference in opinion of what the sysctls
> > should be, or a slight change in implementation in the kernel, is not
> > a good reason in the slightest.
> > 
> 
> My personal opinion (and it's no longer what I'm advocating since I'm 
> hoping that the much larger importance of this patchset can now be 
> realized without this sideshow) is that if we have the opportunity to 
> consolidate two virtually unused sysctls into one to cleanup procfs and 
> make extending the oom killer easier in the future without the need to add 
> additional sysctls for systems that cannot afford a tasklist scan that we 
> should do it.  I didn't think anyone was using them (nobody can be cited 
> as using them) and I felt the two year deprecation period was enough to be 
> able to cleanup the ever-growing list of VM sysctls as well as making it 
> easier to extend in the future by adding a sysctl that had semantics 
> specifically directed to its target audience.
> 
> > The point about not many people using the parameters I don't think is
> > a good one. 2.6.32 is being used in the next enterprise kernels so they
> > are going to be in production for 5 or more years. How many people will
> > have written scripts by the time they upgrade?
> > 
> 
> If nothing else, this solidifies in my mind the notion that once a VM 
> sysctl is added, it can never be removed.  I regret adding 
> oom_kill_allocating_task at the mere suggestion of SGI when I made cpuset 
> ooms do a tasklist scan without developing a more extendable interface for 
> those large systems that we could piggyback on later for the same reasons 
> (we'd now need to disable oom_dump_tasks once it's the default for those 
> systems).  I felt the deprecation and removal would eventually be 
> sufficient, but I'm no longer going to push it because it's sidetracking a 
> much larger and seperate effort.

Thanks :) It's unfortunate, but if it is no real burden to carry it
then we really should. It's easy to get APIs wrong, we can't blame
the person proposing the API because everyone makes wrong or short
sighted choices.

What is needed is more careful flagging and review of API additions
and changes.

 
> > > I've made that change in my latest patch series which I'll be posting 
> > > shortly.
> > 
> > The other thing is that it makes perfect sense to put controversial
> > changes on hold even if you still think you can make a case for them.
> 
> I personally like consistency even amongst architectures when it comes to 
> the semantics of sysctls, so the pagefault oom handling changes were 
> merely for compatibility until all architectures were converted to using 
> the oom killer.  You've done that work, so I no longer have a problem with 
> panicking when the pagefault handler is called.

Fair enough. I should have just shut up and fixed my half done changes
rather than arguing :) So apologies for that!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-25  9:55   ` David Rientjes
@ 2010-05-26  0:02     ` David Rientjes
  2010-05-28  5:27       ` KOSAKI Motohiro
  2010-05-28  5:25     ` KOSAKI Motohiro
  1 sibling, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-26  0:02 UTC (permalink / raw)
  To: KOSAKI Motohiro, Oleg Nesterov; +Cc: linux-mm, KAMEZAWA Hiroyuki

On Tue, 25 May 2010, David Rientjes wrote:

> > > oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> > 	no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
> > 	So, Don't we need push his patch instead?
> > 
> 
> I think it all depends on the order in which this work is merged.
> 

I just noticed that Oleg's patches were dropped as well from -mm so I'll 
merge them into my set and repost them as well.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-25  9:42   ` David Rientjes
@ 2010-05-26  0:17     ` KAMEZAWA Hiroyuki
  2010-05-26  1:40       ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-26  0:17 UTC (permalink / raw)
  To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm

On Tue, 25 May 2010 02:42:14 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 20 May 2010, KAMEZAWA Hiroyuki wrote:
> 
> > I've pointed out that "normalized" parameter doesn't seem to work well in some
> > situaion (in cluster). I hope you'll have an extra interface as
> > 
> > 	echo 3G > /proc/<pid>/oom_indemification
> > 
> > to allow users have "absolute value" setting.
> > (If the admin know usual memory usage of an application, we can only
> >  add badness to extra memory usage.)
> > 
> > To be honest, I can't fully understand why we need _normalized_ parameter. Why
> > oom_adj _which is now used_ is not enough for setting "relative importance" ?
> > 
> 
> The only sane badness heuristic will be one that effectively compares all 
> eligible tasks for oom kill in a way that are relative to one another; I'm 
> concerned that a tunable that is based on a pure memory quantity requires 
> specific knowledge of the system (or memcg, cpuset, etc) capacity before 
> it is meaningful.  In other words, I opted to use a relative proportion so 
> that when tasks are constrained to cpusets or memcgs or mempolicies they 
> become part of a "virtualized system" where the proportion is then used in 
> calculation of the total amount of system RAM, memcg limit, cpuset mems 
> capacities, etc, without knowledge of what that value actually is.  So 
> "echo 3G" may be valid in your example when not constrained to any cgroup 
> or mempolicy but becomes invalid if I attach it to a cpuset with a single 
> node of 1G capacity.  When oom_score_adj, we can specify the proportion 
> "of the resources that the application has access to" in comparison to 
> other applications that share those resources to determine oom killing 
> priority.  I think that's a very powerful interface and your suggestion 
> could easily be implemented in userspace with a simple divide, thus we 
> don't need kernel support for it.
> 
I know admins will be able to write a script. But, my point is
"please don't force admins to write such a hacky scripts."

For example, an admin uses an application which always use 3G bytes adn it's
valid and sane use for the application. When he run it on a server with
4G system and 8G system, he has to change the value for oom_score_adj.


One good point of old oom_adj is that it's not influenced by environment.
Then, X-window applications set it's oom_adj to be fixed value. 
IIUC, they're hardcoded with fixed value, now. 

Even if my customer may use only OOM_DISABLE, I think using oom_score_adj
is too difficult for usual users.


Thanks,
-Kame






--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-26  0:17     ` KAMEZAWA Hiroyuki
@ 2010-05-26  1:40       ` David Rientjes
  2010-05-26  2:00         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-26  1:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: KOSAKI Motohiro, linux-mm

On Wed, 26 May 2010, KAMEZAWA Hiroyuki wrote:

> > The only sane badness heuristic will be one that effectively compares all 
> > eligible tasks for oom kill in a way that are relative to one another; I'm 
> > concerned that a tunable that is based on a pure memory quantity requires 
> > specific knowledge of the system (or memcg, cpuset, etc) capacity before 
> > it is meaningful.  In other words, I opted to use a relative proportion so 
> > that when tasks are constrained to cpusets or memcgs or mempolicies they 
> > become part of a "virtualized system" where the proportion is then used in 
> > calculation of the total amount of system RAM, memcg limit, cpuset mems 
> > capacities, etc, without knowledge of what that value actually is.  So 
> > "echo 3G" may be valid in your example when not constrained to any cgroup 
> > or mempolicy but becomes invalid if I attach it to a cpuset with a single 
> > node of 1G capacity.  When oom_score_adj, we can specify the proportion 
> > "of the resources that the application has access to" in comparison to 
> > other applications that share those resources to determine oom killing 
> > priority.  I think that's a very powerful interface and your suggestion 
> > could easily be implemented in userspace with a simple divide, thus we 
> > don't need kernel support for it.
> > 
> I know admins will be able to write a script. But, my point is
> "please don't force admins to write such a hacky scripts."
> 

It's not necessarily the memory quantity that is interesting in this case 
(or proportion of available memory), it's how the badness() score is 
altered relative to other eligible tasks that end up changing the oom kill 
priority list.  If we were to implement a tunable that only took a memory 
quantity, it would require specific knowledge of the system's capacity to 
make any sense compared to other tasks.  An oom_score_adj of 125MB means 
vastly different things on a 4GB system compared to 64GB system and admins 
do not want to update their script anytime they add (or hotadd) memory or 
run on a variety of systems that don't have the same capacities.  What 
they want to do is specify the priority of an application either to prefer 
it or protect it from oom kill by saying "this application can use 25% 
more memory available to it than everything else" or "this application 
should be killed if it's not leaving at least 25% to everything else."

> For example, an admin uses an application which always use 3G bytes adn it's
> valid and sane use for the application. When he run it on a server with
> 4G system and 8G system, he has to change the value for oom_score_adj.
> 

That's the same if you were to implement a memory quantity instead of a 
proportion for oom_score_adj and depends on how you want to protect or 
prefer that application.  For a 3G application on a 4G machine, an 
oom_score_adj of 250 is legitimate if you want to ensure it never uses 
more than 3G and is always killed first when it does.  For the 8G machine, 
you can't make the same killing choice if another instance of the same 
application is using 5G instead of 3G.  See the difference?  In that case, 
it may not be the correct choice for oom kill and we should kill something 
else: the 5G memory leaker.  That requires userspace intervention to 
identify, but unless we mandate the expected memory use is spelled out for 
every single application (which we can't), there's no way to use a fixed 
memory quantity to determine relative priority.

If you really did always want to kill that 3G task, an oom_score_adj value 
of +1000 would always work just like a value of +15 does for oom_adj.

> One good point of old oom_adj is that it's not influenced by environment.
> Then, X-window applications set it's oom_adj to be fixed value. 
> IIUC, they're hardcoded with fixed value, now. 
> 

It _is_ influenced by environment, just indirectly.  It's a bitshift on 
the badness() score so for any other usecase other than a complete 
polarization of the score to either always prefer or completely disable 
oom killing for a task, it's practically useless.  The bitshift will 
increase or decrease the score but that score will be ranked according to 
the scores of other tasks on the system.  So if a task consuming 400K of 
memory has a badness score of 100 with an oom_adj value of +10, the end 
result is a score of 102400 which would represent about 10% of system 
memory on a 4G system but about 1.5% of system memory on a 64GB system.  
So the actual preference of a task, minus the usecase of polarizing the 
task with oom_adj, is completely dependent on the size of system RAM.

oom_adj must also be altered anytime a task is attached to a cpuset or 
memcg (or even mempolicy now) since its effect on badness will skew how 
the score is compared relative to all other tasks in that cpuset, memcg, 
or attached to the mempolicy nodes.

> Even if my customer may use only OOM_DISABLE, I think using oom_score_adj
> is too difficult for usual users.
> 

How is oom_score_adj, which has a well-defined unit of "proportion of 
available memory" more difficult to use than oom_adj, which exponentially 
increases a tasks badness score with no units other than "badness() 
quantum"?  The typical use case for these users is simply to polarize the 
score, anyway, which is still possible with oom_score_adj: oom_score_adj 
of -1000 is equivalent oom_adj of -17 and oom_score_adj of +1000 is 
equivalent to oom_adj of +15.  That conversion is done automatically for 
existing users of oom_adj, so I find oom_score_adj to be much more 
predicatable and scalable than oom_adj.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-26  1:40       ` David Rientjes
@ 2010-05-26  2:00         ` KAMEZAWA Hiroyuki
  2010-05-26  3:26           ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-26  2:00 UTC (permalink / raw)
  To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm

On Tue, 25 May 2010 18:40:36 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 26 May 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > The only sane badness heuristic will be one that effectively compares all 
> > > eligible tasks for oom kill in a way that are relative to one another; I'm 
> > > concerned that a tunable that is based on a pure memory quantity requires 
> > > specific knowledge of the system (or memcg, cpuset, etc) capacity before 
> > > it is meaningful.  In other words, I opted to use a relative proportion so 
> > > that when tasks are constrained to cpusets or memcgs or mempolicies they 
> > > become part of a "virtualized system" where the proportion is then used in 
> > > calculation of the total amount of system RAM, memcg limit, cpuset mems 
> > > capacities, etc, without knowledge of what that value actually is.  So 
> > > "echo 3G" may be valid in your example when not constrained to any cgroup 
> > > or mempolicy but becomes invalid if I attach it to a cpuset with a single 
> > > node of 1G capacity.  When oom_score_adj, we can specify the proportion 
> > > "of the resources that the application has access to" in comparison to 
> > > other applications that share those resources to determine oom killing 
> > > priority.  I think that's a very powerful interface and your suggestion 
> > > could easily be implemented in userspace with a simple divide, thus we 
> > > don't need kernel support for it.
> > > 
> > I know admins will be able to write a script. But, my point is
> > "please don't force admins to write such a hacky scripts."
> > 
> 
> It's not necessarily the memory quantity that is interesting in this case 
> (or proportion of available memory), it's how the badness() score is 
> altered relative to other eligible tasks that end up changing the oom kill 
> priority list.  If we were to implement a tunable that only took a memory 
> quantity, it would require specific knowledge of the system's capacity to 
> make any sense compared to other tasks.  An oom_score_adj of 125MB means 
> vastly different things on a 4GB system compared to 64GB system and admins 
> do not want to update their script anytime they add (or hotadd) memory or 
> run on a variety of systems that don't have the same capacities. 

IMHO, importance of application is consistent under all hosts in the system.
(the system here means a system maintained by a team of admins to do a service.)

It's not be influenced by the amount of memory, other applications, etc..
If influenced, it's a chaos for admins.
It seems that's fundamental difference in ideas among you and me.


> > For example, an admin uses an application which always use 3G bytes adn it's
> > valid and sane use for the application. When he run it on a server with
> > 4G system and 8G system, he has to change the value for oom_score_adj.
> > 
> 
> That's the same if you were to implement a memory quantity instead of a 
> proportion for oom_score_adj and depends on how you want to protect or 
> prefer that application.  For a 3G application on a 4G machine, an 
> oom_score_adj of 250 is legitimate if you want to ensure it never uses 
> more than 3G and is always killed first when it does.  For the 8G machine, 
> you can't make the same killing choice if another instance of the same 
> application is using 5G instead of 3G.  See the difference?  In that case, 
> it may not be the correct choice for oom kill and we should kill something 
> else: the 5G memory leaker.  That requires userspace intervention to 
> identify, but unless we mandate the expected memory use is spelled out for 
> every single application (which we can't), there's no way to use a fixed 
> memory quantity to determine relative priority.
> 

I just don't believe relative priority ;)
Then, my customer will just disable oom or will use panic_on_oom.
That's why I wrote don't take my words serious. 
I wonder if people wants precise control of oom_score_adj, they should
use memcg and put apps into containers. In that case, static priority
and will be useful.

> If you really did always want to kill that 3G task, an oom_score_adj value 
> of +1000 would always work just like a value of +15 does for oom_adj.
> 
> > One good point of old oom_adj is that it's not influenced by environment.
> > Then, X-window applications set it's oom_adj to be fixed value. 
> > IIUC, they're hardcoded with fixed value, now. 
> > 
> 
> It _is_ influenced by environment, just indirectly.  It's a bitshift on 
> the badness() score so for any other usecase other than a complete 
> polarization of the score to either always prefer or completely disable 
> oom killing for a task, it's practically useless.  The bitshift will 
> increase or decrease the score but that score will be ranked according to 
> the scores of other tasks on the system.  So if a task consuming 400K of 
> memory has a badness score of 100 with an oom_adj value of +10, the end 
> result is a score of 102400 which would represent about 10% of system 
> memory on a 4G system but about 1.5% of system memory on a 64GB system.  
> So the actual preference of a task, minus the usecase of polarizing the 
> task with oom_adj, is completely dependent on the size of system RAM.
> 
> oom_adj must also be altered anytime a task is attached to a cpuset or 
> memcg (or even mempolicy now) since its effect on badness will skew how 
> the score is compared relative to all other tasks in that cpuset, memcg, 
> or attached to the mempolicy nodes.
> 

I agree that oom_score_adj is better than current oom_adj.

Thanks,
-Kame





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-26  2:00         ` KAMEZAWA Hiroyuki
@ 2010-05-26  3:26           ` David Rientjes
  0 siblings, 0 replies; 18+ messages in thread
From: David Rientjes @ 2010-05-26  3:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: KOSAKI Motohiro, linux-mm

On Wed, 26 May 2010, KAMEZAWA Hiroyuki wrote:

> > It's not necessarily the memory quantity that is interesting in this case 
> > (or proportion of available memory), it's how the badness() score is 
> > altered relative to other eligible tasks that end up changing the oom kill 
> > priority list.  If we were to implement a tunable that only took a memory 
> > quantity, it would require specific knowledge of the system's capacity to 
> > make any sense compared to other tasks.  An oom_score_adj of 125MB means 
> > vastly different things on a 4GB system compared to 64GB system and admins 
> > do not want to update their script anytime they add (or hotadd) memory or 
> > run on a variety of systems that don't have the same capacities. 
> 
> IMHO, importance of application is consistent under all hosts in the system.
> (the system here means a system maintained by a team of admins to do a service.)
> 

And it's consistent whether it's unbound to any cgroup, it's attached to a 
cpuset, a memcg, or has a mempolicy restriction.  Its importance in those 
"virtualzied environments" is the same when shared with other tasks, 
that's why proportions work well: I can decide that my application should 
be targeted first by the oom killer when it is using more than 25% of 
system memory, for example.  When I attach that task to a cpuset, the 
priority is the same, I've just adjusted the amount of available memory 
out from under a set of tasks.  The point is that the script or admin that 
sets oom_score_adj need not know what resources the application has, but 
rather its memory expectations relative to other applications that share 
the same resources.

> It's not be influenced by the amount of memory, other applications, etc..
> If influenced, it's a chaos for admins.
> It seems that's fundamental difference in ideas among you and me.
> 

Other than polarizing tasks with oom_score_adj of -1000 or +1000, you must 
consider the relative importance of an application to other applications, 
that's the point of adjusting badness scoring: so we can influence which 
task is killed by the kernel instead of simply the task that is most 
memory hogging.  When an application is using more memory than desired (or 
expected), the cost of business would mandate that the task is killed and 
that threshold is definable in clear units from userspace via 
oom_score_adj and not oom_adj.

> > That's the same if you were to implement a memory quantity instead of a 
> > proportion for oom_score_adj and depends on how you want to protect or 
> > prefer that application.  For a 3G application on a 4G machine, an 
> > oom_score_adj of 250 is legitimate if you want to ensure it never uses 
> > more than 3G and is always killed first when it does.  For the 8G machine, 
> > you can't make the same killing choice if another instance of the same 
> > application is using 5G instead of 3G.  See the difference?  In that case, 
> > it may not be the correct choice for oom kill and we should kill something 
> > else: the 5G memory leaker.  That requires userspace intervention to 
> > identify, but unless we mandate the expected memory use is spelled out for 
> > every single application (which we can't), there's no way to use a fixed 
> > memory quantity to determine relative priority.
> > 
> 
> I just don't believe relative priority ;)

If application A typically consumes 6G on an 8G system and that's an 
important task, it's possible to protect it from being oom killed as the 
result of the memory usage of the remaining applications B and C with 
oom_score_adj; oom_adj does not have a clear interface for being able to 
define when to kill A when it uses more than 7G but not to kill it when it 
uses 6G.  That's the motivation behind developing a relative priority like 
oom_score_adj.

> That's why I wrote don't take my words serious. 
> I wonder if people wants precise control of oom_score_adj, they should
> use memcg and put apps into containers. In that case, static priority
> and will be useful.
> 

Indeed, using a hierarchy of memcgs is another way to do the same thing 
but seems like overkill if I'm running only a webserver and a few 
monitoring applications.  oom_score_adj also applies to cpusets and 
mempolicies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-25  9:55   ` David Rientjes
  2010-05-26  0:02     ` David Rientjes
@ 2010-05-28  5:25     ` KOSAKI Motohiro
  2010-06-01  7:30       ` David Rientjes
  1 sibling, 1 reply; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-05-28  5:25 UTC (permalink / raw)
  To: David Rientjes; +Cc: kosaki.motohiro, linux-mm, KAMEZAWA Hiroyuki

Hi

> On Mon, 24 May 2010, KOSAKI Motohiro wrote:
> 
> > > I've been notified that my entire oom killer rewrite has been dropped from 
> > > -mm based solely on your feedback.  The problem is that I have absolutely 
> > > no idea what issues you have with the changes that haven't already been 
> > > addressed (nobody else does, either, it seems).
> > 
> > That's simple. A regression and an incompatibility are absolutely
> > unacceptable. They should be removed. Your patches have some funny parts,
> > but, afaik, nobody said funny requirement itself is wrong. They only said
> > your requirement don't have to cause any pain to other users.
> > 
> > Zero risk patches are always acceptable.
> > 
> 
> When you see these "funny parts," please let me know what they are.  The 
> were was no incompatibility issue after 
> oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch was merged, 
> the interface was simply deprecated.  Arguing against the deprecation is 
> understandable and quite frankly something I'd like to avoid since it's 
> apparently hanging up the larger importance of the work, so I've dropped 
> the consolidation (and subsequent deprecation of oom_kill_allocating_task) 
> of the sysctls from my latest patch series.

That's said, don't deprecated current interface. Other MM developers makes
effort to reduce a number of oom bug report. I don't hope you run just opposite
direction.


> > I've reviewed all of your patches. the result is here.
> > 
> > > oom-filter-tasks-not-sharing-the-same-cpuset.patch
> > 	ok, no objection.
> > 	I'm still afraid this patch reinstanciate old bug. but at that time,
> > 	we can drop it solely. this patch is enough bisectable.
> > 
> > > oom-sacrifice-child-with-highest-badness-score-for-parent.patch
> > 	ok, no objection.
> > 	It's good patch.
> > 
> > > oom-select-task-from-tasklist-for-mempolicy-ooms.patch
> > 	ok, no objection.
> > 
> > > oom-remove-special-handling-for-pagefault-ooms.patch
> > 	ok, no objection.
> > 
> > > oom-badness-heuristic-rewrite.patch
> > 	No. All of rewrite is bad idea. Please make separate some
> > 	individual patches.
> > 	All rewrite thing break bisectability. Perhaps it can steal
> > 	a lot of time from MM developers.
> 
> We've talked about that before, and I remember specifically addressing why 
> it couldn't be broken apart with any coherent understanding of what was 
> happening.  I think the patchset itself was fairly well divided, but this 
> specific patch touches many different areas and function signatures but 
> are mainly localized to the oom killer.

Heh, that's ok.
I'll merge apart of this one If you can't. The rule is simple, rewrite 
all patches will never merge. but ok too. you can choice no merge.


> > 	This patch have following parts.
> > 	1) Add oom_score_adj
> 
> A patch that only adds oom_score_adj but doesn't do anything else?  It 
> can't be used with the current badness function, it requires the rewrite 
> of oom_badness().

ok. you can drop oom_score_adj too.

> > 	2) OOM score normalization
> 
> I prefer to do that with the addition of oom_score_adj since that tunable 
> is meaninless until the score uses it.

No. This one have no justification. BAD IDEA.
Any core heuristic change need to prove to improve desktop use case.

That's said, now lkml have one or two oom bug report per month. We have
to make effort to reduce it. Please don't append new confusion source.



> > 	3) forkbomb detector
> 
> Ok, I can seperate that out but that's only a small part of the overall 
> code.  Are there specific issues you'd like to address with that now 
> instead of later?

reviewability and bisectability are one of most important issue. that's all.


> > 	4) oom_forkbomb_thres new knob
> 
> I'd prefer to keep the introduction of the sysctl, again, with the 
> addition of the functional code that uses it.

I'm not against new knob. I only request separation.


> >  	5) Root user get 3% bonus instead 400%
> 
> I don't understand this.

Now, our oom have "if (root-user) points /= 4" logic, I wrote it as 400%.


> > 	all except (2) seems ok. but I'll review them again after separation.
> > 	but you can't insert your copyright. 
> > 
> 
> I can't add a copyright under the GPL for the new heuristic?  Why?

1) too small work
2) In this area, almost work had been lead to kamezawa-san. you don't have
   proper right.

(1) mean other people of joining this improvement can't append it too.


> > > oom-deprecate-oom_adj-tunable.patch
> > 	NAK. you can't change userland use-case at all. This patch
> > 	only makes bug report flood and streal our time.
> 
> It was Andrew's idea to deprecate this since the tunable works on a much 
> higher granularity than oom_score_adj.  Andrew?

If we can create really better new knob, the end users naturally migrate
to use new one. thus, we can remove older one without pain.

again, we still have a bit high bug report rate in oom area. Please don't
increase it.


> > > oom-replace-sysctls-with-quick-mode.patch
> > 	NAK. To change sysctl makes confusion to userland.
> > 	You have to prove such deprecated sysctl was alread unused.
> > 	But the fact is, there is users. I have hear some times such
> > 	use case and recent bug reporter said that's used.
> > 
> > 	https://bugzilla.kernel.org/show_bug.cgi?id=15058
> > 
> 
> Already dropped.

thanks.


> > > oom-avoid-oom-killer-for-lowmem-allocations.patch
> > 	I don't like this one. 64bit arch have big (e.g. 2/4G)
> > 	DMA_ZONE/DMA32_ZONE. So, if we create small guest kernel
> > 	on KVM (or Xen), Killing processes may help. IOW, this
> > 	one is conceptually good. but this check way is brutal.
> 
> It "may" help but has a significant probability of unnecessarily killing a 
> task that won't free any lowmem, so how would you suggest we modify the 
> oom killer to handle the allocation failure for GFP_DMA without negatively 
> impacting the system?

I prefer to check numbers of anon pages in the zone. I mean we want
"skip oom if the zone have no freeable memory", thus just do it.

But again, I'm ok yours too.

> 
> > 	but even though it's ok. Let's go merge it. this patch is
> > 	enough small.
> > 	If any problem is occur, we can revert this one easily.
> > 
> 
> Ok.
> 
> > > oom-remove-unnecessary-code-and-cleanup.patch
> > 	ok, no objection.
> > 
> > > oom-default-to-killing-current-for-pagefault-ooms.patch
> > 	NAK.
> > 	1) this patch break panic_on_oom
> > 	2) At this merge window, Nick change almost all architecture's
> > 	   page hault handler. now almost all arch use
> > 	   pagefault_out_of_memory. your description has been a bit obsoleted.
> > 
> 
> Already changed, as previously mentioned in earlier posts.

ok, thanks.

> 
> > > oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> > 	no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
> > 	So, Don't we need push his patch instead?
> > 
> 
> I think it all depends on the order in which this work is merged.
> 
> > > oom-hold-tasklist_lock-when-dumping-tasks.patch
> > 	ok, no objection.
> > 
> > > oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch
> > 	ok, no objection.
> > 
> > > oom-avoid-sending-exiting-tasks-a-sigkill.patch
> > 	ok, no objection
> > 
> > > oom-cleanup-oom_kill_task.patch
> > 	ok, no objection
> > 
> > > oom-cleanup-oom_badness.patch
> > 	ok, no objection
> > 
> > The above "no objection" mean you can feel free to use my reviewed-by tag.
> > 
> 
> Thanks for the detailed review, I look forward to your feedback when I 
> post the updated series.

I'm thanks too.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-26  0:02     ` David Rientjes
@ 2010-05-28  5:27       ` KOSAKI Motohiro
  0 siblings, 0 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-05-28  5:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Oleg Nesterov, linux-mm, KAMEZAWA Hiroyuki

> On Tue, 25 May 2010, David Rientjes wrote:
> 
> > > > oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> > > 	no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
> > > 	So, Don't we need push his patch instead?
> > > 
> > 
> > I think it all depends on the order in which this work is merged.
> > 
> 
> I just noticed that Oleg's patches were dropped as well from -mm so I'll 
> merge them into my set and repost them as well.

Oops, no. It shouldn't. 
His patch is important and should be merged ASAP. I believe standalone merge is best.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: oom killer rewrite
  2010-05-28  5:25     ` KOSAKI Motohiro
@ 2010-06-01  7:30       ` David Rientjes
  0 siblings, 0 replies; 18+ messages in thread
From: David Rientjes @ 2010-06-01  7:30 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, KAMEZAWA Hiroyuki

On Fri, 28 May 2010, KOSAKI Motohiro wrote:

> > When you see these "funny parts," please let me know what they are.  The 
> > were was no incompatibility issue after 
> > oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch was merged, 
> > the interface was simply deprecated.  Arguing against the deprecation is 
> > understandable and quite frankly something I'd like to avoid since it's 
> > apparently hanging up the larger importance of the work, so I've dropped 
> > the consolidation (and subsequent deprecation of oom_kill_allocating_task) 
> > of the sysctls from my latest patch series.
> 
> That's said, don't deprecated current interface. Other MM developers makes
> effort to reduce a number of oom bug report. I don't hope you run just opposite
> direction.
> 

It's dropped, as I said above.

> > > > oom-badness-heuristic-rewrite.patch
> > > 	No. All of rewrite is bad idea. Please make separate some
> > > 	individual patches.
> > > 	All rewrite thing break bisectability. Perhaps it can steal
> > > 	a lot of time from MM developers.
> > 
> > We've talked about that before, and I remember specifically addressing why 
> > it couldn't be broken apart with any coherent understanding of what was 
> > happening.  I think the patchset itself was fairly well divided, but this 
> > specific patch touches many different areas and function signatures but 
> > are mainly localized to the oom killer.
> 
> Heh, that's ok.
> I'll merge apart of this one If you can't. The rule is simple, rewrite 
> all patches will never merge. but ok too. you can choice no merge.
> 

You can't break this patch apart functionally into anything that's 
meaninful, it doesn't help to go around changing function signatures in 
one patch when the arguments are left unused, or adding sysctls in one 
patch when its unused, etc.  I gave a tip for reviewing of this particular 
patch since all of the changes are isolated in oom_badness(): merge the 
patch into your own tree and review the heuristic.

> > > 	This patch have following parts.
> > > 	1) Add oom_score_adj
> > 
> > A patch that only adds oom_score_adj but doesn't do anything else?  It 
> > can't be used with the current badness function, it requires the rewrite 
> > of oom_badness().
> 
> ok. you can drop oom_score_adj too.
> 

I don't know what this means.  oom_score_adj is the new tunable that is in 
the units that the new heuristic uses, they obviously need to be merged 
together.

> > > 	2) OOM score normalization
> > 
> > I prefer to do that with the addition of oom_score_adj since that tunable 
> > is meaninless until the score uses it.
> 
> No. This one have no justification. BAD IDEA.
> Any core heuristic change need to prove to improve desktop use case.
> 

This entire rewrite was largely based on improving the desktop use case, 
the new heuristic doesn't kill KDE when you fork a large memory-hogging 
process on the desktop when it currently does.

> That's said, now lkml have one or two oom bug report per month. We have
> to make effort to reduce it. Please don't append new confusion source.
> 

My heuristic is far superior to the current heuristic and this introduces 
a new tunable that allows userspace to tune oom killer prioritization in 
units that admins understand and in a linear way, not exponentially on the 
total VM size that the current heuristic does.

> > > 	3) forkbomb detector
> > 
> > Ok, I can seperate that out but that's only a small part of the overall 
> > code.  Are there specific issues you'd like to address with that now 
> > instead of later?
> 
> reviewability and bisectability are one of most important issue. that's all.
> 

This is done and has been proposed as patch 09/18 in my latest posting, 
thanks.

> > >  	5) Root user get 3% bonus instead 400%
> > 
> > I don't understand this.
> 
> Now, our oom have "if (root-user) points /= 4" logic, I wrote it as 400%.
> 

This bias should be equivalent to that given in __vm_enough_memory() as 
specified in the patch description.

> > I can't add a copyright under the GPL for the new heuristic?  Why?
> 
> 1) too small work

 14 files changed, 680 insertions(+), 332 deletions(-)

That's too small?

> 2) In this area, almost work had been lead to kamezawa-san. you don't have
>    proper right.
> 

Absolutely ridiculous, the only thing that this entire patchset builds 
upon is the use of MM_SWAPENTS which was introduced to be exported via 
/proc/pid/status.

> (1) mean other people of joining this improvement can't append it too.
> 

I don't know what this means.

If you object to adding Google's copyright to this work, please merge that 
as an additional patch and submit it to Andrew along with my nacked-by.  
He can remove Google's copyright to the rewrite if he'd like.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-06-01  7:30 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-19 22:14 oom killer rewrite David Rientjes
2010-05-20  0:27 ` KAMEZAWA Hiroyuki
2010-05-25  9:42   ` David Rientjes
2010-05-26  0:17     ` KAMEZAWA Hiroyuki
2010-05-26  1:40       ` David Rientjes
2010-05-26  2:00         ` KAMEZAWA Hiroyuki
2010-05-26  3:26           ` David Rientjes
2010-05-24  1:09 ` KOSAKI Motohiro
2010-05-24  7:07   ` Nick Piggin
2010-05-25  9:46     ` David Rientjes
2010-05-25 10:05       ` Nick Piggin
2010-05-25 10:23         ` David Rientjes
2010-05-25 10:31           ` Nick Piggin
2010-05-25  9:55   ` David Rientjes
2010-05-26  0:02     ` David Rientjes
2010-05-28  5:27       ` KOSAKI Motohiro
2010-05-28  5:25     ` KOSAKI Motohiro
2010-06-01  7:30       ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).