* oom killer rewrite
@ 2010-05-19 22:14 David Rientjes
2010-05-20 0:27 ` KAMEZAWA Hiroyuki
2010-05-24 1:09 ` KOSAKI Motohiro
0 siblings, 2 replies; 18+ messages in thread
From: David Rientjes @ 2010-05-19 22:14 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: linux-mm
KOSAKI,
I've been notified that my entire oom killer rewrite has been dropped from
-mm based solely on your feedback. The problem is that I have absolutely
no idea what issues you have with the changes that haven't already been
addressed (nobody else does, either, it seems).
The last work I've done on the patches are to ask those involved in the
review (including you) and linux-mm whether there were any outstanding
issues that anyone has, and I've asked that twice. I've received no
response either time.
Please respond with a list of your objections to the rewrite (which is
available at
http://www.kernel.org/pub/linux/kernel/people/rientjes/oom-killer-rewrite
so we can move forward.
Thank you.
David
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-19 22:14 oom killer rewrite David Rientjes
@ 2010-05-20 0:27 ` KAMEZAWA Hiroyuki
2010-05-25 9:42 ` David Rientjes
2010-05-24 1:09 ` KOSAKI Motohiro
1 sibling, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-20 0:27 UTC (permalink / raw)
To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm
On Wed, 19 May 2010 15:14:42 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> KOSAKI,
>
> I've been notified that my entire oom killer rewrite has been dropped from
> -mm based solely on your feedback. The problem is that I have absolutely
> no idea what issues you have with the changes that haven't already been
> addressed (nobody else does, either, it seems).
>
I've pointed out that "normalized" parameter doesn't seem to work well in some
situaion (in cluster). I hope you'll have an extra interface as
echo 3G > /proc/<pid>/oom_indemification
to allow users have "absolute value" setting.
(If the admin know usual memory usage of an application, we can only
add badness to extra memory usage.)
To be honest, I can't fully understand why we need _normalized_ parameter. Why
oom_adj _which is now used_ is not enough for setting "relative importance" ?
Does google guys controls importance of processes in very small step ?
And, IIRC, Nick pointed out that "don't remove _used_ interfaces just because
you hate it or it seems not clean". So, I recommend you to drop sysctl changes.
I think the whole concept of your patch series is good and I like it.
But changes in interfaces seem not very sensible.
Don't take my word very serious but I don't like changes in interface.
Cheers,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-19 22:14 oom killer rewrite David Rientjes
2010-05-20 0:27 ` KAMEZAWA Hiroyuki
@ 2010-05-24 1:09 ` KOSAKI Motohiro
2010-05-24 7:07 ` Nick Piggin
2010-05-25 9:55 ` David Rientjes
1 sibling, 2 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-05-24 1:09 UTC (permalink / raw)
To: David Rientjes; +Cc: kosaki.motohiro, linux-mm, KAMEZAWA Hiroyuki
Hi
> KOSAKI,
>
> I've been notified that my entire oom killer rewrite has been dropped from
> -mm based solely on your feedback. The problem is that I have absolutely
> no idea what issues you have with the changes that haven't already been
> addressed (nobody else does, either, it seems).
That's simple. A regression and an incompatibility are absolutely
unacceptable. They should be removed. Your patches have some funny parts,
but, afaik, nobody said funny requirement itself is wrong. They only said
your requirement don't have to cause any pain to other users.
Zero risk patches are always acceptable.
>
> The last work I've done on the patches are to ask those involved in the
> review (including you) and linux-mm whether there were any outstanding
> issues that anyone has, and I've asked that twice. I've received no
> response either time.
>
> Please respond with a list of your objections to the rewrite (which is
> available at
> http://www.kernel.org/pub/linux/kernel/people/rientjes/oom-killer-rewrite
> so we can move forward.
I've reviewed all of your patches. the result is here.
> oom-filter-tasks-not-sharing-the-same-cpuset.patch
ok, no objection.
I'm still afraid this patch reinstanciate old bug. but at that time,
we can drop it solely. this patch is enough bisectable.
> oom-sacrifice-child-with-highest-badness-score-for-parent.patch
ok, no objection.
It's good patch.
> oom-select-task-from-tasklist-for-mempolicy-ooms.patch
ok, no objection.
> oom-remove-special-handling-for-pagefault-ooms.patch
ok, no objection.
> oom-badness-heuristic-rewrite.patch
No. All of rewrite is bad idea. Please make separate some
individual patches.
All rewrite thing break bisectability. Perhaps it can steal
a lot of time from MM developers.
This patch have following parts.
1) Add oom_score_adj
2) OOM score normalization
3) forkbomb detector
4) oom_forkbomb_thres new knob
5) Root user get 3% bonus instead 400%
all except (2) seems ok. but I'll review them again after separation.
but you can't insert your copyright.
> oom-deprecate-oom_adj-tunable.patch
NAK. you can't change userland use-case at all. This patch
only makes bug report flood and streal our time.
> oom-replace-sysctls-with-quick-mode.patch
NAK. To change sysctl makes confusion to userland.
You have to prove such deprecated sysctl was alread unused.
But the fact is, there is users. I have hear some times such
use case and recent bug reporter said that's used.
https://bugzilla.kernel.org/show_bug.cgi?id=15058
> oom-avoid-oom-killer-for-lowmem-allocations.patch
I don't like this one. 64bit arch have big (e.g. 2/4G)
DMA_ZONE/DMA32_ZONE. So, if we create small guest kernel
on KVM (or Xen), Killing processes may help. IOW, this
one is conceptually good. but this check way is brutal.
but even though it's ok. Let's go merge it. this patch is
enough small.
If any problem is occur, we can revert this one easily.
> oom-remove-unnecessary-code-and-cleanup.patch
ok, no objection.
> oom-default-to-killing-current-for-pagefault-ooms.patch
NAK.
1) this patch break panic_on_oom
2) At this merge window, Nick change almost all architecture's
page hault handler. now almost all arch use
pagefault_out_of_memory. your description has been a bit obsoleted.
> oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
So, Don't we need push his patch instead?
> oom-hold-tasklist_lock-when-dumping-tasks.patch
ok, no objection.
> oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch
ok, no objection.
> oom-avoid-sending-exiting-tasks-a-sigkill.patch
ok, no objection
> oom-cleanup-oom_kill_task.patch
ok, no objection
> oom-cleanup-oom_badness.patch
ok, no objection
The above "no objection" mean you can feel free to use my reviewed-by tag.
Thanks.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-24 1:09 ` KOSAKI Motohiro
@ 2010-05-24 7:07 ` Nick Piggin
2010-05-25 9:46 ` David Rientjes
2010-05-25 9:55 ` David Rientjes
1 sibling, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2010-05-24 7:07 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: David Rientjes, linux-mm, KAMEZAWA Hiroyuki
On Mon, May 24, 2010 at 10:09:34AM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > KOSAKI,
> >
> > I've been notified that my entire oom killer rewrite has been dropped from
> > -mm based solely on your feedback. The problem is that I have absolutely
> > no idea what issues you have with the changes that haven't already been
> > addressed (nobody else does, either, it seems).
I had exactly the same issues with the userland kernel API changes and
the pagefault OOM regression it introduced, which I told you months ago.
You ignored me, it seems.
>
> That's simple. A regression and an incompatibility are absolutely
> unacceptable. They should be removed. Your patches have some funny parts,
> but, afaik, nobody said funny requirement itself is wrong. They only said
> your requirement don't have to cause any pain to other users.
>
> Zero risk patches are always acceptable.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-20 0:27 ` KAMEZAWA Hiroyuki
@ 2010-05-25 9:42 ` David Rientjes
2010-05-26 0:17 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-25 9:42 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: KOSAKI Motohiro, linux-mm
On Thu, 20 May 2010, KAMEZAWA Hiroyuki wrote:
> I've pointed out that "normalized" parameter doesn't seem to work well in some
> situaion (in cluster). I hope you'll have an extra interface as
>
> echo 3G > /proc/<pid>/oom_indemification
>
> to allow users have "absolute value" setting.
> (If the admin know usual memory usage of an application, we can only
> add badness to extra memory usage.)
>
> To be honest, I can't fully understand why we need _normalized_ parameter. Why
> oom_adj _which is now used_ is not enough for setting "relative importance" ?
>
The only sane badness heuristic will be one that effectively compares all
eligible tasks for oom kill in a way that are relative to one another; I'm
concerned that a tunable that is based on a pure memory quantity requires
specific knowledge of the system (or memcg, cpuset, etc) capacity before
it is meaningful. In other words, I opted to use a relative proportion so
that when tasks are constrained to cpusets or memcgs or mempolicies they
become part of a "virtualized system" where the proportion is then used in
calculation of the total amount of system RAM, memcg limit, cpuset mems
capacities, etc, without knowledge of what that value actually is. So
"echo 3G" may be valid in your example when not constrained to any cgroup
or mempolicy but becomes invalid if I attach it to a cpuset with a single
node of 1G capacity. When oom_score_adj, we can specify the proportion
"of the resources that the application has access to" in comparison to
other applications that share those resources to determine oom killing
priority. I think that's a very powerful interface and your suggestion
could easily be implemented in userspace with a simple divide, thus we
don't need kernel support for it.
> And, IIRC, Nick pointed out that "don't remove _used_ interfaces just because
> you hate it or it seems not clean". So, I recommend you to drop sysctl changes.
>
I addressed this in
oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch so that the
end result was that only the oom_dump_tasks sysctl was removed because it
was now enabled by default since it provides useful information about the
state of the VM at the time of allocation failure. So there's no longer
any removal of "used" interfaces (the only previous use of oom_dump_tasks
was to enable it, which is now the default). Are you disagreeing with the
deprecation of the sysctl since it was then folded into the new
oom_dump_quick sysctl?
Regardless, I dropped all of these cleanups from my latest patch series
which I'll post shortly, but keep in mind that what existed in -mm before
it was dropped did not break any userspace API, it simply deprecated them
for removal in two years.
> I think the whole concept of your patch series is good and I like it.
Thanks, Kame, for your continued interest in this work, it's encouraging.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-24 7:07 ` Nick Piggin
@ 2010-05-25 9:46 ` David Rientjes
2010-05-25 10:05 ` Nick Piggin
0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-25 9:46 UTC (permalink / raw)
To: Nick Piggin; +Cc: KOSAKI Motohiro, linux-mm, KAMEZAWA Hiroyuki
On Mon, 24 May 2010, Nick Piggin wrote:
> > > I've been notified that my entire oom killer rewrite has been dropped from
> > > -mm based solely on your feedback. The problem is that I have absolutely
> > > no idea what issues you have with the changes that haven't already been
> > > addressed (nobody else does, either, it seems).
>
> I had exactly the same issues with the userland kernel API changes and
> the pagefault OOM regression it introduced, which I told you months ago.
> You ignored me, it seems.
>
No, I didn't ignore you, your comments were specifically addressed with
oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch which only
deprecated the API change and wasn't even scheduled for removal until of
the end of 2011. So there were no kernel API changes that went
unaddressed, perhaps you just didn't see that patch (I cc'd it to you on
April 27, though).
The pagefault oom behavior can now be changed back since you've converted
all existing architectures to call into the oom killer and not simply kill
current (thanks for that work!). Previously, there was an inconsistency
amongst architectures in panic_on_oom behavior that we can now unify into
semantics that work across the board.
I've made that change in my latest patch series which I'll be posting
shortly.
Thanks for the feedback!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-24 1:09 ` KOSAKI Motohiro
2010-05-24 7:07 ` Nick Piggin
@ 2010-05-25 9:55 ` David Rientjes
2010-05-26 0:02 ` David Rientjes
2010-05-28 5:25 ` KOSAKI Motohiro
1 sibling, 2 replies; 18+ messages in thread
From: David Rientjes @ 2010-05-25 9:55 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: linux-mm, KAMEZAWA Hiroyuki
On Mon, 24 May 2010, KOSAKI Motohiro wrote:
> > I've been notified that my entire oom killer rewrite has been dropped from
> > -mm based solely on your feedback. The problem is that I have absolutely
> > no idea what issues you have with the changes that haven't already been
> > addressed (nobody else does, either, it seems).
>
> That's simple. A regression and an incompatibility are absolutely
> unacceptable. They should be removed. Your patches have some funny parts,
> but, afaik, nobody said funny requirement itself is wrong. They only said
> your requirement don't have to cause any pain to other users.
>
> Zero risk patches are always acceptable.
>
When you see these "funny parts," please let me know what they are. The
were was no incompatibility issue after
oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch was merged,
the interface was simply deprecated. Arguing against the deprecation is
understandable and quite frankly something I'd like to avoid since it's
apparently hanging up the larger importance of the work, so I've dropped
the consolidation (and subsequent deprecation of oom_kill_allocating_task)
of the sysctls from my latest patch series.
> I've reviewed all of your patches. the result is here.
>
> > oom-filter-tasks-not-sharing-the-same-cpuset.patch
> ok, no objection.
> I'm still afraid this patch reinstanciate old bug. but at that time,
> we can drop it solely. this patch is enough bisectable.
>
> > oom-sacrifice-child-with-highest-badness-score-for-parent.patch
> ok, no objection.
> It's good patch.
>
> > oom-select-task-from-tasklist-for-mempolicy-ooms.patch
> ok, no objection.
>
> > oom-remove-special-handling-for-pagefault-ooms.patch
> ok, no objection.
>
> > oom-badness-heuristic-rewrite.patch
> No. All of rewrite is bad idea. Please make separate some
> individual patches.
> All rewrite thing break bisectability. Perhaps it can steal
> a lot of time from MM developers.
We've talked about that before, and I remember specifically addressing why
it couldn't be broken apart with any coherent understanding of what was
happening. I think the patchset itself was fairly well divided, but this
specific patch touches many different areas and function signatures but
are mainly localized to the oom killer.
> This patch have following parts.
> 1) Add oom_score_adj
A patch that only adds oom_score_adj but doesn't do anything else? It
can't be used with the current badness function, it requires the rewrite
of oom_badness().
> 2) OOM score normalization
I prefer to do that with the addition of oom_score_adj since that tunable
is meaninless until the score uses it.
> 3) forkbomb detector
Ok, I can seperate that out but that's only a small part of the overall
code. Are there specific issues you'd like to address with that now
instead of later?
> 4) oom_forkbomb_thres new knob
I'd prefer to keep the introduction of the sysctl, again, with the
addition of the functional code that uses it.
> 5) Root user get 3% bonus instead 400%
>
I don't understand this.
> all except (2) seems ok. but I'll review them again after separation.
> but you can't insert your copyright.
>
I can't add a copyright under the GPL for the new heuristic? Why?
> > oom-deprecate-oom_adj-tunable.patch
> NAK. you can't change userland use-case at all. This patch
> only makes bug report flood and streal our time.
>
It was Andrew's idea to deprecate this since the tunable works on a much
higher granularity than oom_score_adj. Andrew?
> > oom-replace-sysctls-with-quick-mode.patch
> NAK. To change sysctl makes confusion to userland.
> You have to prove such deprecated sysctl was alread unused.
> But the fact is, there is users. I have hear some times such
> use case and recent bug reporter said that's used.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=15058
>
Already dropped.
> > oom-avoid-oom-killer-for-lowmem-allocations.patch
> I don't like this one. 64bit arch have big (e.g. 2/4G)
> DMA_ZONE/DMA32_ZONE. So, if we create small guest kernel
> on KVM (or Xen), Killing processes may help. IOW, this
> one is conceptually good. but this check way is brutal.
>
It "may" help but has a significant probability of unnecessarily killing a
task that won't free any lowmem, so how would you suggest we modify the
oom killer to handle the allocation failure for GFP_DMA without negatively
impacting the system?
> but even though it's ok. Let's go merge it. this patch is
> enough small.
> If any problem is occur, we can revert this one easily.
>
Ok.
> > oom-remove-unnecessary-code-and-cleanup.patch
> ok, no objection.
>
> > oom-default-to-killing-current-for-pagefault-ooms.patch
> NAK.
> 1) this patch break panic_on_oom
> 2) At this merge window, Nick change almost all architecture's
> page hault handler. now almost all arch use
> pagefault_out_of_memory. your description has been a bit obsoleted.
>
Already changed, as previously mentioned in earlier posts.
> > oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
> So, Don't we need push his patch instead?
>
I think it all depends on the order in which this work is merged.
> > oom-hold-tasklist_lock-when-dumping-tasks.patch
> ok, no objection.
>
> > oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch
> ok, no objection.
>
> > oom-avoid-sending-exiting-tasks-a-sigkill.patch
> ok, no objection
>
> > oom-cleanup-oom_kill_task.patch
> ok, no objection
>
> > oom-cleanup-oom_badness.patch
> ok, no objection
>
> The above "no objection" mean you can feel free to use my reviewed-by tag.
>
Thanks for the detailed review, I look forward to your feedback when I
post the updated series.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-25 9:46 ` David Rientjes
@ 2010-05-25 10:05 ` Nick Piggin
2010-05-25 10:23 ` David Rientjes
0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2010-05-25 10:05 UTC (permalink / raw)
To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm, KAMEZAWA Hiroyuki
On Tue, May 25, 2010 at 02:46:06AM -0700, David Rientjes wrote:
> On Mon, 24 May 2010, Nick Piggin wrote:
>
> > > > I've been notified that my entire oom killer rewrite has been dropped from
> > > > -mm based solely on your feedback. The problem is that I have absolutely
> > > > no idea what issues you have with the changes that haven't already been
> > > > addressed (nobody else does, either, it seems).
> >
> > I had exactly the same issues with the userland kernel API changes and
> > the pagefault OOM regression it introduced, which I told you months ago.
> > You ignored me, it seems.
> >
>
> No, I didn't ignore you, your comments were specifically addressed with
> oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch which only
> deprecated the API change and wasn't even scheduled for removal until of
> the end of 2011. So there were no kernel API changes that went
OK, you still never justified why that change is needed, or why it
is even a cleanup at all. You need actually a *good* reason to change
the user kernel API. A slight difference in opinion of what the sysctls
should be, or a slight change in implementation in the kernel, is not
a good reason in the slightest.
Look at something like /proc/sys/fs/inode-state and dentry-state or
the old syscalls we accumulate.
The point about not many people using the parameters I don't think is
a good one. 2.6.32 is being used in the next enterprise kernels so they
are going to be in production for 5 or more years. How many people will
have written scripts by the time they upgrade?
> unaddressed, perhaps you just didn't see that patch (I cc'd it to you on
> April 27, though).
>
> The pagefault oom behavior can now be changed back since you've converted
> all existing architectures to call into the oom killer and not simply kill
> current (thanks for that work!). Previously, there was an inconsistency
> amongst architectures in panic_on_oom behavior that we can now unify into
> semantics that work across the board.
Thanks that would be good. I'll do another pass shortly to make sure
all archs are converted in this window if possible.
> I've made that change in my latest patch series which I'll be posting
> shortly.
The other thing is that it makes perfect sense to put controversial
changes on hold even if you still think you can make a case for them.
We could have already gotten *most* (and the most useful to you) of
it merged by now. Then if there is a single controversial patch rather
than a big series, Andrew or Linus say is much more likely to take a
look and weigh in.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-25 10:05 ` Nick Piggin
@ 2010-05-25 10:23 ` David Rientjes
2010-05-25 10:31 ` Nick Piggin
0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-25 10:23 UTC (permalink / raw)
To: Nick Piggin; +Cc: KOSAKI Motohiro, linux-mm, KAMEZAWA Hiroyuki
On Tue, 25 May 2010, Nick Piggin wrote:
> OK, you still never justified why that change is needed, or why it
> is even a cleanup at all. You need actually a *good* reason to change
> the user kernel API. A slight difference in opinion of what the sysctls
> should be, or a slight change in implementation in the kernel, is not
> a good reason in the slightest.
>
My personal opinion (and it's no longer what I'm advocating since I'm
hoping that the much larger importance of this patchset can now be
realized without this sideshow) is that if we have the opportunity to
consolidate two virtually unused sysctls into one to cleanup procfs and
make extending the oom killer easier in the future without the need to add
additional sysctls for systems that cannot afford a tasklist scan that we
should do it. I didn't think anyone was using them (nobody can be cited
as using them) and I felt the two year deprecation period was enough to be
able to cleanup the ever-growing list of VM sysctls as well as making it
easier to extend in the future by adding a sysctl that had semantics
specifically directed to its target audience.
> The point about not many people using the parameters I don't think is
> a good one. 2.6.32 is being used in the next enterprise kernels so they
> are going to be in production for 5 or more years. How many people will
> have written scripts by the time they upgrade?
>
If nothing else, this solidifies in my mind the notion that once a VM
sysctl is added, it can never be removed. I regret adding
oom_kill_allocating_task at the mere suggestion of SGI when I made cpuset
ooms do a tasklist scan without developing a more extendable interface for
those large systems that we could piggyback on later for the same reasons
(we'd now need to disable oom_dump_tasks once it's the default for those
systems). I felt the deprecation and removal would eventually be
sufficient, but I'm no longer going to push it because it's sidetracking a
much larger and seperate effort.
> > I've made that change in my latest patch series which I'll be posting
> > shortly.
>
> The other thing is that it makes perfect sense to put controversial
> changes on hold even if you still think you can make a case for them.
I personally like consistency even amongst architectures when it comes to
the semantics of sysctls, so the pagefault oom handling changes were
merely for compatibility until all architectures were converted to using
the oom killer. You've done that work, so I no longer have a problem with
panicking when the pagefault handler is called.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-25 10:23 ` David Rientjes
@ 2010-05-25 10:31 ` Nick Piggin
0 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2010-05-25 10:31 UTC (permalink / raw)
To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm, KAMEZAWA Hiroyuki
On Tue, May 25, 2010 at 03:23:01AM -0700, David Rientjes wrote:
> On Tue, 25 May 2010, Nick Piggin wrote:
>
> > OK, you still never justified why that change is needed, or why it
> > is even a cleanup at all. You need actually a *good* reason to change
> > the user kernel API. A slight difference in opinion of what the sysctls
> > should be, or a slight change in implementation in the kernel, is not
> > a good reason in the slightest.
> >
>
> My personal opinion (and it's no longer what I'm advocating since I'm
> hoping that the much larger importance of this patchset can now be
> realized without this sideshow) is that if we have the opportunity to
> consolidate two virtually unused sysctls into one to cleanup procfs and
> make extending the oom killer easier in the future without the need to add
> additional sysctls for systems that cannot afford a tasklist scan that we
> should do it. I didn't think anyone was using them (nobody can be cited
> as using them) and I felt the two year deprecation period was enough to be
> able to cleanup the ever-growing list of VM sysctls as well as making it
> easier to extend in the future by adding a sysctl that had semantics
> specifically directed to its target audience.
>
> > The point about not many people using the parameters I don't think is
> > a good one. 2.6.32 is being used in the next enterprise kernels so they
> > are going to be in production for 5 or more years. How many people will
> > have written scripts by the time they upgrade?
> >
>
> If nothing else, this solidifies in my mind the notion that once a VM
> sysctl is added, it can never be removed. I regret adding
> oom_kill_allocating_task at the mere suggestion of SGI when I made cpuset
> ooms do a tasklist scan without developing a more extendable interface for
> those large systems that we could piggyback on later for the same reasons
> (we'd now need to disable oom_dump_tasks once it's the default for those
> systems). I felt the deprecation and removal would eventually be
> sufficient, but I'm no longer going to push it because it's sidetracking a
> much larger and seperate effort.
Thanks :) It's unfortunate, but if it is no real burden to carry it
then we really should. It's easy to get APIs wrong, we can't blame
the person proposing the API because everyone makes wrong or short
sighted choices.
What is needed is more careful flagging and review of API additions
and changes.
> > > I've made that change in my latest patch series which I'll be posting
> > > shortly.
> >
> > The other thing is that it makes perfect sense to put controversial
> > changes on hold even if you still think you can make a case for them.
>
> I personally like consistency even amongst architectures when it comes to
> the semantics of sysctls, so the pagefault oom handling changes were
> merely for compatibility until all architectures were converted to using
> the oom killer. You've done that work, so I no longer have a problem with
> panicking when the pagefault handler is called.
Fair enough. I should have just shut up and fixed my half done changes
rather than arguing :) So apologies for that!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-25 9:55 ` David Rientjes
@ 2010-05-26 0:02 ` David Rientjes
2010-05-28 5:27 ` KOSAKI Motohiro
2010-05-28 5:25 ` KOSAKI Motohiro
1 sibling, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-26 0:02 UTC (permalink / raw)
To: KOSAKI Motohiro, Oleg Nesterov; +Cc: linux-mm, KAMEZAWA Hiroyuki
On Tue, 25 May 2010, David Rientjes wrote:
> > > oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> > no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
> > So, Don't we need push his patch instead?
> >
>
> I think it all depends on the order in which this work is merged.
>
I just noticed that Oleg's patches were dropped as well from -mm so I'll
merge them into my set and repost them as well.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-25 9:42 ` David Rientjes
@ 2010-05-26 0:17 ` KAMEZAWA Hiroyuki
2010-05-26 1:40 ` David Rientjes
0 siblings, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-26 0:17 UTC (permalink / raw)
To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm
On Tue, 25 May 2010 02:42:14 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Thu, 20 May 2010, KAMEZAWA Hiroyuki wrote:
>
> > I've pointed out that "normalized" parameter doesn't seem to work well in some
> > situaion (in cluster). I hope you'll have an extra interface as
> >
> > echo 3G > /proc/<pid>/oom_indemification
> >
> > to allow users have "absolute value" setting.
> > (If the admin know usual memory usage of an application, we can only
> > add badness to extra memory usage.)
> >
> > To be honest, I can't fully understand why we need _normalized_ parameter. Why
> > oom_adj _which is now used_ is not enough for setting "relative importance" ?
> >
>
> The only sane badness heuristic will be one that effectively compares all
> eligible tasks for oom kill in a way that are relative to one another; I'm
> concerned that a tunable that is based on a pure memory quantity requires
> specific knowledge of the system (or memcg, cpuset, etc) capacity before
> it is meaningful. In other words, I opted to use a relative proportion so
> that when tasks are constrained to cpusets or memcgs or mempolicies they
> become part of a "virtualized system" where the proportion is then used in
> calculation of the total amount of system RAM, memcg limit, cpuset mems
> capacities, etc, without knowledge of what that value actually is. So
> "echo 3G" may be valid in your example when not constrained to any cgroup
> or mempolicy but becomes invalid if I attach it to a cpuset with a single
> node of 1G capacity. When oom_score_adj, we can specify the proportion
> "of the resources that the application has access to" in comparison to
> other applications that share those resources to determine oom killing
> priority. I think that's a very powerful interface and your suggestion
> could easily be implemented in userspace with a simple divide, thus we
> don't need kernel support for it.
>
I know admins will be able to write a script. But, my point is
"please don't force admins to write such a hacky scripts."
For example, an admin uses an application which always use 3G bytes adn it's
valid and sane use for the application. When he run it on a server with
4G system and 8G system, he has to change the value for oom_score_adj.
One good point of old oom_adj is that it's not influenced by environment.
Then, X-window applications set it's oom_adj to be fixed value.
IIUC, they're hardcoded with fixed value, now.
Even if my customer may use only OOM_DISABLE, I think using oom_score_adj
is too difficult for usual users.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-26 0:17 ` KAMEZAWA Hiroyuki
@ 2010-05-26 1:40 ` David Rientjes
2010-05-26 2:00 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-05-26 1:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: KOSAKI Motohiro, linux-mm
On Wed, 26 May 2010, KAMEZAWA Hiroyuki wrote:
> > The only sane badness heuristic will be one that effectively compares all
> > eligible tasks for oom kill in a way that are relative to one another; I'm
> > concerned that a tunable that is based on a pure memory quantity requires
> > specific knowledge of the system (or memcg, cpuset, etc) capacity before
> > it is meaningful. In other words, I opted to use a relative proportion so
> > that when tasks are constrained to cpusets or memcgs or mempolicies they
> > become part of a "virtualized system" where the proportion is then used in
> > calculation of the total amount of system RAM, memcg limit, cpuset mems
> > capacities, etc, without knowledge of what that value actually is. So
> > "echo 3G" may be valid in your example when not constrained to any cgroup
> > or mempolicy but becomes invalid if I attach it to a cpuset with a single
> > node of 1G capacity. When oom_score_adj, we can specify the proportion
> > "of the resources that the application has access to" in comparison to
> > other applications that share those resources to determine oom killing
> > priority. I think that's a very powerful interface and your suggestion
> > could easily be implemented in userspace with a simple divide, thus we
> > don't need kernel support for it.
> >
> I know admins will be able to write a script. But, my point is
> "please don't force admins to write such a hacky scripts."
>
It's not necessarily the memory quantity that is interesting in this case
(or proportion of available memory), it's how the badness() score is
altered relative to other eligible tasks that end up changing the oom kill
priority list. If we were to implement a tunable that only took a memory
quantity, it would require specific knowledge of the system's capacity to
make any sense compared to other tasks. An oom_score_adj of 125MB means
vastly different things on a 4GB system compared to 64GB system and admins
do not want to update their script anytime they add (or hotadd) memory or
run on a variety of systems that don't have the same capacities. What
they want to do is specify the priority of an application either to prefer
it or protect it from oom kill by saying "this application can use 25%
more memory available to it than everything else" or "this application
should be killed if it's not leaving at least 25% to everything else."
> For example, an admin uses an application which always use 3G bytes adn it's
> valid and sane use for the application. When he run it on a server with
> 4G system and 8G system, he has to change the value for oom_score_adj.
>
That's the same if you were to implement a memory quantity instead of a
proportion for oom_score_adj and depends on how you want to protect or
prefer that application. For a 3G application on a 4G machine, an
oom_score_adj of 250 is legitimate if you want to ensure it never uses
more than 3G and is always killed first when it does. For the 8G machine,
you can't make the same killing choice if another instance of the same
application is using 5G instead of 3G. See the difference? In that case,
it may not be the correct choice for oom kill and we should kill something
else: the 5G memory leaker. That requires userspace intervention to
identify, but unless we mandate the expected memory use is spelled out for
every single application (which we can't), there's no way to use a fixed
memory quantity to determine relative priority.
If you really did always want to kill that 3G task, an oom_score_adj value
of +1000 would always work just like a value of +15 does for oom_adj.
> One good point of old oom_adj is that it's not influenced by environment.
> Then, X-window applications set it's oom_adj to be fixed value.
> IIUC, they're hardcoded with fixed value, now.
>
It _is_ influenced by environment, just indirectly. It's a bitshift on
the badness() score so for any other usecase other than a complete
polarization of the score to either always prefer or completely disable
oom killing for a task, it's practically useless. The bitshift will
increase or decrease the score but that score will be ranked according to
the scores of other tasks on the system. So if a task consuming 400K of
memory has a badness score of 100 with an oom_adj value of +10, the end
result is a score of 102400 which would represent about 10% of system
memory on a 4G system but about 1.5% of system memory on a 64GB system.
So the actual preference of a task, minus the usecase of polarizing the
task with oom_adj, is completely dependent on the size of system RAM.
oom_adj must also be altered anytime a task is attached to a cpuset or
memcg (or even mempolicy now) since its effect on badness will skew how
the score is compared relative to all other tasks in that cpuset, memcg,
or attached to the mempolicy nodes.
> Even if my customer may use only OOM_DISABLE, I think using oom_score_adj
> is too difficult for usual users.
>
How is oom_score_adj, which has a well-defined unit of "proportion of
available memory" more difficult to use than oom_adj, which exponentially
increases a tasks badness score with no units other than "badness()
quantum"? The typical use case for these users is simply to polarize the
score, anyway, which is still possible with oom_score_adj: oom_score_adj
of -1000 is equivalent oom_adj of -17 and oom_score_adj of +1000 is
equivalent to oom_adj of +15. That conversion is done automatically for
existing users of oom_adj, so I find oom_score_adj to be much more
predicatable and scalable than oom_adj.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-26 1:40 ` David Rientjes
@ 2010-05-26 2:00 ` KAMEZAWA Hiroyuki
2010-05-26 3:26 ` David Rientjes
0 siblings, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-26 2:00 UTC (permalink / raw)
To: David Rientjes; +Cc: KOSAKI Motohiro, linux-mm
On Tue, 25 May 2010 18:40:36 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Wed, 26 May 2010, KAMEZAWA Hiroyuki wrote:
>
> > > The only sane badness heuristic will be one that effectively compares all
> > > eligible tasks for oom kill in a way that are relative to one another; I'm
> > > concerned that a tunable that is based on a pure memory quantity requires
> > > specific knowledge of the system (or memcg, cpuset, etc) capacity before
> > > it is meaningful. In other words, I opted to use a relative proportion so
> > > that when tasks are constrained to cpusets or memcgs or mempolicies they
> > > become part of a "virtualized system" where the proportion is then used in
> > > calculation of the total amount of system RAM, memcg limit, cpuset mems
> > > capacities, etc, without knowledge of what that value actually is. So
> > > "echo 3G" may be valid in your example when not constrained to any cgroup
> > > or mempolicy but becomes invalid if I attach it to a cpuset with a single
> > > node of 1G capacity. When oom_score_adj, we can specify the proportion
> > > "of the resources that the application has access to" in comparison to
> > > other applications that share those resources to determine oom killing
> > > priority. I think that's a very powerful interface and your suggestion
> > > could easily be implemented in userspace with a simple divide, thus we
> > > don't need kernel support for it.
> > >
> > I know admins will be able to write a script. But, my point is
> > "please don't force admins to write such a hacky scripts."
> >
>
> It's not necessarily the memory quantity that is interesting in this case
> (or proportion of available memory), it's how the badness() score is
> altered relative to other eligible tasks that end up changing the oom kill
> priority list. If we were to implement a tunable that only took a memory
> quantity, it would require specific knowledge of the system's capacity to
> make any sense compared to other tasks. An oom_score_adj of 125MB means
> vastly different things on a 4GB system compared to 64GB system and admins
> do not want to update their script anytime they add (or hotadd) memory or
> run on a variety of systems that don't have the same capacities.
IMHO, importance of application is consistent under all hosts in the system.
(the system here means a system maintained by a team of admins to do a service.)
It's not be influenced by the amount of memory, other applications, etc..
If influenced, it's a chaos for admins.
It seems that's fundamental difference in ideas among you and me.
> > For example, an admin uses an application which always use 3G bytes adn it's
> > valid and sane use for the application. When he run it on a server with
> > 4G system and 8G system, he has to change the value for oom_score_adj.
> >
>
> That's the same if you were to implement a memory quantity instead of a
> proportion for oom_score_adj and depends on how you want to protect or
> prefer that application. For a 3G application on a 4G machine, an
> oom_score_adj of 250 is legitimate if you want to ensure it never uses
> more than 3G and is always killed first when it does. For the 8G machine,
> you can't make the same killing choice if another instance of the same
> application is using 5G instead of 3G. See the difference? In that case,
> it may not be the correct choice for oom kill and we should kill something
> else: the 5G memory leaker. That requires userspace intervention to
> identify, but unless we mandate the expected memory use is spelled out for
> every single application (which we can't), there's no way to use a fixed
> memory quantity to determine relative priority.
>
I just don't believe relative priority ;)
Then, my customer will just disable oom or will use panic_on_oom.
That's why I wrote don't take my words serious.
I wonder if people wants precise control of oom_score_adj, they should
use memcg and put apps into containers. In that case, static priority
and will be useful.
> If you really did always want to kill that 3G task, an oom_score_adj value
> of +1000 would always work just like a value of +15 does for oom_adj.
>
> > One good point of old oom_adj is that it's not influenced by environment.
> > Then, X-window applications set it's oom_adj to be fixed value.
> > IIUC, they're hardcoded with fixed value, now.
> >
>
> It _is_ influenced by environment, just indirectly. It's a bitshift on
> the badness() score so for any other usecase other than a complete
> polarization of the score to either always prefer or completely disable
> oom killing for a task, it's practically useless. The bitshift will
> increase or decrease the score but that score will be ranked according to
> the scores of other tasks on the system. So if a task consuming 400K of
> memory has a badness score of 100 with an oom_adj value of +10, the end
> result is a score of 102400 which would represent about 10% of system
> memory on a 4G system but about 1.5% of system memory on a 64GB system.
> So the actual preference of a task, minus the usecase of polarizing the
> task with oom_adj, is completely dependent on the size of system RAM.
>
> oom_adj must also be altered anytime a task is attached to a cpuset or
> memcg (or even mempolicy now) since its effect on badness will skew how
> the score is compared relative to all other tasks in that cpuset, memcg,
> or attached to the mempolicy nodes.
>
I agree that oom_score_adj is better than current oom_adj.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-26 2:00 ` KAMEZAWA Hiroyuki
@ 2010-05-26 3:26 ` David Rientjes
0 siblings, 0 replies; 18+ messages in thread
From: David Rientjes @ 2010-05-26 3:26 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: KOSAKI Motohiro, linux-mm
On Wed, 26 May 2010, KAMEZAWA Hiroyuki wrote:
> > It's not necessarily the memory quantity that is interesting in this case
> > (or proportion of available memory), it's how the badness() score is
> > altered relative to other eligible tasks that end up changing the oom kill
> > priority list. If we were to implement a tunable that only took a memory
> > quantity, it would require specific knowledge of the system's capacity to
> > make any sense compared to other tasks. An oom_score_adj of 125MB means
> > vastly different things on a 4GB system compared to 64GB system and admins
> > do not want to update their script anytime they add (or hotadd) memory or
> > run on a variety of systems that don't have the same capacities.
>
> IMHO, importance of application is consistent under all hosts in the system.
> (the system here means a system maintained by a team of admins to do a service.)
>
And it's consistent whether it's unbound to any cgroup, it's attached to a
cpuset, a memcg, or has a mempolicy restriction. Its importance in those
"virtualzied environments" is the same when shared with other tasks,
that's why proportions work well: I can decide that my application should
be targeted first by the oom killer when it is using more than 25% of
system memory, for example. When I attach that task to a cpuset, the
priority is the same, I've just adjusted the amount of available memory
out from under a set of tasks. The point is that the script or admin that
sets oom_score_adj need not know what resources the application has, but
rather its memory expectations relative to other applications that share
the same resources.
> It's not be influenced by the amount of memory, other applications, etc..
> If influenced, it's a chaos for admins.
> It seems that's fundamental difference in ideas among you and me.
>
Other than polarizing tasks with oom_score_adj of -1000 or +1000, you must
consider the relative importance of an application to other applications,
that's the point of adjusting badness scoring: so we can influence which
task is killed by the kernel instead of simply the task that is most
memory hogging. When an application is using more memory than desired (or
expected), the cost of business would mandate that the task is killed and
that threshold is definable in clear units from userspace via
oom_score_adj and not oom_adj.
> > That's the same if you were to implement a memory quantity instead of a
> > proportion for oom_score_adj and depends on how you want to protect or
> > prefer that application. For a 3G application on a 4G machine, an
> > oom_score_adj of 250 is legitimate if you want to ensure it never uses
> > more than 3G and is always killed first when it does. For the 8G machine,
> > you can't make the same killing choice if another instance of the same
> > application is using 5G instead of 3G. See the difference? In that case,
> > it may not be the correct choice for oom kill and we should kill something
> > else: the 5G memory leaker. That requires userspace intervention to
> > identify, but unless we mandate the expected memory use is spelled out for
> > every single application (which we can't), there's no way to use a fixed
> > memory quantity to determine relative priority.
> >
>
> I just don't believe relative priority ;)
If application A typically consumes 6G on an 8G system and that's an
important task, it's possible to protect it from being oom killed as the
result of the memory usage of the remaining applications B and C with
oom_score_adj; oom_adj does not have a clear interface for being able to
define when to kill A when it uses more than 7G but not to kill it when it
uses 6G. That's the motivation behind developing a relative priority like
oom_score_adj.
> That's why I wrote don't take my words serious.
> I wonder if people wants precise control of oom_score_adj, they should
> use memcg and put apps into containers. In that case, static priority
> and will be useful.
>
Indeed, using a hierarchy of memcgs is another way to do the same thing
but seems like overkill if I'm running only a webserver and a few
monitoring applications. oom_score_adj also applies to cpusets and
mempolicies.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-25 9:55 ` David Rientjes
2010-05-26 0:02 ` David Rientjes
@ 2010-05-28 5:25 ` KOSAKI Motohiro
2010-06-01 7:30 ` David Rientjes
1 sibling, 1 reply; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-05-28 5:25 UTC (permalink / raw)
To: David Rientjes; +Cc: kosaki.motohiro, linux-mm, KAMEZAWA Hiroyuki
Hi
> On Mon, 24 May 2010, KOSAKI Motohiro wrote:
>
> > > I've been notified that my entire oom killer rewrite has been dropped from
> > > -mm based solely on your feedback. The problem is that I have absolutely
> > > no idea what issues you have with the changes that haven't already been
> > > addressed (nobody else does, either, it seems).
> >
> > That's simple. A regression and an incompatibility are absolutely
> > unacceptable. They should be removed. Your patches have some funny parts,
> > but, afaik, nobody said funny requirement itself is wrong. They only said
> > your requirement don't have to cause any pain to other users.
> >
> > Zero risk patches are always acceptable.
> >
>
> When you see these "funny parts," please let me know what they are. The
> were was no incompatibility issue after
> oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch was merged,
> the interface was simply deprecated. Arguing against the deprecation is
> understandable and quite frankly something I'd like to avoid since it's
> apparently hanging up the larger importance of the work, so I've dropped
> the consolidation (and subsequent deprecation of oom_kill_allocating_task)
> of the sysctls from my latest patch series.
That's said, don't deprecated current interface. Other MM developers makes
effort to reduce a number of oom bug report. I don't hope you run just opposite
direction.
> > I've reviewed all of your patches. the result is here.
> >
> > > oom-filter-tasks-not-sharing-the-same-cpuset.patch
> > ok, no objection.
> > I'm still afraid this patch reinstanciate old bug. but at that time,
> > we can drop it solely. this patch is enough bisectable.
> >
> > > oom-sacrifice-child-with-highest-badness-score-for-parent.patch
> > ok, no objection.
> > It's good patch.
> >
> > > oom-select-task-from-tasklist-for-mempolicy-ooms.patch
> > ok, no objection.
> >
> > > oom-remove-special-handling-for-pagefault-ooms.patch
> > ok, no objection.
> >
> > > oom-badness-heuristic-rewrite.patch
> > No. All of rewrite is bad idea. Please make separate some
> > individual patches.
> > All rewrite thing break bisectability. Perhaps it can steal
> > a lot of time from MM developers.
>
> We've talked about that before, and I remember specifically addressing why
> it couldn't be broken apart with any coherent understanding of what was
> happening. I think the patchset itself was fairly well divided, but this
> specific patch touches many different areas and function signatures but
> are mainly localized to the oom killer.
Heh, that's ok.
I'll merge apart of this one If you can't. The rule is simple, rewrite
all patches will never merge. but ok too. you can choice no merge.
> > This patch have following parts.
> > 1) Add oom_score_adj
>
> A patch that only adds oom_score_adj but doesn't do anything else? It
> can't be used with the current badness function, it requires the rewrite
> of oom_badness().
ok. you can drop oom_score_adj too.
> > 2) OOM score normalization
>
> I prefer to do that with the addition of oom_score_adj since that tunable
> is meaninless until the score uses it.
No. This one have no justification. BAD IDEA.
Any core heuristic change need to prove to improve desktop use case.
That's said, now lkml have one or two oom bug report per month. We have
to make effort to reduce it. Please don't append new confusion source.
> > 3) forkbomb detector
>
> Ok, I can seperate that out but that's only a small part of the overall
> code. Are there specific issues you'd like to address with that now
> instead of later?
reviewability and bisectability are one of most important issue. that's all.
> > 4) oom_forkbomb_thres new knob
>
> I'd prefer to keep the introduction of the sysctl, again, with the
> addition of the functional code that uses it.
I'm not against new knob. I only request separation.
> > 5) Root user get 3% bonus instead 400%
>
> I don't understand this.
Now, our oom have "if (root-user) points /= 4" logic, I wrote it as 400%.
> > all except (2) seems ok. but I'll review them again after separation.
> > but you can't insert your copyright.
> >
>
> I can't add a copyright under the GPL for the new heuristic? Why?
1) too small work
2) In this area, almost work had been lead to kamezawa-san. you don't have
proper right.
(1) mean other people of joining this improvement can't append it too.
> > > oom-deprecate-oom_adj-tunable.patch
> > NAK. you can't change userland use-case at all. This patch
> > only makes bug report flood and streal our time.
>
> It was Andrew's idea to deprecate this since the tunable works on a much
> higher granularity than oom_score_adj. Andrew?
If we can create really better new knob, the end users naturally migrate
to use new one. thus, we can remove older one without pain.
again, we still have a bit high bug report rate in oom area. Please don't
increase it.
> > > oom-replace-sysctls-with-quick-mode.patch
> > NAK. To change sysctl makes confusion to userland.
> > You have to prove such deprecated sysctl was alread unused.
> > But the fact is, there is users. I have hear some times such
> > use case and recent bug reporter said that's used.
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=15058
> >
>
> Already dropped.
thanks.
> > > oom-avoid-oom-killer-for-lowmem-allocations.patch
> > I don't like this one. 64bit arch have big (e.g. 2/4G)
> > DMA_ZONE/DMA32_ZONE. So, if we create small guest kernel
> > on KVM (or Xen), Killing processes may help. IOW, this
> > one is conceptually good. but this check way is brutal.
>
> It "may" help but has a significant probability of unnecessarily killing a
> task that won't free any lowmem, so how would you suggest we modify the
> oom killer to handle the allocation failure for GFP_DMA without negatively
> impacting the system?
I prefer to check numbers of anon pages in the zone. I mean we want
"skip oom if the zone have no freeable memory", thus just do it.
But again, I'm ok yours too.
>
> > but even though it's ok. Let's go merge it. this patch is
> > enough small.
> > If any problem is occur, we can revert this one easily.
> >
>
> Ok.
>
> > > oom-remove-unnecessary-code-and-cleanup.patch
> > ok, no objection.
> >
> > > oom-default-to-killing-current-for-pagefault-ooms.patch
> > NAK.
> > 1) this patch break panic_on_oom
> > 2) At this merge window, Nick change almost all architecture's
> > page hault handler. now almost all arch use
> > pagefault_out_of_memory. your description has been a bit obsoleted.
> >
>
> Already changed, as previously mentioned in earlier posts.
ok, thanks.
>
> > > oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> > no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
> > So, Don't we need push his patch instead?
> >
>
> I think it all depends on the order in which this work is merged.
>
> > > oom-hold-tasklist_lock-when-dumping-tasks.patch
> > ok, no objection.
> >
> > > oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch
> > ok, no objection.
> >
> > > oom-avoid-sending-exiting-tasks-a-sigkill.patch
> > ok, no objection
> >
> > > oom-cleanup-oom_kill_task.patch
> > ok, no objection
> >
> > > oom-cleanup-oom_badness.patch
> > ok, no objection
> >
> > The above "no objection" mean you can feel free to use my reviewed-by tag.
> >
>
> Thanks for the detailed review, I look forward to your feedback when I
> post the updated series.
I'm thanks too.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-26 0:02 ` David Rientjes
@ 2010-05-28 5:27 ` KOSAKI Motohiro
0 siblings, 0 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-05-28 5:27 UTC (permalink / raw)
To: David Rientjes
Cc: kosaki.motohiro, Oleg Nesterov, linux-mm, KAMEZAWA Hiroyuki
> On Tue, 25 May 2010, David Rientjes wrote:
>
> > > > oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> > > no objection. but afaik Oleg already pointed out "if (!p->mm)" is bad.
> > > So, Don't we need push his patch instead?
> > >
> >
> > I think it all depends on the order in which this work is merged.
> >
>
> I just noticed that Oleg's patches were dropped as well from -mm so I'll
> merge them into my set and repost them as well.
Oops, no. It shouldn't.
His patch is important and should be merged ASAP. I believe standalone merge is best.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: oom killer rewrite
2010-05-28 5:25 ` KOSAKI Motohiro
@ 2010-06-01 7:30 ` David Rientjes
0 siblings, 0 replies; 18+ messages in thread
From: David Rientjes @ 2010-06-01 7:30 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: linux-mm, KAMEZAWA Hiroyuki
On Fri, 28 May 2010, KOSAKI Motohiro wrote:
> > When you see these "funny parts," please let me know what they are. The
> > were was no incompatibility issue after
> > oom-reintroduce-and-deprecate-oom_kill_allocating_task.patch was merged,
> > the interface was simply deprecated. Arguing against the deprecation is
> > understandable and quite frankly something I'd like to avoid since it's
> > apparently hanging up the larger importance of the work, so I've dropped
> > the consolidation (and subsequent deprecation of oom_kill_allocating_task)
> > of the sysctls from my latest patch series.
>
> That's said, don't deprecated current interface. Other MM developers makes
> effort to reduce a number of oom bug report. I don't hope you run just opposite
> direction.
>
It's dropped, as I said above.
> > > > oom-badness-heuristic-rewrite.patch
> > > No. All of rewrite is bad idea. Please make separate some
> > > individual patches.
> > > All rewrite thing break bisectability. Perhaps it can steal
> > > a lot of time from MM developers.
> >
> > We've talked about that before, and I remember specifically addressing why
> > it couldn't be broken apart with any coherent understanding of what was
> > happening. I think the patchset itself was fairly well divided, but this
> > specific patch touches many different areas and function signatures but
> > are mainly localized to the oom killer.
>
> Heh, that's ok.
> I'll merge apart of this one If you can't. The rule is simple, rewrite
> all patches will never merge. but ok too. you can choice no merge.
>
You can't break this patch apart functionally into anything that's
meaninful, it doesn't help to go around changing function signatures in
one patch when the arguments are left unused, or adding sysctls in one
patch when its unused, etc. I gave a tip for reviewing of this particular
patch since all of the changes are isolated in oom_badness(): merge the
patch into your own tree and review the heuristic.
> > > This patch have following parts.
> > > 1) Add oom_score_adj
> >
> > A patch that only adds oom_score_adj but doesn't do anything else? It
> > can't be used with the current badness function, it requires the rewrite
> > of oom_badness().
>
> ok. you can drop oom_score_adj too.
>
I don't know what this means. oom_score_adj is the new tunable that is in
the units that the new heuristic uses, they obviously need to be merged
together.
> > > 2) OOM score normalization
> >
> > I prefer to do that with the addition of oom_score_adj since that tunable
> > is meaninless until the score uses it.
>
> No. This one have no justification. BAD IDEA.
> Any core heuristic change need to prove to improve desktop use case.
>
This entire rewrite was largely based on improving the desktop use case,
the new heuristic doesn't kill KDE when you fork a large memory-hogging
process on the desktop when it currently does.
> That's said, now lkml have one or two oom bug report per month. We have
> to make effort to reduce it. Please don't append new confusion source.
>
My heuristic is far superior to the current heuristic and this introduces
a new tunable that allows userspace to tune oom killer prioritization in
units that admins understand and in a linear way, not exponentially on the
total VM size that the current heuristic does.
> > > 3) forkbomb detector
> >
> > Ok, I can seperate that out but that's only a small part of the overall
> > code. Are there specific issues you'd like to address with that now
> > instead of later?
>
> reviewability and bisectability are one of most important issue. that's all.
>
This is done and has been proposed as patch 09/18 in my latest posting,
thanks.
> > > 5) Root user get 3% bonus instead 400%
> >
> > I don't understand this.
>
> Now, our oom have "if (root-user) points /= 4" logic, I wrote it as 400%.
>
This bias should be equivalent to that given in __vm_enough_memory() as
specified in the patch description.
> > I can't add a copyright under the GPL for the new heuristic? Why?
>
> 1) too small work
14 files changed, 680 insertions(+), 332 deletions(-)
That's too small?
> 2) In this area, almost work had been lead to kamezawa-san. you don't have
> proper right.
>
Absolutely ridiculous, the only thing that this entire patchset builds
upon is the use of MM_SWAPENTS which was introduced to be exported via
/proc/pid/status.
> (1) mean other people of joining this improvement can't append it too.
>
I don't know what this means.
If you object to adding Google's copyright to this work, please merge that
as an additional patch and submit it to Andrew along with my nacked-by.
He can remove Google's copyright to the rewrite if he'd like.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2010-06-01 7:30 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-19 22:14 oom killer rewrite David Rientjes
2010-05-20 0:27 ` KAMEZAWA Hiroyuki
2010-05-25 9:42 ` David Rientjes
2010-05-26 0:17 ` KAMEZAWA Hiroyuki
2010-05-26 1:40 ` David Rientjes
2010-05-26 2:00 ` KAMEZAWA Hiroyuki
2010-05-26 3:26 ` David Rientjes
2010-05-24 1:09 ` KOSAKI Motohiro
2010-05-24 7:07 ` Nick Piggin
2010-05-25 9:46 ` David Rientjes
2010-05-25 10:05 ` Nick Piggin
2010-05-25 10:23 ` David Rientjes
2010-05-25 10:31 ` Nick Piggin
2010-05-25 9:55 ` David Rientjes
2010-05-26 0:02 ` David Rientjes
2010-05-28 5:27 ` KOSAKI Motohiro
2010-05-28 5:25 ` KOSAKI Motohiro
2010-06-01 7:30 ` David Rientjes
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).