* memcg: fix fatal livelock in kswapd
  2011-05-02 20:07 UTC
  From: James Bottomley
  To: Chris Mason, linux-fsdevel, linux-mm, linux-kernel, Paul Menage

The fatal livelock in kswapd, reported in this thread:

	http://marc.info/?t=130392066000001

is mitigable if we prevent the cgroups code from being so aggressive in
its zone shrinking (by raising its default shrink priority from 0
[scan everything] to DEF_PRIORITY [scan a small fraction]). This will
have an obvious knock-on effect on cgroup accounting, but that is
better than hanging systems.

Signed-off-by: James Bottomley <James.Bottomley@suse.de>

---

From 74b62fc417f07e1411d98181631e4e097c8e3e68 Mon Sep 17 00:00:00 2001
From: James Bottomley <James.Bottomley@HansenPartnership.com>
Date: Mon, 2 May 2011 14:56:29 -0500
Subject: [PATCH] vmscan: move containers scan back to default priority

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6b435c..46cde92 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2173,8 +2173,12 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 	 * if we don't reclaim here, the shrink_zone from balance_pgdat
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
+	 *
+	 * FIXME: jejb: zero here was causing a livelock in the
+	 * shrinker so changed to DEF_PRIORITY to fix this. Now need to
+	 * sort out cgroup accounting.
 	 */
-	shrink_zone(0, zone, &sc);
+	shrink_zone(DEF_PRIORITY, zone, &sc);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
* Re: memcg: fix fatal livelock in kswapd
  2011-05-02 22:48 UTC
  From: Johannes Weiner
  To: James Bottomley
  Cc: Chris Mason, linux-fsdevel, linux-mm, linux-kernel, Paul Menage,
      Li Zefan, containers, Balbir Singh

Hi,

On Mon, May 02, 2011 at 03:07:29PM -0500, James Bottomley wrote:
> The fatal livelock in kswapd, reported in this thread:
>
> 	http://marc.info/?t=130392066000001
>
> is mitigable if we prevent the cgroups code from being so aggressive
> in its zone shrinking (by raising its default shrink priority from 0
> [scan everything] to DEF_PRIORITY [scan a small fraction]). This
> will have an obvious knock-on effect on cgroup accounting, but that
> is better than hanging systems.

Actually, it's not that obvious. At least not to me. I added Balbir,
who added said comment and code in the first place, to CC. Here is the
comment in full quote:

	/*
	 * NOTE: Although we can get the priority field, using it
	 * here is not a good idea, since it limits the pages we can scan.
	 * if we don't reclaim here, the shrink_zone from balance_pgdat
	 * will pick up pages from other mem cgroup's as well. We hack
	 * the priority and make it zero.
	 */

The idea is that if one memcg is above its soft limit, we prefer
reclaiming pages from this memcg over reclaiming random other pages,
including those of other memcgs.

But the code flow looks like this:

	balance_pgdat
	  mem_cgroup_soft_limit_reclaim
	    mem_cgroup_shrink_node_zone
	      shrink_zone(0, zone, &sc)
	  shrink_zone(prio, zone, &sc)

so the success of the inner memcg shrink_zone does at least not
explicitly result in the outer, global shrink_zone steering clear of
other memcgs' pages. It just tries to move the pressure of balancing
the zones to the memcg with the biggest soft limit excess. That can
only really work if the memcg is a large enough contributor to the
zone's total number of LRU pages, though, and it looks very likely to
hit the exceeding memcg too hard in other cases.
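To make the priority semantics concrete: the number of pages that
shrink_zone() targets per LRU list is derived in get_scan_count() by
shifting the LRU size right by the priority. A rough sketch of that
relationship (simplified; the real get_scan_count() also factors in
the anon/file balance and differs between kernel versions):

	/*
	 * Sketch only, not the exact kernel code: how the priority
	 * argument to shrink_zone() scales the scan target.
	 */
	static unsigned long scan_target(unsigned long lru_pages,
					 int priority)
	{
		/*
		 * priority == 0:            scan all lru_pages
		 * priority == DEF_PRIORITY: scan lru_pages >> 12,
		 *                           i.e. 1/4096th of the LRU
		 */
		return lru_pages >> priority;
	}

So the inner shrink_zone(0, zone, &sc) call asks for every single LRU
page of the exceeding memcg in that zone, on every balancing pass.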
I am very much for removing this hack. There is still more scan
pressure applied to memcgs in excess of their soft limit even if the
extra scan is happening at a sane priority level. And the fact that
global reclaim operates completely unaware of memcgs is a different
story.

However, this code came into place with v2.6.31-8387-g4e41695. Why is
it only now showing up?

You also wrote in that thread that this happens on a standard F15
installation. On the F15 I am running here, systemd does not
configure memcgs, however. Did you manually configure memcgs and set
soft limits? Because I wonder how it ended up in soft limit reclaim
in the first place.

	Hannes
* Re: memcg: fix fatal livelock in kswapd
  2011-05-02 23:14 UTC
  From: Ying Han
  To: Johannes Weiner
  Cc: James Bottomley, Chris Mason, linux-fsdevel, linux-mm,
      linux-kernel, Paul Menage, Li Zefan, containers, Balbir Singh

On Mon, May 2, 2011 at 3:48 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The idea is that if one memcg is above its soft limit, we prefer
> reclaiming pages from this memcg over reclaiming random other pages,
> including those of other memcgs.
>
> But the code flow looks like this:
>
> 	balance_pgdat
> 	  mem_cgroup_soft_limit_reclaim
> 	    mem_cgroup_shrink_node_zone
> 	      shrink_zone(0, zone, &sc)
> 	  shrink_zone(prio, zone, &sc)
>
> so the success of the inner memcg shrink_zone does at least not
> explicitly result in the outer, global shrink_zone steering clear of
> other memcgs' pages. It just tries to move the pressure of balancing
> the zones to the memcg with the biggest soft limit excess. That can
> only really work if the memcg is a large enough contributor to the
> zone's total number of LRU pages, though, and it looks very likely to
> hit the exceeding memcg too hard in other cases.

Yes, the logic is selecting one memcg (the one exceeding its soft
limit the most) and starting hierarchical reclaim on it. It will loop
until one of the following conditions becomes true:

1. the memcg usage is below its soft_limit
2. we have looped 100 times
3. the reclaimed pages are equal to or greater than (excess >> 2),
   where excess is (usage - soft_limit)
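In rough pseudocode, the loop has this shape (a simplified sketch of
mem_cgroup_soft_limit_reclaim(); pick_biggest_excess(),
shrink_one_memcg() and soft_limit_excess() are hypothetical stand-ins
for the RB-tree lookup, the per-memcg shrink, and the excess
computation):

	unsigned long soft_limit_reclaim_sketch(struct zone *zone)
	{
		unsigned long nr_reclaimed = 0;
		int loop = 0;

		for (;;) {
			/* memcg exceeding its soft limit the most */
			struct mem_cgroup *mem = pick_biggest_excess(zone);
			unsigned long excess;

			if (!mem)
				break;

			nr_reclaimed += shrink_one_memcg(mem, zone);
			excess = soft_limit_excess(mem); /* usage - soft_limit */

			/* 1. usage is back below the soft limit */
			if (!excess)
				break;
			/* 2. bounded number of iterations */
			if (++loop > 100)
				break;
			/* 3. reclaimed at least a quarter of the excess */
			if (nr_reclaimed >= excess >> 2)
				break;
		}
		return nr_reclaimed;
	}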
Hmm, the worst case I can think of is when the memcg only has one page
allocated in the zone, and we end up looping 100 times each time while
not contributing much to the global reclaim.

> However, this code came into place with v2.6.31-8387-g4e41695. Why is
> it only now showing up?
>
> You also wrote in that thread that this happens on a standard F15
> installation. On the F15 I am running here, systemd does not
> configure memcgs, however. Did you manually configure memcgs and set
> soft limits? Because I wonder how it ended up in soft limit reclaim
> in the first place.

Curious as well. If we have a workload to reproduce it, I would like
to try.

--Ying
* Re: memcg: fix fatal livelock in kswapd
  2011-05-02 23:58 UTC
  From: James Bottomley
  To: Ying Han
  Cc: Johannes Weiner, Chris Mason, linux-fsdevel, linux-mm,
      linux-kernel, Paul Menage, Li Zefan, containers, Balbir Singh

On Mon, 2011-05-02 at 16:14 -0700, Ying Han wrote:
> On Mon, May 2, 2011 at 3:48 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > I am very much for removing this hack. There is still more scan
> > pressure applied to memcgs in excess of their soft limit even if the
> > extra scan is happening at a sane priority level. And the fact that
> > global reclaim operates completely unaware of memcgs is a different
> > story.
> >
> > However, this code came into place with v2.6.31-8387-g4e41695. Why is
> > it only now showing up?
> >
> > You also wrote in that thread that this happens on a standard F15
> > installation. On the F15 I am running here, systemd does not
> > configure memcgs, however. Did you manually configure memcgs and set
> > soft limits? Because I wonder how it ended up in soft limit reclaim
> > in the first place.

It doesn't ... it's a standard FC15 install. The mere fact of having
memcg compiled into the kernel is enough to trigger it (conversely,
disabling it at compile time fixes the problem).

> Curious as well. If we have a workload to reproduce it, I would like
> to try.

Well, the only one I can suggest is the one that produces it (a large
untar). There seems to be something magical about the memory size
(mine is 2G), because adding more also seems to make the problem go
away.

James
* Re: memcg: fix fatal livelock in kswapd
  2011-05-03 6:38 UTC
  From: Johannes Weiner
  To: James Bottomley
  Cc: Ying Han, Chris Mason, linux-fsdevel, linux-mm, linux-kernel,
      Paul Menage, Li Zefan, containers, Balbir Singh

On Mon, May 02, 2011 at 06:58:18PM -0500, James Bottomley wrote:
> It doesn't ... it's a standard FC15 install. The mere fact of having
> memcg compiled into the kernel is enough to trigger it (conversely,
> disabling it at compile time fixes the problem).

Does this mean you have not set one up yourself, or does it mean that
you have checked that no other software is setting up a soft-limited
memcg?

Right now, I still don't see how we could enter the problematic path
without one memcg exceeding its soft limit.

So if you have not done this yet, can you check the cgroup fs for
memcgs, and their memory.soft_limit_in_bytes and .usage_in_bytes,
right before you run the workload that reproduces the problem?

> Well, the only one I can suggest is the one that produces it (a large
> untar). There seems to be something magical about the memory size
> (mine is 2G), because adding more also seems to make the problem go
> away.

I'll try to reproduce this on my F15 as well.
* Re: memcg: fix fatal livelock in kswapd
  2011-05-03 14:11 UTC
  From: James Bottomley
  To: Johannes Weiner
  Cc: Ying Han, Chris Mason, linux-fsdevel, linux-mm, linux-kernel,
      Paul Menage, Li Zefan, containers, Balbir Singh

On Tue, 2011-05-03 at 08:38 +0200, Johannes Weiner wrote:
> Does this mean you have not set one up yourself, or does it mean that
> you have checked that no other software is setting up a soft-limited
> memcg?

Right, I've done nothing other than install and boot. As far as I can
tell from /sys/fs/cgroup/memory, nothing is defined other than the
standard limits.

> Right now, I still don't see how we could enter the problematic path
> without one memcg exceeding its soft limit.

Yes, that's what we all think too. The limit is way above my memory
size, though.

> So if you have not done this yet, can you check the cgroup fs for
> memcgs, and their memory.soft_limit_in_bytes and .usage_in_bytes,
> right before you run the workload that reproduces the problem?

Sure ... I've got the entire contents at the bottom.

> I'll try to reproduce this on my F15 as well.

It's an SMP kernel (the Core i5 Lenovo laptop has two cores with two
threads). Turning on PREEMPT makes the hang go away, but kswapd still
loops.
James

---

# for f in *; do echo -e "$f\t"; cat $f; done
cgroup.clone_children
0
cgroup.event_control
cat: cgroup.event_control: Invalid argument
cgroup.procs
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 49 50 51 52 53 54 60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 335 339 352 370 371 408 409 415 427 431
443 613 614 679 690 704 732 758 759 775 799 800 825 840 849 851 865
866 890 948 964 997 1000 1037
memory.failcnt
0
memory.force_empty
cat: memory.force_empty: Invalid argument
memory.limit_in_bytes
9223372036854775807
memory.max_usage_in_bytes
0
memory.move_charge_at_immigrate
0
memory.oom_control
oom_kill_disable 0
under_oom 0
memory.soft_limit_in_bytes
9223372036854775807
memory.stat
cache 68370432
rss 34246656
mapped_file 6008832
pgpgin 132627
pgpgout 107574
inactive_anon 6766592
active_anon 34226176
inactive_file 45350912
active_file 16228352
unevictable 0
hierarchical_memory_limit 9223372036854775807
total_cache 68370432
total_rss 34246656
total_mapped_file 6008832
total_pgpgin 132627
total_pgpgout 107574
total_inactive_anon 6766592
total_active_anon 34226176
total_inactive_file 45350912
total_active_file 16228352
total_unevictable 0
memory.swappiness
60
memory.usage_in_bytes
102617088
memory.use_hierarchy
0
notify_on_release
0
release_agent

tasks
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 49 50 51 52 53 54 60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 335 339 352 370 371 408 409 415 427 431
443 613 614 679 690 704 732 758 759 775 799 800 825 840 849 851 865
866 890 891 948 964 997 1000 1051
* Re: memcg: fix fatal livelock in kswapd
  2011-05-05 21:00 UTC
  From: Andrew Morton
  To: James Bottomley
  Cc: Johannes Weiner, Ying Han, Chris Mason, linux-fsdevel, linux-mm,
      linux-kernel, Paul Menage, Li Zefan, containers, Balbir Singh

The trail seems to have cooled off here, but it's pretty urgent.

Having re-read the threads, I find it notable that James hit a kswapd
softlockup with "non-PREEMPT CGROUP but disabled GROUP_MEM_RES_CTLR".
This suggests that the problem isn't with memcg. Or at least, that we
should fix this kswapd lockup before worrying about memcg.

And I'm not sure that we should be assuming that there's something
wrong in shrink_slab(). We know that kswapd has gone berserk, and
that it will frequently call shrink_slab() when in that mode. But
this may be because the top-level balance_pgdat() loop isn't
terminating for reasons unrelated to shrink_slab().
* Re: memcg: fix fatal livelock in kswapd
  2011-05-03 6:11 UTC
  From: Johannes Weiner
  To: Ying Han
  Cc: James Bottomley, Chris Mason, linux-fsdevel, linux-mm,
      linux-kernel, Paul Menage, Li Zefan, containers, Balbir Singh

On Mon, May 02, 2011 at 04:14:09PM -0700, Ying Han wrote:
> Yes, the logic is selecting one memcg (the one exceeding its soft
> limit the most) and starting hierarchical reclaim on it. It will
> loop until one of the following conditions becomes true:
>
> 1. the memcg usage is below its soft_limit
> 2. we have looped 100 times
> 3. the reclaimed pages are equal to or greater than (excess >> 2),
>    where excess is (usage - soft_limit)

There is no need to loop if we beat up the memcg in question with a
hammer during the first iteration ;-) That is, we have already done
the aggressive scan by the time all these conditions are checked.

> Hmm, the worst case I can think of is when the memcg only has one
> page allocated in the zone, and we end up looping 100 times each
> time while not contributing much to the global reclaim.

Good point; it should probably bail out earlier on a zone that does
not really contribute to the soft limit excess.
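One possible shape for such a bail-out, purely illustrative (both the
threshold and the mem_cgroup_zone_lru_pages() helper are hypothetical):

	/*
	 * Hypothetical sketch: inside the per-zone soft limit reclaim
	 * loop, give up on a memcg whose footprint in this zone is
	 * too small to matter for the zone's balance.
	 */
	if (mem_cgroup_zone_lru_pages(mem, zone) < SWAP_CLUSTER_MAX)
		break;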
* Re: memcg: fix fatal livelock in kswapd
  2011-05-07 21:59 UTC
  From: Balbir Singh
  To: Johannes Weiner
  Cc: James Bottomley, Chris Mason, linux-fsdevel, linux-mm,
      linux-kernel, Paul Menage, Li Zefan, containers

On Tue, May 3, 2011 at 4:18 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Actually, it's not that obvious. At least not to me. I added Balbir,
> who added said comment and code in the first place, to CC. Here is
> the comment in full quote:

I missed this email in my inbox; just saw it and am responding.

> 	/*
> 	 * NOTE: Although we can get the priority field, using it
> 	 * here is not a good idea, since it limits the pages we can scan.
> 	 * if we don't reclaim here, the shrink_zone from balance_pgdat
> 	 * will pick up pages from other mem cgroup's as well. We hack
> 	 * the priority and make it zero.
> 	 */
>
> The idea is that if one memcg is above its soft limit, we prefer
> reclaiming pages from this memcg over reclaiming random other pages,
> including those of other memcgs.

My comment and code were based on the observations I saw during my
tests. With DEF_PRIORITY we see scan >> priority in get_scan_count();
since we know exactly how much we are over the soft limit, it makes
sense to go after the pages so that normal balancing can be restored.

> But the code flow looks like this:
>
> 	balance_pgdat
> 	  mem_cgroup_soft_limit_reclaim
> 	    mem_cgroup_shrink_node_zone
> 	      shrink_zone(0, zone, &sc)
> 	  shrink_zone(prio, zone, &sc)
>
> so the success of the inner memcg shrink_zone does at least not
> explicitly result in the outer, global shrink_zone steering clear of
> other memcgs' pages.

Yes, but it allows soft limit reclaim to know what to target first for
success.

> You also wrote in that thread that this happens on a standard F15
> installation. On the F15 I am running here, systemd does not
> configure memcgs, however. Did you manually configure memcgs and set
> soft limits? Because I wonder how it ended up in soft limit reclaim
> in the first place.

I am running F15 as well, but have never hit the problem so far.
I am surprised to see the stack posted on the thread; it seemed like
you never explicitly enabled anything to wake up the memcg beast :)

Balbir
* Re: memcg: fix fatal livelock in kswapd
  2011-05-07 22:00 UTC
  From: Balbir Singh
  To: Johannes Weiner
  Cc: James Bottomley, Chris Mason, linux-fsdevel, linux-mm,
      linux-kernel, Paul Menage, Li Zefan, containers

Sorry, my mailer might have used intelligence to send HTML (that is
what happens when the setup changes; I apologize). Resending the
previous message in text format.

Balbir
* Re: memcg: fix fatal livelock in kswapd
  2011-05-02 22:53 UTC
  From: Paul Menage
  To: James Bottomley, Balbir Singh, KAMEZAWA Hiroyuki, nishimura
  Cc: Chris Mason, linux-fsdevel, linux-mm, linux-kernel, Li Zefan,
      containers

[ Adding the memcg maintainers ]

On Mon, May 2, 2011 at 1:07 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> The fatal livelock in kswapd, reported in this thread:
>
> 	http://marc.info/?t=130392066000001
>
> is mitigable if we prevent the cgroups code from being so aggressive
> in its zone shrinking (by raising its default shrink priority from 0
> [scan everything] to DEF_PRIORITY [scan a small fraction]). This
> will have an obvious knock-on effect on cgroup accounting, but that
> is better than hanging systems.
end of thread, other threads: [~2011-05-07 22:00 UTC | newest]

Thread overview: 11+ messages
2011-05-02 20:07 memcg: fix fatal livelock in kswapd  James Bottomley
2011-05-02 22:48 ` Johannes Weiner
2011-05-02 23:14   ` Ying Han
2011-05-02 23:58     ` James Bottomley
2011-05-03  6:38       ` Johannes Weiner
2011-05-03 14:11         ` James Bottomley
2011-05-05 21:00           ` Andrew Morton
2011-05-03  6:11     ` Johannes Weiner
2011-05-07 21:59   ` Balbir Singh
2011-05-07 22:00     ` Balbir Singh
2011-05-02 22:53 ` Paul Menage