* [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations @ 2011-05-11 15:29 Mel Gorman 2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman ` (3 more replies) 0 siblings, 4 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 15:29 UTC (permalink / raw)
To: Andrew Morton
Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Mel Gorman

Debian (and probably Ubuntu) have recently changed to SLUB as the default slab allocator. There are a few reports of people experiencing hangs when copying large amounts of data, with kswapd using a large amount of CPU. It appears this is down to SLUB using high orders by default and the page allocator and reclaim struggling to keep up. The following three patches reduce the cost of using those high orders.

Patch 1 prevents kswapd waking up in response to SLUB's speculative use of high orders. This eliminates the hangs and, while the system can still stall for long periods, it recovers.

Patch 2 further reduces the cost by preventing SLUB from entering direct compaction or reclaim paths, on the grounds that falling back to order-0 should be cheaper.

Patch 3 defaults SLUB to using order-0, on the grounds that the systems that benefit heavily from high orders are also sized to fit in physical memory. On such systems, slub_max_order=3 should be tuned manually.

My own data on this is not great. I haven't really been able to reproduce the same problem locally, and a significant failing is that the tests weren't stressing X; I couldn't make meaningful comparisons by just randomly clicking on things (working on fixing this problem).

The test case is simple. "download tar" wgets a large tar file and stores it locally. "unpack" expands it (15 times physical RAM in this case) and "delete source dirs" is the tarfile being deleted again. I also experimented with having the tar copied numerous times and into deeper directories to increase the size, but the results were not particularly interesting so I left it as one tar.

Test server, 4 CPU threads (AMD Phenom), x86_64, 2G of RAM, no X running

                     largecopy-vanilla  nowake-kswapd-v1r1  noexstep-v1r1  default0-v1r1
download tar           94 ( 0.00%)        94 ( 0.00%)        94 ( 0.00%)    93 ( 1.08%)
unpack tar            521 ( 0.00%)       551 (-5.44%)       482 ( 8.09%)   488 ( 6.76%)
delete source dirs    208 ( 0.00%)       218 (-4.59%)       194 ( 7.22%)   194 ( 7.22%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)    740.82    777.73    739.98    747.47
Total Elapsed Time (seconds)           1046.66   1273.91    962.47    936.17

Disabling kswapd wakeups alone hurts performance slightly even though testers report it fixes the hangs. I would guess it's because SLUB callers are calling direct reclaim more frequently (I belatedly noticed that compaction was disabled, so it's not a factor) but I haven't confirmed it. However, preventing kswapd waking or direct reclaim being entered, with SLUB falling back to order-0, performed noticeably faster. Just using order-0 in the first place was fastest of all.

I tried running the same test on a test laptop but unfortunately, due to a misconfiguration, the results were lost. It would take a few hours to rerun, so I am posting without them. If the testers verify this series helps and we agree the patches are appropriate, they should be considered a stable candidate for 2.6.38.

 Documentation/vm/slub.txt |    2 +-
 mm/page_alloc.c           |    3 ++-
 mm/slub.c                 |    5 +++--
 3 files changed, 6 insertions(+), 4 deletions(-)

-- 
1.7.3.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations 2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman @ 2011-05-11 15:29 ` Mel Gorman 2011-05-11 20:38 ` David Rientjes 2011-05-11 15:29 ` [PATCH 2/3] mm: slub: Do not take expensive steps " Mel Gorman ` (2 subsequent siblings) 3 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 15:29 UTC (permalink / raw)
To: Andrew Morton
Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses high-order allocations and falls back to lower orders if they fail. However, simply attempting the allocation wakes kswapd to start reclaiming at that order. On a desktop system, two users report that the system is getting locked up, with kswapd using large amounts of CPU. Using SLAB instead of SLUB made this problem go away.

This patch prevents kswapd being woken up for high-order allocations. Testing indicated that with this patch applied, the system was much harder to hang and, even when it did hang, it eventually recovered.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slub.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9d2e5e4..98c358d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * Let the initial higher-order allocation fail under memory pressure
 	 * so we fall-back to the minimum order allocation.
 	 */
-	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
+	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;

 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 77+ messages in thread
* Re: [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations 2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman @ 2011-05-11 20:38 ` David Rientjes 0 siblings, 0 replies; 77+ messages in thread
From: David Rientjes @ 2011-05-11 20:38 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Wed, 11 May 2011, Mel Gorman wrote:

> To avoid locking and per-cpu overhead, SLUB optimistically uses
> high-order allocations and falls back to lower orders if they
> fail. However, by simply trying to allocate, kswapd is woken up to
> start reclaiming at that order. On a desktop system, two users report
> that the system is getting locked up with kswapd using large amounts
> of CPU. Using SLAB instead of SLUB made this problem go away.
>
> This patch prevents kswapd being woken up for high-order allocations.
> Testing indicated that with this patch applied, the system was much
> harder to hang and even when it did, it eventually recovered.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 2/3] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations 2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman 2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman @ 2011-05-11 15:29 ` Mel Gorman 2011-05-11 20:38 ` David Rientjes 2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman 2011-05-11 21:39 ` [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations James Bottomley 3 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 15:29 UTC (permalink / raw)
To: Andrew Morton
Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses high-order allocations and falls back to lower orders if they fail. However, simply attempting the allocation can push the caller into compaction or direct reclaim - both of which are likely to cost more than the benefit of using high-order pages in SLUB. On a desktop system, two users report that the system is getting stalled, with kswapd using large amounts of CPU.

This patch prevents SLUB taking any expensive steps when trying to use high-order allocations. Instead, it is expected to fall back to smaller orders more aggressively. Testing from users was somewhat inconclusive on how much this helped, but local tests showed it made a positive difference. It makes sense that falling back to order-0 allocations is faster than entering compaction or direct reclaim.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |    3 ++-
 mm/slub.c       |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..057f1e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);

 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);

-	if (!wait) {
+	if (!wait && can_wake_kswapd) {
 		/*
 		 * Not worth trying to allocate harder for
 		 * __GFP_NOMEMALLOC even if it can't schedule.
diff --git a/mm/slub.c b/mm/slub.c
index 98c358d..1071723 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * Let the initial higher-order allocation fail under memory pressure
 	 * so we fall-back to the minimum order allocation.
 	 */
-	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
+	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) &
+			~(__GFP_NOFAIL | __GFP_WAIT);

 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 77+ messages in thread
* Re: [PATCH 2/3] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations 2011-05-11 15:29 ` [PATCH 2/3] mm: slub: Do not take expensive steps " Mel Gorman @ 2011-05-11 20:38 ` David Rientjes 2011-05-11 21:10 ` Mel Gorman 0 siblings, 1 reply; 77+ messages in thread
From: David Rientjes @ 2011-05-11 20:38 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Wed, 11 May 2011, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9f8a97b..057f1e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> {
> 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> 	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
>
> 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> 	 */
> 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>
> -	if (!wait) {
> +	if (!wait && can_wake_kswapd) {
> 		/*
> 		 * Not worth trying to allocate harder for
> 		 * __GFP_NOMEMALLOC even if it can't schedule.
> diff --git a/mm/slub.c b/mm/slub.c
> index 98c358d..1071723 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> 	 * Let the initial higher-order allocation fail under memory pressure
> 	 * so we fall-back to the minimum order allocation.
> 	 */
> -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) &
> +			~(__GFP_NOFAIL | __GFP_WAIT);

__GFP_NORETRY is a no-op without __GFP_WAIT.

>
> 	page = alloc_slab_page(alloc_gfp, node, oo);
> 	if (unlikely(!page)) {

^ permalink raw reply	[flat|nested] 77+ messages in thread
* Re: [PATCH 2/3] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations 2011-05-11 20:38 ` David Rientjes @ 2011-05-11 21:10 ` Mel Gorman 2011-05-12 17:25 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 21:10 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Wed, May 11, 2011 at 01:38:44PM -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9f8a97b..057f1e2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > {
> > 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > 	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> >
> > 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> > 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > 	 */
> > 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> >
> > -	if (!wait) {
> > +	if (!wait && can_wake_kswapd) {
> > 		/*
> > 		 * Not worth trying to allocate harder for
> > 		 * __GFP_NOMEMALLOC even if it can't schedule.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 98c358d..1071723 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> > 	 * Let the initial higher-order allocation fail under memory pressure
> > 	 * so we fall-back to the minimum order allocation.
> > 	 */
> > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) &
> > +			~(__GFP_NOFAIL | __GFP_WAIT);
>
> __GFP_NORETRY is a no-op without __GFP_WAIT.
>

True. I'll remove it in a V2 but I won't respin just yet.

> >
> > 	page = alloc_slab_page(alloc_gfp, node, oo);
> > 	if (unlikely(!page)) {

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread
* Re: [PATCH 2/3] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations 2011-05-11 21:10 ` Mel Gorman @ 2011-05-12 17:25 ` Andrea Arcangeli 0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 17:25 UTC (permalink / raw)
To: Mel Gorman
Cc: David Rientjes, Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

Hi,

On Wed, May 11, 2011 at 10:10:43PM +0100, Mel Gorman wrote:
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 98c358d..1071723 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> > > 	 * Let the initial higher-order allocation fail under memory pressure
> > > 	 * so we fall-back to the minimum order allocation.
> > > 	 */
> > > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) &
> > > +			~(__GFP_NOFAIL | __GFP_WAIT);
> >
> > __GFP_NORETRY is a no-op without __GFP_WAIT.
>
> True. I'll remove it in a V2 but I won't respin just yet.

There is nothing wrong, and no performance difference, in clearing __GFP_NORETRY too; if anything, it doesn't make sense for a caller to use __GFP_NOFAIL without __GFP_WAIT, so the original version above looks cleaner.

I like this change overall: it only polls the buddy allocator, without spinning kswapd and without invoking lumpy reclaim. Like you noted in the first mail, compaction was disabled, and very bad behaviour is expected without it unless GFP_ATOMIC|__GFP_NO_KSWAPD is set (that was the combination I had to use when first developing THP, before lumpy reclaim was disabled, for the same reasons). With compaction enabled, slub could try clearing only __GFP_NOFAIL and leaving __GFP_WAIT, and no bad behaviour should happen, but it's probably slower, so I prefer to clear __GFP_WAIT too (for THP, compaction is worth it because the allocation is generally long-lived, but a slub allocation such as a tiny skb can be extremely short-lived, so it's unlikely to be worth it). This way compaction is invoked only by the minimal order allocation later, if needed.

^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman 2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman 2011-05-11 15:29 ` [PATCH 2/3] mm: slub: Do not take expensive steps " Mel Gorman @ 2011-05-11 15:29 ` Mel Gorman 2011-05-11 20:38 ` David Rientjes 2011-05-12 14:43 ` Christoph Lameter 2011-05-11 21:39 ` [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations James Bottomley 3 siblings, 2 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 15:29 UTC (permalink / raw)
To: Andrew Morton
Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses high-order allocations up to order-3 by default and falls back to lower orders if they fail. While care is taken that the caller and kswapd take no unusual steps in response to this, there are further consequences, such as shrinkers having to free more objects to release any memory. There is anecdotal evidence that significant time is being spent looping in shrinkers with insufficient progress being made (https://lkml.org/lkml/2011/4/28/361), keeping kswapd awake.

SLUB is now the default allocator and some bug reports have been pinned down to SLUB using high orders during operations like copying large amounts of data. SLUB's use of high orders benefits applications that are sized to memory appropriately, but this does not necessarily apply to large file servers or desktops. This patch causes SLUB to use order-0 pages by default, like SLAB does. There is further evidence that this keeps kswapd's usage lower (https://lkml.org/lkml/2011/5/10/383).

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/vm/slub.txt |    2 +-
 mm/slub.c                 |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
index 07375e7..778e9fa 100644
--- a/Documentation/vm/slub.txt
+++ b/Documentation/vm/slub.txt
@@ -117,7 +117,7 @@ can be influenced by kernel parameters:

 slub_min_objects=x	(default 4)
 slub_min_order=x	(default 0)
-slub_max_order=x	(default 1)
+slub_max_order=x	(default 0)

 slub_min_objects allows to specify how many objects must at least fit
 into one slab in order for the allocation order to be acceptable.
diff --git a/mm/slub.c b/mm/slub.c
index 1071723..23a4789 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
  * take the list_lock.
  */
 static int slub_min_order;
-static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
+static int slub_max_order;
 static int slub_min_objects;

 /*
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman @ 2011-05-11 20:38 ` David Rientjes 2011-05-11 20:53 ` James Bottomley ` (2 more replies) 2011-05-12 14:43 ` Christoph Lameter 1 sibling, 3 replies; 77+ messages in thread
From: David Rientjes @ 2011-05-11 20:38 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Wed, 11 May 2011, Mel Gorman wrote:

> To avoid locking and per-cpu overhead, SLUB optimistically uses
> high-order allocations up to order-3 by default and falls back to
> lower allocations if they fail. While care is taken that the caller
> and kswapd take no unusual steps in response to this, there are
> further consequences like shrinkers who have to free more objects to
> release any memory. There is anecdotal evidence that significant time
> is being spent looping in shrinkers with insufficient progress being
> made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
>
> SLUB is now the default allocator and some bug reports have been
> pinned down to SLUB using high orders during operations like
> copying large amounts of data. SLUBs use of high-orders benefits
> applications that are sized to memory appropriately but this does not
> necessarily apply to large file servers or desktops. This patch
> causes SLUB to use order-0 pages like SLAB does by default.
> There is further evidence that this keeps kswapd's usage lower
> (https://lkml.org/lkml/2011/5/10/383).
>

This is going to severely impact slub's performance for applications on machines with plenty of memory available, where fragmentation isn't a concern when allocating from caches with large object sizes (even changing the min order of kmalloc-256 from 1 to 0!), by default for users who don't use slub_max_order=3 on the command line. SLUB relies heavily on allocating from the cpu slab and freeing to the cpu slab to avoid the slowpaths, so higher order slabs are important for its performance.

I can get numbers for a simple netperf TCP_RR benchmark with this patch applied to show the degradation on a server with >32GB of RAM.

It would be ideal if this default could be adjusted based on the amount of memory available in the smallest node to determine whether we're concerned about making higher order allocations. (Using the smallest node as a metric so that mempolicies and cpusets don't get unfairly biased against.) With the previous changes in this patchset, specifically avoiding waking kswapd and doing compaction for the higher order allocs before falling back to the min order, it shouldn't be devastating to try an order-3 alloc that will fail quickly.

> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  Documentation/vm/slub.txt |    2 +-
>  mm/slub.c                 |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
> index 07375e7..778e9fa 100644
> --- a/Documentation/vm/slub.txt
> +++ b/Documentation/vm/slub.txt
> @@ -117,7 +117,7 @@ can be influenced by kernel parameters:
>
> slub_min_objects=x	(default 4)
> slub_min_order=x	(default 0)
> -slub_max_order=x	(default 1)
> +slub_max_order=x	(default 0)

Hmm, that was wrong to begin with, it should have been 3.

>
> slub_min_objects allows to specify how many objects must at least fit
> into one slab in order for the allocation order to be acceptable.
> diff --git a/mm/slub.c b/mm/slub.c
> index 1071723..23a4789 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
>  * take the list_lock.
>  */
> static int slub_min_order;
> -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> +static int slub_max_order;
> static int slub_min_objects;
>
> /*

^ permalink raw reply	[flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-11 20:38 ` David Rientjes @ 2011-05-11 20:53 ` James Bottomley 2011-05-11 21:09 ` Mel Gorman 2011-05-12 17:36 ` Andrea Arcangeli 2 siblings, 0 replies; 77+ messages in thread From: James Bottomley @ 2011-05-11 20:53 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, 2011-05-11 at 13:38 -0700, David Rientjes wrote: > On Wed, 11 May 2011, Mel Gorman wrote: > > > To avoid locking and per-cpu overhead, SLUB optimisically uses > > high-order allocations up to order-3 by default and falls back to > > lower allocations if they fail. While care is taken that the caller > > and kswapd take no unusual steps in response to this, there are > > further consequences like shrinkers who have to free more objects to > > release any memory. There is anecdotal evidence that significant time > > is being spent looping in shrinkers with insufficient progress being > > made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake. > > > > SLUB is now the default allocator and some bug reports have been > > pinned down to SLUB using high orders during operations like > > copying large amounts of data. SLUBs use of high-orders benefits > > applications that are sized to memory appropriately but this does not > > necessarily apply to large file servers or desktops. This patch > > causes SLUB to use order-0 pages like SLAB does by default. > > There is further evidence that this keeps kswapd's usage lower > > (https://lkml.org/lkml/2011/5/10/383). > > > > This is going to severely impact slub's performance for applications on > machines with plenty of memory available where fragmentation isn't a > concern when allocating from caches with large object sizes (even > changing the min order of kamlloc-256 from 1 to 0!) 
by default for users > who don't use slub_max_order=3 on the command line. SLUB relies heavily > on allocating from the cpu slab and freeing to the cpu slab to avoid the > slowpaths, so higher order slabs are important for its performance. > > I can get numbers for a simple netperf TCP_RR benchmark with this change > applied to show the degradation on a server with >32GB of RAM with this > patch applied. > > It would be ideal if this default could be adjusted based on the amount of > memory available in the smallest node to determine whether we're concerned > about making higher order allocations. (Using the smallest node as a > metric so that mempolicies and cpusets don't get unfairly biased against.) > With the previous changes in this patchset, specifically avoiding waking > kswapd and doing compaction for the higher order allocs before falling > back to the min order, it shouldn't be devastating to try an order-3 alloc > that will fail quickly. So my testing has shown that simply booting the kernel with slub_max_order=0 makes the hang I'm seeing go away. This definitely implicates the higher order allocations in the kswapd problem. I think it would be wise not to make it the default until we can sort out the root cause. James ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-11 20:38 ` David Rientjes 2011-05-11 20:53 ` James Bottomley @ 2011-05-11 21:09 ` Mel Gorman 2011-05-11 22:27 ` David Rientjes 2011-05-12 17:36 ` Andrea Arcangeli 2 siblings, 1 reply; 77+ messages in thread From: Mel Gorman @ 2011-05-11 21:09 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote: > On Wed, 11 May 2011, Mel Gorman wrote: > > > To avoid locking and per-cpu overhead, SLUB optimisically uses > > high-order allocations up to order-3 by default and falls back to > > lower allocations if they fail. While care is taken that the caller > > and kswapd take no unusual steps in response to this, there are > > further consequences like shrinkers who have to free more objects to > > release any memory. There is anecdotal evidence that significant time > > is being spent looping in shrinkers with insufficient progress being > > made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake. > > > > SLUB is now the default allocator and some bug reports have been > > pinned down to SLUB using high orders during operations like > > copying large amounts of data. SLUBs use of high-orders benefits > > applications that are sized to memory appropriately but this does not > > necessarily apply to large file servers or desktops. This patch > > causes SLUB to use order-0 pages like SLAB does by default. > > There is further evidence that this keeps kswapd's usage lower > > (https://lkml.org/lkml/2011/5/10/383). 
> > > > This is going to severely impact slub's performance for applications on > machines with plenty of memory available where fragmentation isn't a > concern when allocating from caches with large object sizes (even > changing the min order of kmalloc-256 from 1 to 0!) by default for users > who don't use slub_max_order=3 on the command line. SLUB relies heavily > on allocating from the cpu slab and freeing to the cpu slab to avoid the > slowpaths, so higher order slabs are important for its performance. > I agree with you that there are situations where plenty of memory means that it'll perform much better. However, indications are that it breaks down with high CPU usage when memory is low. Worse, once fragmentation becomes a problem, large amounts of UNMOVABLE and RECLAIMABLE will make it progressively more expensive to find the necessary pages. Perhaps with patches 1 and 2, this is not as much of a problem but figures in the leader indicated that for a simple workload with large amounts of files and data exceeding physical memory that it was better off not to use high orders at all which is a situation I'd expect to be encountered by more users than performance-sensitive applications. In other words, we're taking one hit or the other. > I can get numbers for a simple netperf TCP_RR benchmark with this change > applied to show the degradation on a server with >32GB of RAM with this > patch applied. > Agreed, I'd expect netperf TCP_RR or TCP_STREAM to take a hit, particularly on a local machine where the recycling of pages will impact it heavily. > It would be ideal if this default could be adjusted based on the amount of > memory available in the smallest node to determine whether we're concerned > about making higher order allocations. It's not a function of memory size, working set size is what is important or at least how many new pages have been allocated recently. Fit your workload in physical memory - high orders are great.
Go larger than that and you hit problems. James' testing indicated that kswapd CPU usage dropped to far lower levels with this patch applied, in his test of untarring a large file for example. > (Using the smallest node as a > metric so that mempolicies and cpusets don't get unfairly biased against.) > With the previous changes in this patchset, specifically avoiding waking > kswapd and doing compaction for the higher order allocs before falling > back to the min order, it shouldn't be devastating to try an order-3 alloc > that will fail quickly. > Which is more reasonable? That an ordinary user gets a default that is fairly safe even if benchmarks that demand the highest performance from SLUB take a hit, or that administrators running such workloads set slub_max_order=3? > > Signed-off-by: Mel Gorman <mgorman@suse.de> > > --- > > Documentation/vm/slub.txt | 2 +- > > mm/slub.c | 2 +- > > 2 files changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt > > index 07375e7..778e9fa 100644 > > --- a/Documentation/vm/slub.txt > > +++ b/Documentation/vm/slub.txt > > @@ -117,7 +117,7 @@ can be influenced by kernel parameters: > > > > slub_min_objects=x (default 4) > > slub_min_order=x (default 0) > > -slub_max_order=x (default 1) > > +slub_max_order=x (default 0) > > Hmm, that was wrong to begin with, it should have been 3. > True, but I didn't see the point of fixing it in a separate patch. If this patch gets rejected, I'll submit a documentation fix. > > > > slub_min_objects allows to specify how many objects must at least fit > > into one slab in order for the allocation order to be acceptable. > > diff --git a/mm/slub.c b/mm/slub.c > > index 1071723..23a4789 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free); > > * take the list_lock. 
> > */ > > static int slub_min_order; > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER; > > +static int slub_max_order; > > static int slub_min_objects; > > > > /* -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-11 21:09 ` Mel Gorman @ 2011-05-11 22:27 ` David Rientjes 2011-05-13 10:14 ` Mel Gorman 0 siblings, 1 reply; 77+ messages in thread From: David Rientjes @ 2011-05-11 22:27 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, 11 May 2011, Mel Gorman wrote: > I agree with you that there are situations where plenty of memory > means that it'll perform much better. However, indications are > that it breaks down with high CPU usage when memory is low. Worse, > once fragmentation becomes a problem, large amounts of UNMOVABLE and > RECLAIMABLE will make it progressively more expensive to find the > necessary pages. Perhaps with patches 1 and 2, this is not as much > of a problem but figures in the leader indicated that for a simple > workload with large amounts of files and data exceeding physical > memory that it was better off not to use high orders at all which > is a situation I'd expect to be encountered by more users than > performance-sensitive applications. > > In other words, we're taking one hit or the other. > Seems like the ideal solution would then be to find how best to set the default, and that can probably only be done with the size of the smallest node since it has a higher likelihood of encountering a large amount of unreclaimable slab when memory is low. > > I can get numbers for a simple netperf TCP_RR benchmark with this change > > applied to show the degradation on a server with >32GB of RAM with this > > patch applied. > > > > Agreed, I'd expect netperf TCP_RR or TCP_STREAM to take a hit, > particularly on a local machine where the recycling of pages will > impact it heavily. 
> Ignoring the local machine for a second, TCP_RR probably shouldn't be taking any more of a hit with slub than it already is. When I benchmarked slab vs. slub a couple of months ago with two machines, each with four quad-core Opterons and 64GB of memory, this benchmark showed slub was already 10-15% slower. That's why slub has always been unusable for us, and I'm surprised that it's now becoming the favorite of distros everywhere (and, yes, Ubuntu now defaults to it as well). > > It would be ideal if this default could be adjusted based on the amount of > > memory available in the smallest node to determine whether we're concerned > > about making higher order allocations. > > It's not a function of memory size, working set size is what > is important or at least how many new pages have been allocated > recently. Fit your workload in physical memory - high orders are > great. Go larger than that and you hit problems. James' testing > indicated that kswapd CPU usage dropped to far lower levels with this > patch applied, in his test of untarring a large file for example. > My point is that it would probably be better to tune the default based on how much memory is available at boot, since it implies the probability of having an abundance of memory while populating the caches' partial lists up to min_partial, rather than changing it for everyone when it is known that it will cause performance degradations if memory is never low. We probably don't want to be doing order-3 allocations for half the slab caches when we have 1G of memory available, but that's acceptable with 64GB. > > (Using the smallest node as a > > metric so that mempolicies and cpusets don't get unfairly biased against.) > > With the previous changes in this patchset, specifically avoiding waking > > kswapd and doing compaction for the higher order allocs before falling > > back to the min order, it shouldn't be devastating to try an order-3 alloc > > that will fail quickly. 
> > > > Which is more reasonable? That an ordinary user gets a default that > is fairly safe even if benchmarks that demand the highest performance > from SLUB take a hit or that administrators running such workloads > set slub_max_order=3? > Not sure what is more reasonable since it depends on what the workload is, but what probably is unreasonable is changing a slub default that is known to directly impact performance by presenting a single benchmark under consideration without some due diligence in testing others like netperf. We all know that slub has some disadvantages compared to slab that are only now being realized because it has become the debian default, but it does excel at some workloads -- it was initially presented to beat slab in kernbench, hackbench, sysbench, and aim9 when it was merged. Those advantages may never be fully realized on laptops or desktop machines, but on machines with plenty of memory available, slub often does perform better than slab. That's why I suggested tuning the max order default based on total memory; it would probably be easier to justify than changing it for everyone and demanding that users who are completely happy with using slub, the kernel.org default for years, now use command line options.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-11 22:27 ` David Rientjes @ 2011-05-13 10:14 ` Mel Gorman 0 siblings, 0 replies; 77+ messages in thread From: Mel Gorman @ 2011-05-13 10:14 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, May 11, 2011 at 03:27:11PM -0700, David Rientjes wrote: > On Wed, 11 May 2011, Mel Gorman wrote: > > > I agree with you that there are situations where plenty of memory > > means that it'll perform much better. However, indications are > > that it breaks down with high CPU usage when memory is low. Worse, > > once fragmentation becomes a problem, large amounts of UNMOVABLE and > > RECLAIMABLE will make it progressively more expensive to find the > > necessary pages. Perhaps with patches 1 and 2, this is not as much > > of a problem but figures in the leader indicated that for a simple > > workload with large amounts of files and data exceeding physical > > memory that it was better off not to use high orders at all which > > is a situation I'd expect to be encountered by more users than > > performance-sensitive applications. > > > > In other words, we're taking one hit or the other. > > > > Seems like the ideal solution would then be to find how best to set the > default, and that can probably only be done with the size of the smallest > node since it has a higher likelihood of encountering a large amount of > unreclaimable slab when memory is low. > Ideally yes, but glancing through this thread and thinking on it a bit more, I'm going to drop this patch. As pointed out, SLUB with high orders has been in use with distributions already so the breakage is elsewhere. Patches 1 and 2 still make some sense but they're not the root cause. 
> <SNIP> -- Mel Gorman SUSE Labs
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-11 20:38 ` David Rientjes 2011-05-11 20:53 ` James Bottomley 2011-05-11 21:09 ` Mel Gorman @ 2011-05-12 17:36 ` Andrea Arcangeli 2011-05-16 21:03 ` David Rientjes 2 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2011-05-12 17:36 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote: > kswapd and doing compaction for the higher order allocs before falling Note that patch 2 disabled compaction by clearing __GFP_WAIT. What you describe here would be patch 2 without the ~__GFP_WAIT addition (so keeping only ~GFP_NOFAIL). Not clearing __GFP_WAIT when compaction is enabled is possible and shouldn't result in bad behavior (if compaction is not enabled with current SLUB it's hard to imagine how it could perform decently if there's fragmentation). You should try to benchmark to see if it's worth it on the large NUMA systems with heavy network traffic (for normal systems I doubt compaction is worth it but I'm not against trying to keep it enabled just in case). On a side note, this reminds me to rebuild with slub_max_order in .bss on my cellphone (where I can't switch to SLAB because of some silly rfs vfat-on-steroids proprietary module).
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 17:36 ` Andrea Arcangeli @ 2011-05-16 21:03 ` David Rientjes 2011-05-17 9:48 ` Mel Gorman 0 siblings, 1 reply; 77+ messages in thread From: David Rientjes @ 2011-05-16 21:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mel Gorman, Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011, Andrea Arcangeli wrote: > On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote: > > kswapd and doing compaction for the higher order allocs before falling > > Note that patch 2 disabled compaction by clearing __GFP_WAIT. > > What you describe here would be patch 2 without the ~__GFP_WAIT > addition (so keeping only ~GFP_NOFAIL). > It's out of context, my sentence was: "With the previous changes in this patchset, specifically avoiding waking kswapd and doing compaction for the higher order allocs before falling back to the min order..." meaning this patchset avoids waking kswapd and avoids doing compaction. > Not clearing __GFP_WAIT when compaction is enabled is possible and > shouldn't result in bad behavior (if compaction is not enabled with > current SLUB it's hard to imagine how it could perform decently if > there's fragmentation). You should try to benchmark to see if it's > worth it on the large NUMA systems with heavy network traffic (for > normal systems I doubt compaction is worth it but I'm not against > trying to keep it enabled just in case). > The fragmentation isn't the only issue with the netperf TCP_RR benchmark, the problem is that the slub slowpath is being used >95% of the time on every allocation and free for the very large number of kmalloc-256 and kmalloc-2K caches. 
Those caches are order 1 and 3, respectively, on my system by default, but the page allocator seldom gets invoked for such a benchmark after the partial lists are populated: the overhead is from the per-node locking required in the slowpath to traverse the partial lists. See the data I presented two years ago: http://lkml.org/lkml/2009/3/30/15.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-16 21:03 ` David Rientjes @ 2011-05-17 9:48 ` Mel Gorman 2011-05-17 19:25 ` David Rientjes 0 siblings, 1 reply; 77+ messages in thread From: Mel Gorman @ 2011-05-17 9:48 UTC (permalink / raw) To: David Rientjes Cc: Andrea Arcangeli, Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Mon, May 16, 2011 at 02:03:33PM -0700, David Rientjes wrote: > On Thu, 12 May 2011, Andrea Arcangeli wrote: > > > On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote: > > > kswapd and doing compaction for the higher order allocs before falling > > > > Note that patch 2 disabled compaction by clearing __GFP_WAIT. > > > > What you describe here would be patch 2 without the ~__GFP_WAIT > > addition (so keeping only ~GFP_NOFAIL). > > > > It's out of context, my sentence was: > > "With the previous changes in this patchset, specifically avoiding waking > kswapd and doing compaction for the higher order allocs before falling > back to the min order..." > > meaning this patchset avoids waking kswapd and avoids doing compaction. > Ok. > > Not clearing __GFP_WAIT when compaction is enabled is possible and > > shouldn't result in bad behavior (if compaction is not enabled with > > current SLUB it's hard to imagine how it could perform decently if > > there's fragmentation). You should try to benchmark to see if it's > > worth it on the large NUMA systems with heavy network traffic (for > > normal systems I doubt compaction is worth it but I'm not against > > trying to keep it enabled just in case). > > > > The fragmentation isn't the only issue with the netperf TCP_RR benchmark, > the problem is that the slub slowpath is being used >95% of the time on > every allocation and free for the very large number of kmalloc-256 and > kmalloc-2K caches. 
Ok, that makes sense as I'd fully expect that benchmark to exhaust the per-cpu page (high order or otherwise) of slab objects routinely with the default configuration and I'd also expect the freeing on the other side to be releasing slabs frequently to the partial or empty lists. > Those caches are order 1 and 3, respectively, on my > system by default, but the page allocator seldom gets invoked for such a > benchmark after the partial lists are populated: the overhead is from the > per-node locking required in the slowpath to traverse the partial lists. > See the data I presented two years ago: http://lkml.org/lkml/2009/3/30/15. Ok, I can see how this patch would indeed make the situation worse. I vaguely recall that there were other patches that would increase the per-cpu lists of objects but have no recollection as to what happened to them. Maybe Christoph remembers but one way or the other, it's out of scope for James' and Colin's bug. -- Mel Gorman SUSE Labs
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-17 9:48 ` Mel Gorman @ 2011-05-17 19:25 ` David Rientjes 0 siblings, 0 replies; 77+ messages in thread From: David Rientjes @ 2011-05-17 19:25 UTC (permalink / raw) To: Mel Gorman Cc: Andrea Arcangeli, Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Tue, 17 May 2011, Mel Gorman wrote: > > The fragmentation isn't the only issue with the netperf TCP_RR benchmark, > > the problem is that the slub slowpath is being used >95% of the time on > > every allocation and free for the very large number of kmalloc-256 and > > kmalloc-2K caches. > > Ok, that makes sense as I'd fully expect that benchmark to exhaust > the per-cpu page (high order or otherwise) of slab objects routinely > with the default configuration and I'd also expect the freeing on the > other side to be releasing slabs frequently to the partial or empty lists. > That's most of the problem, but it's compounded on this benchmark because the slab pulled from the partial list to replace the per-cpu page typically only has a very minimal number (2 or 3) of free objects, so it can only serve one allocation and then requires the allocation slowpath to pull yet another slab from the partial list the next time around. I had a patchset that addressed that, which I called "slab thrashing", by only pulling a slab from the partial list when it had a pre-defined proportion of available objects and otherwise skipping it, and that ended up helping the benchmark by 5-7%. Smaller orders will make this worse, as well, since if there were only 2 or 3 free objects on an order-3 slab before, there's no chance that's going to be equivalent on an order-0 slab.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman 2011-05-11 20:38 ` David Rientjes @ 2011-05-12 14:43 ` Christoph Lameter 2011-05-12 15:15 ` James Bottomley 1 sibling, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 14:43 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, 11 May 2011, Mel Gorman wrote: > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free); > * take the list_lock. > */ > static int slub_min_order; > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER; > +static int slub_max_order; If we really need to do this then do not push this down to zero please. SLAB uses order 1 for the max. Let's at least keep it there. We have been using SLUB for a long time. Why is this issue arising now? Due to compaction etc making reclaim less efficient?
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 14:43 ` Christoph Lameter @ 2011-05-12 15:15 ` James Bottomley 2011-05-12 15:27 ` Christoph Lameter 0 siblings, 1 reply; 77+ messages in thread From: James Bottomley @ 2011-05-12 15:15 UTC (permalink / raw) To: Christoph Lameter Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 09:43 -0500, Christoph Lameter wrote: > On Wed, 11 May 2011, Mel Gorman wrote: > > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free); > > * take the list_lock. > > */ > > static int slub_min_order; > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER; > > +static int slub_max_order; > > If we really need to do this then do not push this down to zero please. > SLAB uses order 1 for the max. Let's at least keep it there. 1 is the current value. Reducing it to zero seems to fix the kswapd induced hangs. The problem does look to be some shrinker/allocator interference somewhere in vmscan.c, but the fact is that it's triggered by SLUB and not SLAB. I really think that what's happening is some type of feedback loop where one of the shrinkers is issuing a wakeup_kswapd() so kswapd never sleeps (and never relinquishes the CPU on non-preempt). > We have been using SLUB for a long time. Why is this issue arising now? > Due to compaction etc making reclaim less efficient? This is the snark argument (I've said it thrice the bellman cried and what I tell you three times is true). The fact is that no enterprise distribution at all uses SLUB. It's only recently that the desktop distributions started to ... the bugs are showing up under FC15 beta, which is the first fedora distribution to enable it. I'd say we're only just beginning widespread SLUB testing. 
James
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 15:15 ` James Bottomley @ 2011-05-12 15:27 ` Christoph Lameter 2011-05-12 15:43 ` James Bottomley 2011-05-12 15:45 ` Dave Jones 0 siblings, 2 replies; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 15:27 UTC (permalink / raw) To: James Bottomley Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011, James Bottomley wrote: > > > */ > > > static int slub_min_order; > > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER; > > > +static int slub_max_order; > > > > If we really need to do this then do not push this down to zero please. > > SLAB uses order 1 for the max. Let's at least keep it there. > > 1 is the current value. Reducing it to zero seems to fix the kswapd > induced hangs. The problem does look to be some shrinker/allocator > interference somewhere in vmscan.c, but the fact is that it's triggered > by SLUB and not SLAB. I really think that what's happening is some type > of feedback loop where one of the shrinkers is issuing a > wakeup_kswapd() so kswapd never sleeps (and never relinquishes the CPU > on non-preempt). The current value is PAGE_ALLOC_COSTLY_ORDER which is 3. > > We have been using SLUB for a long time. Why is this issue arising now? > > Due to compaction etc making reclaim less efficient? > > This is the snark argument (I've said it thrice the bellman cried and > what I tell you three times is true). The fact is that no enterprise > distribution at all uses SLUB. It's only recently that the desktop > distributions started to ... the bugs are showing up under FC15 beta, > which is the first fedora distribution to enable it. I'd say we're only > just beginning widespread SLUB testing. Debian and Ubuntu have been using SLUB for a long time (and AFAICT from my archives so has Fedora). 
I have been running those here for a couple of years and the issues that I see here seem to be only with the most recent kernels that now do compaction and other reclaim tricks.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 15:27 ` Christoph Lameter @ 2011-05-12 15:43 ` James Bottomley 2011-05-12 15:46 ` Dave Jones ` (2 more replies) 2011-05-12 15:45 ` Dave Jones 1 sibling, 3 replies; 77+ messages in thread From: James Bottomley @ 2011-05-12 15:43 UTC (permalink / raw) To: Christoph Lameter Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 10:27 -0500, Christoph Lameter wrote: > On Thu, 12 May 2011, James Bottomley wrote: > > > > > */ > > > > static int slub_min_order; > > > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER; > > > > +static int slub_max_order; > > > > > > If we really need to do this then do not push this down to zero please. > > > SLAB uses order 1 for the max. Let's at least keep it there. > > > > 1 is the current value. Reducing it to zero seems to fix the kswapd > > induced hangs. The problem does look to be some shrinker/allocator > > interference somewhere in vmscan.c, but the fact is that it's triggered > > by SLUB and not SLAB. I really think that what's happening is some type > > of feedback loop where one of the shrinkers is issuing a > > wakeup_kswapd() so kswapd never sleeps (and never relinquishes the CPU > > on non-preempt). > > The current value is PAGE_ALLOC_COSTLY_ORDER which is 3. > > > > We have been using SLUB for a long time. Why is this issue arising now? > > > Due to compaction etc making reclaim less efficient? > > > > This is the snark argument (I've said it thrice the bellman cried and > > what I tell you three times is true). The fact is that no enterprise > > distribution at all uses SLUB. It's only recently that the desktop > > distributions started to ... the bugs are showing up under FC15 beta, > > which is the first fedora distribution to enable it. I'd say we're only > > just beginning widespread SLUB testing. 
> > Debian and Ubuntu have been using SLUB for a long time Only from Squeeze, which has been released for ~3 months. That doesn't qualify as a "long time" in my book. > (and AFAICT from my > archives so has Fedora). As I said above, no released fedora version uses SLUB. It's only just been enabled for the unreleased FC15; I'm testing a beta copy. > I have been running those here for a couple of > years and the issues that I see here seem to be only with the most > recent kernels that now do compaction and other reclaim tricks. but a sample of one doeth not great testing make. However, since you admit even you see problems, let's concentrate on fixing them rather than on recriminations? James
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 15:43 ` James Bottomley @ 2011-05-12 15:46 ` Dave Jones 2011-05-12 16:00 ` James Bottomley 2011-05-12 15:55 ` Pekka Enberg 2011-05-12 16:01 ` Christoph Lameter 2 siblings, 1 reply; 77+ messages in thread From: Dave Jones @ 2011-05-12 15:46 UTC (permalink / raw) To: James Bottomley Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote: > As I said above, no released fedora version uses SLUB. It's only just > been enabled for the unreleased FC15; I'm testing a beta copy. James, this isn't true. $ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64 CONFIG_SLUB_DEBUG=y CONFIG_SLUB=y (That's the oldest release I have right now, but it's been enabled even before that release). Dave
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 15:46 ` Dave Jones @ 2011-05-12 16:00 ` James Bottomley 2011-05-12 16:08 ` Dave Jones 2011-05-12 16:27 ` Christoph Lameter 0 siblings, 2 replies; 77+ messages in thread From: James Bottomley @ 2011-05-12 16:00 UTC (permalink / raw) To: Dave Jones Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 11:46 -0400, Dave Jones wrote: > On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote: > > > As I said above, no released fedora version uses SLUB. It's only just > > been enabled for the unreleased FC15; I'm testing a beta copy. > > James, this isn't true. > > $ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64 > CONFIG_SLUB_DEBUG=y > CONFIG_SLUB=y > > (That's the oldest release I have right now, but it's been enabled even > before that release). OK, I concede the point ... I haven't actually kept any of my FC machines current for a while. However, the fact remains that this seems to be a slub problem and it needs fixing. James
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 16:00 ` James Bottomley @ 2011-05-12 16:08 ` Dave Jones 2011-05-12 16:27 ` Christoph Lameter 0 siblings, 0 replies; 77+ messages in thread From: Dave Jones @ 2011-05-12 16:08 UTC (permalink / raw) To: James Bottomley Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 11:00:23AM -0500, James Bottomley wrote: > On Thu, 2011-05-12 at 11:46 -0400, Dave Jones wrote: > > On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote: > > > > > As I said above, no released fedora version uses SLUB. It's only just > > > been enabled for the unreleased FC15; I'm testing a beta copy. > > > > James, this isn't true. > > > > $ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64 > > CONFIG_SLUB_DEBUG=y > > CONFIG_SLUB=y > > > > (That's the oldest release I have right now, but it's been enabled even > > before that release). > > OK, I concede the point ... I haven't actually kept any of my FC > machines current for a while. 'a while' is an understatement :) It was first enabled in Fedora 8 in 2007. Dave
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 16:00 ` James Bottomley 2011-05-12 16:08 ` Dave Jones @ 2011-05-12 16:27 ` Christoph Lameter 2011-05-12 16:30 ` James Bottomley 2011-05-12 17:40 ` Andrea Arcangeli 1 sibling, 2 replies; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 16:27 UTC (permalink / raw) To: James Bottomley Cc: Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011, James Bottomley wrote: > However, the fact remains that this seems to be a slub problem and it > needs fixing. Why are you so fixed on slub in these matters? It's a key component, but there is a high interaction with other subsystems. There was no recent change in slub that changed the order of allocations. There were changes affecting the reclaim logic. Slub has been working just fine with the existing allocation schemes for a long time.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 16:27 ` Christoph Lameter @ 2011-05-12 16:30 ` James Bottomley 2011-05-12 16:48 ` Christoph Lameter 2011-05-12 17:06 ` Pekka Enberg 2011-05-12 17:40 ` Andrea Arcangeli 1 sibling, 2 replies; 77+ messages in thread From: James Bottomley @ 2011-05-12 16:30 UTC (permalink / raw) To: Christoph Lameter Cc: Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 11:27 -0500, Christoph Lameter wrote: > On Thu, 12 May 2011, James Bottomley wrote: > > > However, the fact remains that this seems to be a slub problem and it > > needs fixing. > > Why are you so fixed on slub in these matters? Because, as has been hashed out in the thread, changing SLUB to SLAB makes the hang go away. > It's a key component but > there is a high interaction with other subsystems. There was no recent > change in slub that changed the order of allocations. There were changes > affecting the reclaim logic. Slub has been working just fine with the > existing allocation schemes for a long time. So suggest an alternative root cause and a test to expose it. James
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 16:30 ` James Bottomley @ 2011-05-12 16:48 ` Christoph Lameter 2011-05-12 17:46 ` Andrea Arcangeli 2011-05-12 17:06 ` Pekka Enberg 1 sibling, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 16:48 UTC (permalink / raw) To: James Bottomley Cc: Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011, James Bottomley wrote: > On Thu, 2011-05-12 at 11:27 -0500, Christoph Lameter wrote: > > On Thu, 12 May 2011, James Bottomley wrote: > > > > > However, the fact remains that this seems to be a slub problem and it > > > needs fixing. > > > > Why are you so fixed on slub in these matters? > > Because, as has been hashed out in the thread, changing SLUB to SLAB > makes the hang go away. SLUB doesn't hang here with earlier kernel versions either. So the higher allocations are no longer as effective as they were before. This is due to a change in another subsystem. > > It's a key component but > > there is a high interaction with other subsystems. There was no recent > > change in slub that changed the order of allocations. There were changes > > affecting the reclaim logic. Slub has been working just fine with the > > existing allocation schemes for a long time. > > So suggest an alternative root cause and a test to expose it. Have a look at my other emails? I am just repeating myself again it seems. Try order = 1 which gives you SLAB like interaction with the page allocator. Then we at least know that it is the order 2 and 3 allocs that are the problem and not something else.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 16:48 ` Christoph Lameter @ 2011-05-12 17:46 ` Andrea Arcangeli 2011-05-12 18:00 ` Christoph Lameter 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2011-05-12 17:46 UTC (permalink / raw) To: Christoph Lameter Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 11:48:19AM -0500, Christoph Lameter wrote: > Try order = 1 which gives you SLAB like interaction with the page > allocator. Then we at least know that it is the order 2 and 3 allocs that > are the problem and not something else. order 1 should work better, because it's less likely we end up here (which leaves RECLAIM_MODE_LUMPYRECLAIM on and then see what happens at the top of page_check_references()) else if (sc->order && priority < DEF_PRIORITY - 2) sc->reclaim_mode |= syncmode; with order 1 more likely we end up here as enough pages are freed for order 1 and we're safe: else sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC; None of these issues should materialize with COMPACTION=n. Even __GFP_WAIT can be left enabled to run compaction without expecting adverse behavior, but running compaction may still not be worth it for small systems where the benefit of having order 1/2/3 allocation may not outweigh the cost of compaction itself.
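[Editor's note: the two branches Andrea quotes come from set_reclaim_mode() in mm/vmscan.c of that era. A stand-alone sketch of the decision he describes — flag values, DEF_PRIORITY, and the helper name here are illustrative, not the kernel's actual definitions — looks like:]

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12 /* matches mm/vmscan.c of the time */

/* Illustrative bit values for the kernel's reclaim_mode_t flags */
#define RECLAIM_MODE_SINGLE        0x01u
#define RECLAIM_MODE_ASYNC         0x02u
#define RECLAIM_MODE_SYNC          0x04u
#define RECLAIM_MODE_LUMPYRECLAIM  0x08u

/* Simplified model of set_reclaim_mode() on a COMPACTION=n kernel:
 * lumpy reclaim only engages for order > 0 allocations once the
 * priority has dropped below DEF_PRIORITY - 2, i.e. after two
 * reclaim passes failed to free nr_to_reclaim pages. */
static unsigned int reclaim_mode(int order, int priority, bool sync)
{
    unsigned int syncmode = RECLAIM_MODE_LUMPYRECLAIM |
                            (sync ? RECLAIM_MODE_SYNC : RECLAIM_MODE_ASYNC);

    if (order && priority < DEF_PRIORITY - 2)
        return syncmode;
    return RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
}
```

This makes Andrea's point concrete: an order-0 (or quickly satisfied order-1) allocation never takes the lumpy branch, while an order-3 SLUB allocation that keeps failing eventually does.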
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 17:46 ` Andrea Arcangeli @ 2011-05-12 18:00 ` Christoph Lameter 2011-05-12 18:18 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 18:00 UTC (permalink / raw) To: Andrea Arcangeli Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011, Andrea Arcangeli wrote: > order 1 should work better, because it's less likely we end up here > (which leaves RECLAIM_MODE_LUMPYRECLAIM on and then see what happens > at the top of page_check_references()) > > else if (sc->order && priority < DEF_PRIORITY - 2) Why is this DEF_PRIORITY - 2? Shouldn't it be DEF_PRIORITY? An accommodation for SLAB order 1 allocs? May I assume that the case of order 2 and 3 allocs was not very well tested after the changes to introduce compaction, since people were focusing on RHEL testing?
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 18:00 ` Christoph Lameter @ 2011-05-12 18:18 ` Andrea Arcangeli 0 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2011-05-12 18:18 UTC (permalink / raw) To: Christoph Lameter Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 01:00:10PM -0500, Christoph Lameter wrote: > On Thu, 12 May 2011, Andrea Arcangeli wrote: > > > order 1 should work better, because it's less likely we end up here > > (which leaves RECLAIM_MODE_LUMPYRECLAIM on and then see what happens > > at the top of page_check_references()) > > > > else if (sc->order && priority < DEF_PRIORITY - 2) > > Why is this DEF_PRIORITY - 2? Shouldnt it be DEF_PRIORITY? An accomodation > for SLAB order 1 allocs? That's to allow a few loops of the shrinker (i.e. not take down everything in the way regardless of any aging information in pte/page if there's no memory pressure). This "- 2" is independent of the allocation order. If it was < DEF_PRIORITY it'd trigger lumpy already at the second loop (in do_try_to_free_pages). So it'd make things worse. Like it'd make things worse decreasing the PAGE_ALLOC_COSTLY_ORDER define to 2 and keeping slub at 3. > May I assume that the case of order 2 and 3 allocs in that case was not > very well tested after the changes to introduce compaction since people > were focusing on RHEL testing? Not really, I had to eliminate lumpy before compaction was developed. RHEL6 has zero lumpy code (not even at compile time) and compaction enabled by default, so even if we enabled SLUB=y it should work ok (not sure why James still crashes with patch 2 applied that clears __GFP_WAIT, that crash likely has nothing to do with compaction or lumpy as both are off with __GFP_WAIT not set). 
Lumpy is also eliminated upstream now (but only at runtime when COMPACTION=y), unless __GFP_REPEAT is set, in which case I think lumpy will still work upstream too, but only a few infrequent things, like increasing nr_hugepages, use that.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 16:30 ` James Bottomley 2011-05-12 16:48 ` Christoph Lameter @ 2011-05-12 17:06 ` Pekka Enberg 2011-05-12 17:11 ` Pekka Enberg 1 sibling, 1 reply; 77+ messages in thread From: Pekka Enberg @ 2011-05-12 17:06 UTC (permalink / raw) To: James Bottomley Cc: Christoph Lameter, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 7:30 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > So suggest an alternative root cause and a test to expose it. Is your .config available somewhere, btw?
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 17:06 ` Pekka Enberg @ 2011-05-12 17:11 ` Pekka Enberg 2011-05-12 17:38 ` Christoph Lameter ` (2 more replies) 0 siblings, 3 replies; 77+ messages in thread From: Pekka Enberg @ 2011-05-12 17:11 UTC (permalink / raw) To: James Bottomley Cc: Christoph Lameter, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Andrea Arcangeli On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote: > On Thu, May 12, 2011 at 7:30 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: >> So suggest an alternative root cause and a test to expose it. > > Is your .config available somewhere, btw? If it's this: http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD I'd love to see what happens if you disable CONFIG_TRANSPARENT_HUGEPAGE=y because that's going to reduce high order allocations as well, no? Pekka
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 17:11 ` Pekka Enberg @ 2011-05-12 17:38 ` Christoph Lameter 2011-05-12 18:00 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 17:38 UTC (permalink / raw) To: Pekka Enberg Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Andrea Arcangeli On Thu, 12 May 2011, Pekka Enberg wrote: > On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote: > > On Thu, May 12, 2011 at 7:30 PM, James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > >> So suggest an alternative root cause and a test to expose it. > > > > Is your .config available somewhere, btw? > > If it's this: > > http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD > > I'd love to see what happens if you disable > > CONFIG_TRANSPARENT_HUGEPAGE=y > > because that's going to reduce high order allocations as well, no? I don't think that will change much since huge pages are at MAX_ORDER size. Either you can get them or not. The challenge with the small order allocations is that they require contiguous memory. Compaction is likely not as effective as the prior mechanism that did opportunistic reclaim of neighboring pages.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 17:38 ` Christoph Lameter @ 2011-05-12 18:00 ` Andrea Arcangeli 2011-05-13 9:49 ` Mel Gorman 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2011-05-12 18:00 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, James Bottomley, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 12:38:34PM -0500, Christoph Lameter wrote: > I dont think that will change much since huge pages are at MAX_ORDER size. > Either you can get them or not. The challenge with the small order > allocations is that they require contiguous memory. Compaction is likely > not as effective as the prior mechanism that did opportunistic reclaim of > neighboring pages. THP requires contiguous pages too, the issue is similar, and worse with THP, but THP enables compaction by default, so likely this only happens with compaction off. We've really got to differentiate between compaction on and off, it makes a world of difference (a THP-enabled kernel with compaction off also runs into swap storms and temporary hangs all the time; it's probably the same issue as SLUB=y COMPACTION=n). At least THP didn't activate kswapd; kswapd running lumpy too makes things worse as it'll probably keep running in the background after the direct reclaim fails. The original reports talk about kernels with SLUB=y and COMPACTION=n. Not sure if anybody is having trouble with SLUB=y COMPACTION=y... Compaction is more effective than the prior mechanism too (the prior mechanism is lumpy reclaim) and it doesn't cause VM disruptions that ignore all referenced information and take down anything they find in the way. I think when COMPACTION=n, lumpy either should go away, or only be activated by __GFP_REPEAT so that only hugetlbfs makes use of it.
Increasing nr_hugepages is ok to halt the system for a while but when all allocations are doing that, system becomes unusable, kind of livelocked. BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.
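[Editor's note: Andrea's suggestion is that SLUB's speculative high-order attempt should mask off every flag that could drag it into reclaim loops. A hedged sketch of such masking, modeled on the shape of SLUB's allocate_slab() gfp handling — the bit values and helper name here are illustrative, not the kernel's:]

```c
#include <assert.h>

/* Hypothetical gfp bit values, for illustration only */
#define __GFP_WAIT    0x10u
#define __GFP_NOWARN  0x200u
#define __GFP_REPEAT  0x400u
#define __GFP_NOFAIL  0x800u
#define __GFP_NORETRY 0x1000u

/* For the speculative high-order attempt: drop every flag that could
 * push the allocation into reclaim/compaction or retry loops, and add
 * NORETRY|NOWARN so a failure falls back quietly to order-0. */
static unsigned int speculative_gfp(unsigned int flags)
{
    flags &= ~(__GFP_WAIT | __GFP_NOFAIL | __GFP_REPEAT);
    return flags | __GFP_NORETRY | __GFP_NOWARN;
}
```

The caller would use speculative_gfp() only for the first, optimistic high-order allocation; the order-0 fallback keeps the caller's original flags.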
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 18:00 ` Andrea Arcangeli @ 2011-05-13 9:49 ` Mel Gorman 2011-05-15 16:39 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Mel Gorman @ 2011-05-13 9:49 UTC (permalink / raw) To: Andrea Arcangeli Cc: Christoph Lameter, Pekka Enberg, James Bottomley, Dave Jones, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote: > <SNIP> > > BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too > (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not > with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless > lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL. This is in V2 (unreleased, testing in progress and was running overnight). I noticed that clearing __GFP_REPEAT is required for reclaim/compaction if direct reclaimers from SLUB are to return false in should_continue_reclaim() and bail out from high-order allocation properly. As it is, there is a possibility for slub high-order direct reclaimers to loop in reclaim/compaction for a long time. This is only important when CONFIG_COMPACTION=y. -- Mel Gorman SUSE Labs
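[Editor's note: the bail-out Mel describes lives in should_continue_reclaim() in mm/vmscan.c. A simplified model of the __GFP_REPEAT gating he is after — the flag value and function name are illustrative, and this compresses the real function's several checks into one:]

```c
#include <stdbool.h>
#include <assert.h>

#define __GFP_REPEAT 0x400u /* illustrative value only */

/* Simplified model: with __GFP_REPEAT set, reclaim/compaction keeps
 * looping as long as the last pass scanned or reclaimed anything at
 * all; without it, the loop gives up as soon as a pass reclaims
 * nothing. Clearing __GFP_REPEAT for SLUB's speculative high-order
 * attempt therefore lets it fall back to order-0 quickly instead of
 * looping in reclaim/compaction. */
static bool keep_reclaiming(unsigned int gfp_mask,
                            unsigned long nr_reclaimed,
                            unsigned long nr_scanned)
{
    if (gfp_mask & __GFP_REPEAT)
        return nr_reclaimed || nr_scanned; /* be stubborn */
    return nr_reclaimed != 0;              /* give up on no progress */
}
```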
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-13 9:49 ` Mel Gorman @ 2011-05-15 16:39 ` Andrea Arcangeli 2011-05-16 8:42 ` Mel Gorman 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2011-05-15 16:39 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Lameter, Pekka Enberg, James Bottomley, Dave Jones, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, May 13, 2011 at 10:49:58AM +0100, Mel Gorman wrote: > On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote: > > <SNIP> > > > > BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too > > (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not > > with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless > > lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL. > > This is in V2 (unreleased, testing in progress and was running > overnight). I noticed that clearing __GFP_REPEAT is required for > reclaim/compaction if direct reclaimers from SLUB are to return false in > should_continue_reclaim() and bail out from high-order allocation > properly. As it is, there is a possibility for slub high-order direct > reclaimers to loop in reclaim/compaction for a long time. This is > only important when CONFIG_COMPACTION=y. Agreed. However I don't expect anyone to allocate from slub(/slab) with __GFP_REPEAT so it's probably only theoretical but more correct indeed ;).
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-15 16:39 ` Andrea Arcangeli @ 2011-05-16 8:42 ` Mel Gorman 0 siblings, 0 replies; 77+ messages in thread From: Mel Gorman @ 2011-05-16 8:42 UTC (permalink / raw) To: Andrea Arcangeli Cc: Christoph Lameter, Pekka Enberg, James Bottomley, Dave Jones, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Sun, May 15, 2011 at 06:39:06PM +0200, Andrea Arcangeli wrote: > On Fri, May 13, 2011 at 10:49:58AM +0100, Mel Gorman wrote: > > On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote: > > > <SNIP> > > > > > > BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too > > > (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not > > > with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless > > > lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL. > > > > This is in V2 (unreleased, testing in progress and was running > > overnight). I noticed that clearing __GFP_REPEAT is required for > > reclaim/compaction if direct reclaimers from SLUB are to return false in > > should_continue_reclaim() and bail out from high-order allocation > > properly. As it is, there is a possibility for slub high-order direct > > reclaimers to loop in reclaim/compaction for a long time. This is > > only important when CONFIG_COMPACTION=y. > > Agreed. However I don't expect anyone to allocate from slub(/slab) > with __GFP_REPEAT so it's probably only theoretical but more correct > indeed ;). Networking layer does specify __GFP_REPEAT. -- Mel Gorman SUSE Labs
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 17:11 ` Pekka Enberg 2011-05-12 17:38 ` Christoph Lameter @ 2011-05-12 17:51 ` Andrea Arcangeli 2011-05-12 18:03 ` Christoph Lameter 2011-05-12 18:36 ` James Bottomley 1 sibling, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2011-05-12 17:51 UTC (permalink / raw) To: Pekka Enberg Cc: James Bottomley, Christoph Lameter, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 08:11:05PM +0300, Pekka Enberg wrote: > If it's this: > > http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD > > I'd love to see what happens if you disable > > CONFIG_TRANSPARENT_HUGEPAGE=y > > because that's going to reduce high order allocations as well, no? Well THP forces COMPACTION=y so lumpy won't risk being activated. I once got a complaint asking not to make THP force COMPACTION=y (there is no real dependency here, THP will just call alloc_pages with __GFP_NO_KSWAPD and order 9, or 10 on x86-nopae), but I preferred to keep it forced exactly to avoid issues like these when THP is on. If even order 3 is causing troubles (which doesn't immediately make lumpy activated, it only activates when priority is < DEF_PRIORITY-2, so after 2 loops failing to reclaim nr_to_reclaim pages), imagine what was happening at order 9 every time firefox, gcc and mutt allocated memory ;).
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 17:51 ` Andrea Arcangeli @ 2011-05-12 18:03 ` Christoph Lameter 2011-05-12 18:09 ` Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 18:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: Pekka Enberg, James Bottomley, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011, Andrea Arcangeli wrote: > even order 3 is causing troubles (which doesn't immediately make lumpy > activated, it only activates when priority is < DEF_PRIORITY-2, so > after 2 loops failing to reclaim nr_to_reclaim pages), imagine what That is a significant change for SLUB with the merge of the compaction code.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 18:03 ` Christoph Lameter @ 2011-05-12 18:09 ` Andrea Arcangeli 2011-05-12 18:16 ` Christoph Lameter 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2011-05-12 18:09 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, James Bottomley, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 01:03:05PM -0500, Christoph Lameter wrote: > On Thu, 12 May 2011, Andrea Arcangeli wrote: > > > even order 3 is causing troubles (which doesn't immediately make lumpy > > activated, it only activates when priority is < DEF_PRIORITY-2, so > > after 2 loops failing to reclaim nr_to_reclaim pages), imagine what > > That is a significant change for SLUB with the merge of the compaction > code. Even before compaction was posted, I had to shut off lumpy reclaim or it'd hang all the time with frequent order 9 allocations. Maybe lumpy was better before, maybe lumpy "improved" its reliability recently, but definitely it wasn't performing well. That definitely applies to >=2.6.32 (I had to nuke lumpy from it, and only keep compaction enabled, pretty much like upstream with COMPACTION=y). I think I never tried earlier lumpy code than 2.6.32, maybe it was less aggressive back then, I don't exclude it, but I thought the whole notion of lumpy was to take down everything in the way, which usually leads to processes hanging in swapins or pageins for frequently used memory.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 18:09 ` Andrea Arcangeli @ 2011-05-12 18:16 ` Christoph Lameter 0 siblings, 0 replies; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 18:16 UTC (permalink / raw) To: Andrea Arcangeli Cc: Pekka Enberg, James Bottomley, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011, Andrea Arcangeli wrote: > On Thu, May 12, 2011 at 01:03:05PM -0500, Christoph Lameter wrote: > > On Thu, 12 May 2011, Andrea Arcangeli wrote: > > > > > even order 3 is causing troubles (which doesn't immediately make lumpy > > > activated, it only activates when priority is < DEF_PRIORITY-2, so > > > after 2 loops failing to reclaim nr_to_reclaim pages), imagine what > > > > That is a significant change for SLUB with the merge of the compaction > > code. > > Even before compaction was posted, I had to shut off lumpy reclaim or > it'd hang all the time with frequent order 9 allocations. Maybe lumpy > was better before, maybe lumpy "improved" its reliability recently, Well we are concerned about order 2 and 3 allocs here. Checking for < PAGE_ALLOC_COSTLY_ORDER to avoid the order 9 lumpy reclaim looks okay.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 17:11 ` Pekka Enberg 2011-05-12 17:38 ` Christoph Lameter 2011-05-12 17:51 ` Andrea Arcangeli @ 2011-05-12 18:36 ` James Bottomley 2 siblings, 0 replies; 77+ messages in thread From: James Bottomley @ 2011-05-12 18:36 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Andrea Arcangeli On Thu, 2011-05-12 at 20:11 +0300, Pekka Enberg wrote: > On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote: > > On Thu, May 12, 2011 at 7:30 PM, James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > >> So suggest an alternative root cause and a test to expose it. > > > > Is your .config available somewhere, btw? > > If it's this: > > http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD > > I'd love to see what happens if you disable > > CONFIG_TRANSPARENT_HUGEPAGE=y > > because that's going to reduce high order allocations as well, no? So yes, it's a default FC15 config. Disabling THP was initially tried a long time ago and didn't make a difference (it was originally suggested by Chris Mason). James
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 16:27 ` Christoph Lameter 2011-05-12 16:30 ` James Bottomley @ 2011-05-12 17:40 ` Andrea Arcangeli 1 sibling, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2011-05-12 17:40 UTC (permalink / raw) To: Christoph Lameter Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 11:27:04AM -0500, Christoph Lameter wrote: > On Thu, 12 May 2011, James Bottomley wrote: > > > However, the fact remains that this seems to be a slub problem and it > > needs fixing. > > Why are you so fixed on slub in these matters? It's a key component but > there is a high interaction with other subsystems. There was no recent > change in slub that changed the order of allocations. There were changes > affecting the reclaim logic. Slub has been working just fine with the > existing allocation schemes for a long time. It should work just fine when compaction is enabled. The COMPACTION=n case would also work decently if we eliminate the lumpy reclaim. Lumpy reclaim tells the VM to ignore all young bits in the pagetables and take everything down in order to generate the order 3 page that SLUB asks for. You can't expect decent behavior the moment you take everything down regardless of referenced bits on the page and young bits in the pte. I doubt it's a new issue, but lumpy may have become more or less aggressive over time. Good thing, lumpy is eliminated (basically at runtime, not compile time) by enabling compaction.
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 15:43 ` James Bottomley 2011-05-12 15:46 ` Dave Jones @ 2011-05-12 15:55 ` Pekka Enberg 2011-05-12 18:37 ` James Bottomley 2011-05-12 16:01 ` Christoph Lameter 2 siblings, 1 reply; 77+ messages in thread From: Pekka Enberg @ 2011-05-12 15:55 UTC (permalink / raw) To: James Bottomley Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote: > However, since you admit even you see problems, let's concentrate on > fixing them rather than recriminations? Yes, please. So does dropping max_order to 1 help? PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 15:55 ` Pekka Enberg @ 2011-05-12 18:37 ` James Bottomley 2011-05-12 18:46 ` Christoph Lameter 2011-05-12 19:44 ` James Bottomley 0 siblings, 2 replies; 77+ messages in thread From: James Bottomley @ 2011-05-12 18:37 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote: > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote: > > However, since you admit even you see problems, let's concentrate on > > fixing them rather than recriminations? > > Yes, please. So does dropping max_order to 1 help? > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7. Just booting with max_slab_order=1 (and none of the other patches applied) I can still get the machine to go into kswapd at 99%, so it doesn't seem to make much of a difference. Do you want me to try with the other two patches and max_slab_order=1? James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 18:37 ` James Bottomley @ 2011-05-12 18:46 ` Christoph Lameter 2011-05-12 19:21 ` James Bottomley 2011-05-12 19:44 ` James Bottomley 1 sibling, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 18:46 UTC (permalink / raw) To: James Bottomley Cc: Pekka Enberg, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011, James Bottomley wrote: > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote: > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote: > > > However, since you admit even you see problems, let's concentrate on > > > fixing them rather than recriminations? > > > > Yes, please. So does dropping max_order to 1 help? > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7. > > Just booting with max_slab_order=1 (and none of the other patches > applied) I can still get the machine to go into kswapd at 99%, so it > doesn't seem to make much of a difference. slub_max_order=1 right? Not max_slab_order. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 18:46 ` Christoph Lameter @ 2011-05-12 19:21 ` James Bottomley 0 siblings, 0 replies; 77+ messages in thread From: James Bottomley @ 2011-05-12 19:21 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 13:46 -0500, Christoph Lameter wrote: > On Thu, 12 May 2011, James Bottomley wrote: > > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote: > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote: > > > > However, since you admit even you see problems, let's concentrate on > > > > fixing them rather than recriminations? > > > > > > Yes, please. So does dropping max_order to 1 help? > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7. > > > > Just booting with max_slab_order=1 (and none of the other patches > > applied) I can still get the machine to go into kswapd at 99%, so it > > doesn't seem to make much of a difference. > > slub_max_order=1 right? Not max_slab_order. Yes. James ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 18:37 ` James Bottomley 2011-05-12 18:46 ` Christoph Lameter @ 2011-05-12 19:44 ` James Bottomley 2011-05-12 20:04 ` James Bottomley 1 sibling, 1 reply; 77+ messages in thread From: James Bottomley @ 2011-05-12 19:44 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote: > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote: > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote: > > > However, since you admit even you see problems, let's concentrate on > > > fixing them rather than recriminations? > > > > Yes, please. So does dropping max_order to 1 help? > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7. > > Just booting with max_slab_order=1 (and none of the other patches > applied) I can still get the machine to go into kswapd at 99%, so it > doesn't seem to make much of a difference. > > Do you want me to try with the other two patches and max_slab_order=1? OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to trigger the problem (kswapd spinning at 99%). This is still with PREEMPT; it's possible that non-PREEMPT might be better, so I'll try patches 1+2+3 with PREEMPT just to see if the perturbation is caused by it. James ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 19:44 ` James Bottomley @ 2011-05-12 20:04 ` James Bottomley 2011-05-12 20:29 ` Johannes Weiner ` (2 more replies) 0 siblings, 3 replies; 77+ messages in thread From: James Bottomley @ 2011-05-12 20:04 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote: > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote: > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote: > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote: > > > > However, since you admit even you see problems, let's concentrate on > > > > fixing them rather than recriminations? > > > > > > Yes, please. So does dropping max_order to 1 help? > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7. > > > > Just booting with max_slab_order=1 (and none of the other patches > > applied) I can still get the machine to go into kswapd at 99%, so it > > doesn't seem to make much of a difference. > > > > Do you want me to try with the other two patches and max_slab_order=1? > > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to > trigger the problem (kswapd spinning at 99%). This is still with > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by > it. Confirmed, I'm afraid ... I can trigger the problem with all three patches under PREEMPT. It's not a hang this time, it's just kswapd taking 100% system time on 1 CPU and it won't calm down after I unload the system. James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
^ permalink raw reply	[flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 20:04 ` James Bottomley @ 2011-05-12 20:29 ` Johannes Weiner 2011-05-12 20:31 ` Johannes Weiner 2011-05-12 20:31 ` James Bottomley 2011-05-12 22:04 ` James Bottomley 2011-05-13 6:16 ` Pekka Enberg 2 siblings, 2 replies; 77+ messages in thread From: Johannes Weiner @ 2011-05-12 20:29 UTC (permalink / raw) To: James Bottomley Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote: > On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote: > > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote: > > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote: > > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote: > > > > > However, since you admit even you see problems, let's concentrate on > > > > > fixing them rather than recriminations? > > > > > > > > Yes, please. So does dropping max_order to 1 help? > > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7. > > > > > > Just booting with max_slab_order=1 (and none of the other patches > > > applied) I can still get the machine to go into kswapd at 99%, so it > > > doesn't seem to make much of a difference. > > > > > > Do you want me to try with the other two patches and max_slab_order=1? > > > > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to > > trigger the problem (kswapd spinning at 99%). This is still with > > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try > > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by > > it. > > Confirmed, I'm afraid ... I can trigger the problem with all three > patches under PREEMPT. It's not a hang this time, it's just kswapd > taking 100% system time on 1 CPU and it won't calm down after I unload > the system. That is kind of expected, though. 
If one CPU is busy with a streaming IO load generating new pages, kswapd is busy reclaiming the old ones so that the generator does not have to do the reclaim itself. By unload, do you mean stopping the generator? And if so, how quickly after you stop the generator does kswapd go back to sleep? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 20:29 ` Johannes Weiner @ 2011-05-12 20:31 ` Johannes Weiner 2011-05-12 20:31 ` James Bottomley 1 sibling, 0 replies; 77+ messages in thread From: Johannes Weiner @ 2011-05-12 20:31 UTC (permalink / raw) To: James Bottomley Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 10:29:17PM +0200, Johannes Weiner wrote: > On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote: > > On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote: > > > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote: > > > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote: > > > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote: > > > > > > However, since you admit even you see problems, let's concentrate on > > > > > > fixing them rather than recriminations? > > > > > > > > > > Yes, please. So does dropping max_order to 1 help? > > > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7. > > > > > > > > Just booting with max_slab_order=1 (and none of the other patches > > > > applied) I can still get the machine to go into kswapd at 99%, so it > > > > doesn't seem to make much of a difference. > > > > > > > > Do you want me to try with the other two patches and max_slab_order=1? > > > > > > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to > > > trigger the problem (kswapd spinning at 99%). This is still with > > > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try > > > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by > > > it. > > > > Confirmed, I'm afraid ... I can trigger the problem with all three > > patches under PREEMPT. It's not a hang this time, it's just kswapd > > taking 100% system time on 1 CPU and it won't calm down after I unload > > the system. 
I am so sorry, I missed the "won't" here. Please ignore. > That is kind of expected, though. If one CPU is busy with a streaming > IO load generating new pages, kswapd is busy reclaiming the old ones > so that the generator does not have to do the reclaim itself. > > By unload, do you mean stopping the generator? And if so, how quickly > after you stop the generator does kswapd go back to sleep? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 20:29 ` Johannes Weiner 2011-05-12 20:31 ` Johannes Weiner @ 2011-05-12 20:31 ` James Bottomley 1 sibling, 0 replies; 77+ messages in thread From: James Bottomley @ 2011-05-12 20:31 UTC (permalink / raw) To: Johannes Weiner Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 22:29 +0200, Johannes Weiner wrote: > On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote: > > On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote: > > > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote: > > > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote: > > > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote: > > > > > > However, since you admit even you see problems, let's concentrate on > > > > > > fixing them rather than recriminations? > > > > > > > > > > Yes, please. So does dropping max_order to 1 help? > > > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7. > > > > > > > > Just booting with max_slab_order=1 (and none of the other patches > > > > applied) I can still get the machine to go into kswapd at 99%, so it > > > > doesn't seem to make much of a difference. > > > > > > > > Do you want me to try with the other two patches and max_slab_order=1? > > > > > > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to > > > trigger the problem (kswapd spinning at 99%). This is still with > > > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try > > > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by > > > it. > > > > Confirmed, I'm afraid ... I can trigger the problem with all three > > patches under PREEMPT. It's not a hang this time, it's just kswapd > > taking 100% system time on 1 CPU and it won't calm down after I unload > > the system. 
> > That is kind of expected, though. If one CPU is busy with a streaming > IO load generating new pages, kswapd is busy reclaiming the old ones > so that the generator does not have to do the reclaim itself. > > By unload, do you mean stopping the generator? Correct. > And if so, how quickly > after you stop the generator does kswapd go back to sleep? It doesn't. At least not on its own; the CPU stays pegged. If I start other work (like a kernel compile), then sometimes it does go back to nothing. I'm speculating that this is the hang case for non-PREEMPT. James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 20:04 ` James Bottomley 2011-05-12 20:29 ` Johannes Weiner @ 2011-05-12 22:04 ` James Bottomley 2011-05-12 22:15 ` Johannes Weiner 2011-05-13 6:16 ` Pekka Enberg 2 siblings, 1 reply; 77+ messages in thread From: James Bottomley @ 2011-05-12 22:04 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote: > Confirmed, I'm afraid ... I can trigger the problem with all three > patches under PREEMPT. It's not a hang this time, it's just kswapd > taking 100% system time on 1 CPU and it won't calm down after I unload > the system. Just on a "if you don't know what's wrong poke about and see" basis, I sliced out all the complex logic in sleeping_prematurely() and, as far as I can tell, it cures the problem behaviour. I've loaded up the system, and taken the tar load generator through three runs without producing a spinning kswapd (this is PREEMPT). I'll try with a non-PREEMPT kernel shortly. What this seems to say is that there's a problem with the complex logic in sleeping_prematurely(). I'm pretty sure hacking up sleeping_prematurely() just to dump all the calculations is the wrong thing to do, but perhaps someone can see what the right thing is ... By the way, I stripped off all the patches, so this is a plain old 2.6.38.6 kernel with the default FC15 config. 
James

---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0665520..1bdea7d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2255,6 +2255,8 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	if (remaining)
 		return true;
 
+	return false;
+
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;

^ permalink raw reply related	[flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 22:04 ` James Bottomley @ 2011-05-12 22:15 ` Johannes Weiner 2011-05-12 22:58 ` Minchan Kim ` (2 more replies) 0 siblings, 3 replies; 77+ messages in thread From: Johannes Weiner @ 2011-05-12 22:15 UTC (permalink / raw) To: James Bottomley Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote: > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote: > > Confirmed, I'm afraid ... I can trigger the problem with all three > > patches under PREEMPT. It's not a hang this time, it's just kswapd > > taking 100% system time on 1 CPU and it won't calm down after I unload > > the system. > > Just on a "if you don't know what's wrong poke about and see" basis, I > sliced out all the complex logic in sleeping_prematurely() and, as far > as I can tell, it cures the problem behaviour. I've loaded up the > system, and taken the tar load generator through three runs without > producing a spinning kswapd (this is PREEMPT). I'll try with a > non-PREEMPT kernel shortly. > > What this seems to say is that there's a problem with the complex logic > in sleeping_prematurely(). I'm pretty sure hacking up > sleeping_prematurely() just to dump all the calculations is the wrong > thing to do, but perhaps someone can see what the right thing is ... I think I see the problem: the boolean logic of sleeping_prematurely() is odd. If it returns true, kswapd will keep running. So if pgdat_balanced() returns true, kswapd should go to sleep. This? 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b701e0..092d773 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2261,7 +2261,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		return !pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
 		return !all_zones_ok;
 }

^ permalink raw reply related	[flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 22:15 ` Johannes Weiner @ 2011-05-12 22:58 ` Minchan Kim 2011-05-13 5:39 ` Minchan Kim 2011-05-13 0:47 ` James Bottomley 2011-05-13 10:30 ` Mel Gorman 2 siblings, 1 reply; 77+ messages in thread From: Minchan Kim @ 2011-05-12 22:58 UTC (permalink / raw) To: Johannes Weiner Cc: James Bottomley, Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, May 13, 2011 at 7:15 AM, Johannes Weiner <hannes@cmpxchg.org> wrote: > On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote: >> On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote: >> > Confirmed, I'm afraid ... I can trigger the problem with all three >> > patches under PREEMPT. It's not a hang this time, it's just kswapd >> > taking 100% system time on 1 CPU and it won't calm down after I unload >> > the system. >> >> Just on a "if you don't know what's wrong poke about and see" basis, I >> sliced out all the complex logic in sleeping_prematurely() and, as far >> as I can tell, it cures the problem behaviour. I've loaded up the >> system, and taken the tar load generator through three runs without >> producing a spinning kswapd (this is PREEMPT). I'll try with a >> non-PREEMPT kernel shortly. >> >> What this seems to say is that there's a problem with the complex logic >> in sleeping_prematurely(). I'm pretty sure hacking up >> sleeping_prematurely() just to dump all the calculations is the wrong >> thing to do, but perhaps someone can see what the right thing is ... > > I think I see the problem: the boolean logic of sleeping_prematurely() > is odd. If it returns true, kswapd will keep running. So if > pgdat_balanced() returns true, kswapd should go to sleep. > > This? Yes. Good catch. 
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b701e0..092d773 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2261,7 +2261,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  	 * must be balanced
>  	 */
>  	if (order)
> -		return pgdat_balanced(pgdat, balanced, classzone_idx);
> +		return !pgdat_balanced(pgdat, balanced, classzone_idx);
>  	else
>  		return !all_zones_ok;
>  }
> 

-- 
Kind regards, Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 22:58 ` Minchan Kim @ 2011-05-13 5:39 ` Minchan Kim 0 siblings, 0 replies; 77+ messages in thread From: Minchan Kim @ 2011-05-13 5:39 UTC (permalink / raw) To: Johannes Weiner Cc: James Bottomley, Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, May 13, 2011 at 7:58 AM, Minchan Kim <minchan.kim@gmail.com> wrote: > On Fri, May 13, 2011 at 7:15 AM, Johannes Weiner <hannes@cmpxchg.org> wrote: >> On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote: >>> On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote: >>> > Confirmed, I'm afraid ... I can trigger the problem with all three >>> > patches under PREEMPT. It's not a hang this time, it's just kswapd >>> > taking 100% system time on 1 CPU and it won't calm down after I unload >>> > the system. >>> >>> Just on a "if you don't know what's wrong poke about and see" basis, I >>> sliced out all the complex logic in sleeping_prematurely() and, as far >>> as I can tell, it cures the problem behaviour. I've loaded up the >>> system, and taken the tar load generator through three runs without >>> producing a spinning kswapd (this is PREEMPT). I'll try with a >>> non-PREEMPT kernel shortly. >>> >>> What this seems to say is that there's a problem with the complex logic >>> in sleeping_prematurely(). I'm pretty sure hacking up >>> sleeping_prematurely() just to dump all the calculations is the wrong >>> thing to do, but perhaps someone can see what the right thing is ... >> >> I think I see the problem: the boolean logic of sleeping_prematurely() >> is odd. If it returns true, kswapd will keep running. So if >> pgdat_balanced() returns true, kswapd should go to sleep. >> >> This? > > Yes. Good catch. In addition, I see some strange thing. 
The comment in pgdat_balanced says "Only zones that meet watermarks and are in
a zone allowed by the callers classzone_idx are added to balanced_pages".

It's true in case of balance_pgdat but it's not true in sleeping_prematurely.

This?

barrios@barrios-desktop:~/linux-mmotm$ git diff mm/vmscan.c
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..d9078cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2322,7 +2322,8 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 					      classzone_idx, 0))
 			all_zones_ok = false;
 		else
-			balanced += zone->present_pages;
+			if (i <= classzone_idx)
+				balanced += zone->present_pages;
 	}
 
 	/*
@@ -2331,7 +2332,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		return !pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
 		return !all_zones_ok;
 }

-- 
Kind regards, Minchan Kim

^ permalink raw reply related	[flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 22:15 ` Johannes Weiner 2011-05-12 22:58 ` Minchan Kim @ 2011-05-13 0:47 ` James Bottomley 2011-05-13 4:12 ` James Bottomley 2011-05-13 10:55 ` Mel Gorman 2011-05-13 10:30 ` Mel Gorman 2 siblings, 2 replies; 77+ messages in thread From: James Bottomley @ 2011-05-13 0:47 UTC (permalink / raw) To: Johannes Weiner Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, 2011-05-13 at 00:15 +0200, Johannes Weiner wrote: > On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote: > > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote: > > > Confirmed, I'm afraid ... I can trigger the problem with all three > > > patches under PREEMPT. It's not a hang this time, it's just kswapd > > > taking 100% system time on 1 CPU and it won't calm down after I unload > > > the system. > > > > Just on a "if you don't know what's wrong poke about and see" basis, I > > sliced out all the complex logic in sleeping_prematurely() and, as far > > as I can tell, it cures the problem behaviour. I've loaded up the > > system, and taken the tar load generator through three runs without > > producing a spinning kswapd (this is PREEMPT). I'll try with a > > non-PREEMPT kernel shortly. > > > > What this seems to say is that there's a problem with the complex logic > > in sleeping_prematurely(). I'm pretty sure hacking up > > sleeping_prematurely() just to dump all the calculations is the wrong > > thing to do, but perhaps someone can see what the right thing is ... > > I think I see the problem: the boolean logic of sleeping_prematurely() > is odd. If it returns true, kswapd will keep running. So if > pgdat_balanced() returns true, kswapd should go to sleep. > > This? I was going to say this was a winner, but on the third untar run on non-PREEMPT, I hit the kswapd livelock. 
It's got much farther than previous attempts, which all hang on the first run,
but I think the essential problem is still (at least on this machine) that
sleeping_prematurely() is doing too much work for the wakeup storm that
allocators are causing.

Something that ratelimits the amount of time we spend in the watermark
calculations, like the below (which incorporates your pgdat fix), seems to be
much more stable (I've not run it for three full runs yet, but kswapd CPU time
is way lower so far). The heuristic here is that if we're making the
calculation more than ten times in 1/10 of a second, stop and sleep anyway.

James

---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0665520..545250c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2249,12 +2249,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 {
 	int i;
 	unsigned long balanced = 0;
-	bool all_zones_ok = true;
+	bool all_zones_ok = true, ret;
+	static int returned_true = 0;
+	static unsigned long prev_jiffies = 0;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
 
+	/* rate limit our entry to the watermark calculations */
+	if (time_after(prev_jiffies + HZ/10, jiffies)) {
+		/* previously returned false, do so again */
+		if (returned_true == 0)
+			return false;
+		/* or we've done the true calculation too many times */
+		if (returned_true++ > 10)
+			return false;
+
+		return true;
+	} else {
+		/* haven't been here for a while, reset the true count */
+		returned_true = 0;
+	}
+
+	prev_jiffies = jiffies;
+
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
@@ -2286,9 +2306,16 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		ret = !pgdat_balanced(pgdat, balanced, classzone_idx);
+	else
+		ret = !all_zones_ok;
+
+	if (ret)
+		returned_true++;
 	else
-		return !all_zones_ok;
+		returned_true = 0;
+
+	return ret;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-13 0:47 ` James Bottomley @ 2011-05-13 4:12 ` James Bottomley 2011-05-13 10:55 ` Mel Gorman 1 sibling, 0 replies; 77+ messages in thread From: James Bottomley @ 2011-05-13 4:12 UTC (permalink / raw) To: Johannes Weiner Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 19:47 -0500, James Bottomley wrote: > On Fri, 2011-05-13 at 00:15 +0200, Johannes Weiner wrote: > > On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote: > > > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote: > > > > Confirmed, I'm afraid ... I can trigger the problem with all three > > > > patches under PREEMPT. It's not a hang this time, it's just kswapd > > > > taking 100% system time on 1 CPU and it won't calm down after I unload > > > > the system. > > > > > > Just on a "if you don't know what's wrong poke about and see" basis, I > > > sliced out all the complex logic in sleeping_prematurely() and, as far > > > as I can tell, it cures the problem behaviour. I've loaded up the > > > system, and taken the tar load generator through three runs without > > > producing a spinning kswapd (this is PREEMPT). I'll try with a > > > non-PREEMPT kernel shortly. > > > > > > What this seems to say is that there's a problem with the complex logic > > > in sleeping_prematurely(). I'm pretty sure hacking up > > > sleeping_prematurely() just to dump all the calculations is the wrong > > > thing to do, but perhaps someone can see what the right thing is ... > > > > I think I see the problem: the boolean logic of sleeping_prematurely() > > is odd. If it returns true, kswapd will keep running. So if > > pgdat_balanced() returns true, kswapd should go to sleep. > > > > This? 
> > I was going to say this was a winner, but on the third untar run on > non-PREEMPT, I hit the kswapd livelock. It's got much farther than > previous attempts, which all hang on the first run, but I think the > essential problem is still (at least on this machine) that > sleeping_prematurely() is doing too much work for the wakeup storm that > allocators are causing. > > Something that ratelimits the amount of time we spend in the watermark > calculations, like the below (which incorporates your pgdat fix) seems > to be much more stable (I've not run it for three full runs yet, but > kswapd CPU time is way lower so far). I've hammered it for several hours now with multiple loads; I can't seem to break it (famous last words, of course). James ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-13 0:47 ` James Bottomley 2011-05-13 4:12 ` James Bottomley @ 2011-05-13 10:55 ` Mel Gorman 2011-05-13 14:16 ` James Bottomley 1 sibling, 1 reply; 77+ messages in thread From: Mel Gorman @ 2011-05-13 10:55 UTC (permalink / raw) To: James Bottomley Cc: Johannes Weiner, Pekka Enberg, Christoph Lameter, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 07:47:05PM -0500, James Bottomley wrote: > On Fri, 2011-05-13 at 00:15 +0200, Johannes Weiner wrote: > > On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote: > > > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote: > > > > Confirmed, I'm afraid ... I can trigger the problem with all three > > > > patches under PREEMPT. It's not a hang this time, it's just kswapd > > > > taking 100% system time on 1 CPU and it won't calm down after I unload > > > > the system. > > > > > > Just on a "if you don't know what's wrong poke about and see" basis, I > > > sliced out all the complex logic in sleeping_prematurely() and, as far > > > as I can tell, it cures the problem behaviour. I've loaded up the > > > system, and taken the tar load generator through three runs without > > > producing a spinning kswapd (this is PREEMPT). I'll try with a > > > non-PREEMPT kernel shortly. > > > > > > What this seems to say is that there's a problem with the complex logic > > > in sleeping_prematurely(). I'm pretty sure hacking up > > > sleeping_prematurely() just to dump all the calculations is the wrong > > > thing to do, but perhaps someone can see what the right thing is ... > > > > I think I see the problem: the boolean logic of sleeping_prematurely() > > is odd. If it returns true, kswapd will keep running. So if > > pgdat_balanced() returns true, kswapd should go to sleep. > > > > This? 
> 
> I was going to say this was a winner, but on the third untar run on
> non-PREEMPT, I hit the kswapd livelock. It's got much farther than
> previous attempts, which all hang on the first run, but I think the
> essential problem is still (at least on this machine) that
> sleeping_prematurely() is doing too much work for the wakeup storm that
> allocators are causing.
> 
> Something that ratelimits the amount of time we spend in the watermark
> calculations, like the below (which incorporates your pgdat fix) seems
> to be much more stable (I've not run it for three full runs yet, but
> kswapd CPU time is way lower so far).
> 
> The heuristic here is that if we're making the calculation more than ten
> times in 1/10 of a second, stop and sleep anyway.
> 

Is that heuristic not basically the same as this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..4d24828 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
 
+	/* If kswapd has been running too long, just sleep */
+	if (need_resched())
+		return false;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;

-- 
Mel Gorman
SUSE Labs

-- 
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-13 10:55 ` Mel Gorman @ 2011-05-13 14:16 ` James Bottomley 0 siblings, 0 replies; 77+ messages in thread From: James Bottomley @ 2011-05-13 14:16 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, Pekka Enberg, Christoph Lameter, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, 2011-05-13 at 11:55 +0100, Mel Gorman wrote: > On Thu, May 12, 2011 at 07:47:05PM -0500, James Bottomley wrote: > > On Fri, 2011-05-13 at 00:15 +0200, Johannes Weiner wrote: > > > On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote: > > > > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote: > > > > > Confirmed, I'm afraid ... I can trigger the problem with all three > > > > > patches under PREEMPT. It's not a hang this time, it's just kswapd > > > > > taking 100% system time on 1 CPU and it won't calm down after I unload > > > > > the system. > > > > > > > > Just on a "if you don't know what's wrong poke about and see" basis, I > > > > sliced out all the complex logic in sleeping_prematurely() and, as far > > > > as I can tell, it cures the problem behaviour. I've loaded up the > > > > system, and taken the tar load generator through three runs without > > > > producing a spinning kswapd (this is PREEMPT). I'll try with a > > > > non-PREEMPT kernel shortly. > > > > > > > > What this seems to say is that there's a problem with the complex logic > > > > in sleeping_prematurely(). I'm pretty sure hacking up > > > > sleeping_prematurely() just to dump all the calculations is the wrong > > > > thing to do, but perhaps someone can see what the right thing is ... > > > > > > I think I see the problem: the boolean logic of sleeping_prematurely() > > > is odd. If it returns true, kswapd will keep running. So if > > > pgdat_balanced() returns true, kswapd should go to sleep. > > > > > > This? 
> > 
> > I was going to say this was a winner, but on the third untar run on
> > non-PREEMPT, I hit the kswapd livelock. It's got much farther than
> > previous attempts, which all hang on the first run, but I think the
> > essential problem is still (at least on this machine) that
> > sleeping_prematurely() is doing too much work for the wakeup storm that
> > allocators are causing.
> > 
> > Something that ratelimits the amount of time we spend in the watermark
> > calculations, like the below (which incorporates your pgdat fix) seems
> > to be much more stable (I've not run it for three full runs yet, but
> > kswapd CPU time is way lower so far).
> > 
> > The heuristic here is that if we're making the calculation more than ten
> > times in 1/10 of a second, stop and sleep anyway.
> > 
> 
> Is that heuristic not basically the same as this?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af24d1e..4d24828 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  	unsigned long balanced = 0;
>  	bool all_zones_ok = true;
> 
> +	/* If kswapd has been running too long, just sleep */
> +	if (need_resched())
> +		return false;

Not exactly. That should cure the problem (and I'll test it out). However, the traces show most of the work is being caused by sleeping_prematurely(). The object of my patch was actually to cut that off. Just doing a check on need_resched() will still allow us to run around that loop for hundreds of milliseconds and contribute to needless CPU time burn of kswapd; that's why I used a number of iterations and a time cutoff in my patch. If we've run around the loop 10 times tightly returning true (i.e. we can't sleep and need to rebalance) each time but the shrinkers still haven't done enough, it's time to call it quits and sleep anyway.

James

^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 22:15 ` Johannes Weiner 2011-05-12 22:58 ` Minchan Kim 2011-05-13 0:47 ` James Bottomley @ 2011-05-13 10:30 ` Mel Gorman 2 siblings, 0 replies; 77+ messages in thread From: Mel Gorman @ 2011-05-13 10:30 UTC (permalink / raw) To: Johannes Weiner Cc: James Bottomley, Pekka Enberg, Christoph Lameter, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, May 13, 2011 at 12:15:06AM +0200, Johannes Weiner wrote: > On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote: > > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote: > > > Confirmed, I'm afraid ... I can trigger the problem with all three > > > patches under PREEMPT. It's not a hang this time, it's just kswapd > > > taking 100% system time on 1 CPU and it won't calm down after I unload > > > the system. > > > > Just on a "if you don't know what's wrong poke about and see" basis, I > > sliced out all the complex logic in sleeping_prematurely() and, as far > > as I can tell, it cures the problem behaviour. I've loaded up the > > system, and taken the tar load generator through three runs without > > producing a spinning kswapd (this is PREEMPT). I'll try with a > > non-PREEMPT kernel shortly. > > > > What this seems to say is that there's a problem with the complex logic > > in sleeping_prematurely(). I'm pretty sure hacking up > > sleeping_prematurely() just to dump all the calculations is the wrong > > thing to do, but perhaps someone can see what the right thing is ... > > I think I see the problem: the boolean logic of sleeping_prematurely() > is odd. If it returns true, kswapd will keep running. So if > pgdat_balanced() returns true, kswapd should go to sleep. > > This? > You're right. 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b701e0..092d773 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2261,7 +2261,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  	 * must be balanced
>  	 */
>  	if (order)
> -		return pgdat_balanced(pgdat, balanced, classzone_idx);
> +		return !pgdat_balanced(pgdat, balanced, classzone_idx);
>  	else
>  		return !all_zones_ok;
>  }

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 20:04 ` James Bottomley 2011-05-12 20:29 ` Johannes Weiner 2011-05-12 22:04 ` James Bottomley @ 2011-05-13 6:16 ` Pekka Enberg 2011-05-13 10:05 ` Mel Gorman 2 siblings, 1 reply; 77+ messages in thread From: Pekka Enberg @ 2011-05-13 6:16 UTC (permalink / raw) To: James Bottomley Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 Hi, On Thu, May 12, 2011 at 11:04 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > Confirmed, I'm afraid ... I can trigger the problem with all three > patches under PREEMPT. It's not a hang this time, it's just kswapd > taking 100% system time on 1 CPU and it won't calm down after I unload > the system. OK, that's good to know. I'd still like to take patches 1-2, though. Mel? Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-13 6:16 ` Pekka Enberg @ 2011-05-13 10:05 ` Mel Gorman 0 siblings, 0 replies; 77+ messages in thread From: Mel Gorman @ 2011-05-13 10:05 UTC (permalink / raw) To: Pekka Enberg Cc: James Bottomley, Christoph Lameter, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, May 13, 2011 at 09:16:24AM +0300, Pekka Enberg wrote: > Hi, > > On Thu, May 12, 2011 at 11:04 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > Confirmed, I'm afraid ... I can trigger the problem with all three > > patches under PREEMPT. It's not a hang this time, it's just kswapd > > taking 100% system time on 1 CPU and it won't calm down after I unload > > the system. > > OK, that's good to know. I'd still like to take patches 1-2, though. Mel? > Wait for a V2 please. __GFP_REPEAT should also be removed. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 15:43 ` James Bottomley 2011-05-12 15:46 ` Dave Jones 2011-05-12 15:55 ` Pekka Enberg @ 2011-05-12 16:01 ` Christoph Lameter 2011-05-12 16:10 ` Eric Dumazet 2 siblings, 1 reply; 77+ messages in thread From: Christoph Lameter @ 2011-05-12 16:01 UTC (permalink / raw) To: James Bottomley Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011, James Bottomley wrote: > > Debian and Ubuntu have been using SLUB for a long time > > Only from Squeeze, which has been released for ~3 months. That doesn't > qualify as a "long time" in my book. I am sorry but I have never used a Debian/Ubuntu system in the last 3 years that did not use SLUB. And it was that by default. But then we usually do not run the "released" Debian version. Typically one runs testing. Ubuntu is different there we usually run releases. But those have been SLUB for as long as I remember. And so far it is rock solid and is widely rolled out throughout our infrastructure (mostly 2.6.32 kernels). > but a sample of one doeth not great testing make. > > However, since you admit even you see problems, let's concentrate on > fixing them rather than recriminations? I do not see problems here with earlier kernels. I only see these on one testing system with the latest kernels on Ubuntu 11.04. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 16:01 ` Christoph Lameter @ 2011-05-12 16:10 ` Eric Dumazet 2011-05-12 17:37 ` Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Eric Dumazet @ 2011-05-12 16:10 UTC (permalink / raw) To: Christoph Lameter Cc: James Bottomley, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 Le jeudi 12 mai 2011 à 11:01 -0500, Christoph Lameter a écrit : > On Thu, 12 May 2011, James Bottomley wrote: > > > > Debian and Ubuntu have been using SLUB for a long time > > > > Only from Squeeze, which has been released for ~3 months. That doesn't > > qualify as a "long time" in my book. > > I am sorry but I have never used a Debian/Ubuntu system in the last 3 > years that did not use SLUB. And it was that by default. But then we > usually do not run the "released" Debian version. Typically one runs > testing. Ubuntu is different there we usually run releases. But those > have been SLUB for as long as I remember. > > And so far it is rock solid and is widely rolled out throughout our > infrastructure (mostly 2.6.32 kernels). > > > but a sample of one doeth not great testing make. > > > > However, since you admit even you see problems, let's concentrate on > > fixing them rather than recriminations? > > I do not see problems here with earlier kernels. I only see these on one > testing system with the latest kernels on Ubuntu 11.04. More fuel to this discussion with commit 6d4831c2 Something is wrong with high order allocations, on some machines. Maybe we can find real cause instead of limiting us to use order-0 pages in the end... 
;)

commit 6d4831c283530a5f2c6bd8172c13efa236eb149d
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Wed Apr 27 15:26:41 2011 -0700

    vfs: avoid large kmalloc()s for the fdtable

    Azurit reports large increases in system time after 2.6.36 when running
    Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use
    kmalloc() to allocate fdmem if possible").

    That patch caused the vfs to use kmalloc() for very large allocations
    and this is causing excessive work (and presumably excessive reclaim)
    within the page allocator.

    Fix it by falling back to vmalloc() earlier - when the allocation
    attempt would have been considered "costly" by reclaim.

^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 16:10 ` Eric Dumazet @ 2011-05-12 17:37 ` Andrew Morton 0 siblings, 0 replies; 77+ messages in thread From: Andrew Morton @ 2011-05-12 17:37 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Lameter, James Bottomley, Mel Gorman, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 12 May 2011 18:10:38 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > More fuel to this discussion with commit 6d4831c2 > > Something is wrong with high order allocations, on some machines. > > Maybe we can find real cause instead of limiting us to use order-0 pages > in the end... ;) > > commit 6d4831c283530a5f2c6bd8172c13efa236eb149d > Author: Andrew Morton <akpm@linux-foundation.org> > Date: Wed Apr 27 15:26:41 2011 -0700 > > vfs: avoid large kmalloc()s for the fdtable Well, it's always been the case that satisfying higher-order allocations takes a disproportionate amount of work in page reclaim. And often causes excessive reclaim. That's why we've traditionally worked to avoid higher-order allocations, and this has always been a problem with slub. But the higher-order allocations shouldn't cause the VM to melt down. We changed something, and now it melts down. Changing slub to avoid that meltdown doesn't fix the thing we broke. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0 2011-05-12 15:27 ` Christoph Lameter 2011-05-12 15:43 ` James Bottomley @ 2011-05-12 15:45 ` Dave Jones 1 sibling, 0 replies; 77+ messages in thread From: Dave Jones @ 2011-05-12 15:45 UTC (permalink / raw) To: Christoph Lameter Cc: James Bottomley, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 10:27:00AM -0500, Christoph Lameter wrote: > On Thu, 12 May 2011, James Bottomley wrote: > > It's only recently that the desktop > > distributions started to ... the bugs are showing up under FC15 beta, > > which is the first fedora distribution to enable it. I'd say we're only > > just beginning widespread SLUB testing. > > Debian and Ubuntu have been using SLUB for a long time (and AFAICT from my > archives so has Fedora). Indeed. It was enabled in Fedora pretty much as soon as it appeared in mainline. Dave ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman ` (2 preceding siblings ...) 2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman @ 2011-05-11 21:39 ` James Bottomley 2011-05-11 22:28 ` David Rientjes 3 siblings, 1 reply; 77+ messages in thread From: James Bottomley @ 2011-05-11 21:39 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, 2011-05-11 at 16:29 +0100, Mel Gorman wrote: > Debian (and probably Ubuntu) have recently changed to the default > option of SLUB. There are a few reports of people experiencing hangs > when copying large amounts of data with kswapd using a large amount of > CPU. It appears this is down to SLUB using high orders by default and > the page allocator and reclaim struggling to keep up. The following > three patches reduce the cost of using those high orders. > > Patch 1 prevents kswapd waking up in response to SLUB's speculative > use of high orders. This eliminates the hangs and while the > system can still stall for long periods, it recovers. > > Patch 2 further reduces the cost by preventing SLUB from entering direct > compaction or reclaim paths on the grounds that falling > back to order-0 should be cheaper. > > Patch 3 defaults SLUB to using order-0 on the grounds that the > systems that heavily benefit from using high-order are also > sized to fit in physical memory. On such systems, they should > manually tune slub_max_order=3. > > My own data on this is not great. 
> I haven't really been able to reproduce the same problem locally but a
> significant failing is that the tests weren't stressing X but I couldn't
> make meaningful comparisons by just randomly clicking on things (working
> on fixing this problem).
> 
> The test case is simple. "download tar" wgets a large tar file and
> stores it locally. "unpack" is expanding it (15 times physical RAM in
> this case) and "delete source dirs" is the tarfile being deleted again.
> I also experimented with having the tar copied numerous times and into
> deeper directories to increase the size but the results were not
> particularly interesting so I left it as one tar.
> 
> Test server, 4 CPU threads (AMD Phenom), x86_64, 2G of RAM, no X running
>                               - nowake
>                      largecopy-vanilla    kswapd-v1r1   noexstep-v1r1   default0-v1r1
> download tar             94 ( 0.00%)      94 ( 0.00%)     94 ( 0.00%)     93 ( 1.08%)
> unpack tar              521 ( 0.00%)     551 (-5.44%)    482 ( 8.09%)    488 ( 6.76%)
> delete source dirs      208 ( 0.00%)     218 (-4.59%)    194 ( 7.22%)    194 ( 7.22%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)     740.82    777.73    739.98    747.47
> Total Elapsed Time (seconds)            1046.66   1273.91    962.47    936.17
> 
> Disabling kswapd alone hurts performance slightly even though testers
> report it fixes hangs. I would guess it's because SLUB callers are
> calling direct reclaim more frequently (I belatedly noticed that
> compaction was disabled so it's not a factor) but haven't confirmed it.
> However, preventing kswapd waking or entering direct reclaim and having
> SLUB falling back to order-0 performed noticeably faster. Just using
> order-0 in the first place was fastest of all.
> 
> I tried running the same test on a test laptop but unfortunately due to
> a misconfiguration the results were lost. It would take a few hours to
> rerun so am posting without them.
> 
> If the testers verify this series help and we agree the patches are
> appropriate, they should be considered a stable candidate for 2.6.38. 
OK, I confirm that I can't seem to break this one. No hangs visible, even when loading up the system with firefox, evolution, the usual massive untar, X and even a distribution upgrade.

You can add my tested-by

James

^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-11 21:39 ` [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations James Bottomley @ 2011-05-11 22:28 ` David Rientjes 2011-05-11 22:34 ` James Bottomley 0 siblings, 1 reply; 77+ messages in thread From: David Rientjes @ 2011-05-11 22:28 UTC (permalink / raw) To: James Bottomley Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, 11 May 2011, James Bottomley wrote: > OK, I confirm that I can't seem to break this one. No hangs visible, > even when loading up the system with firefox, evolution, the usual > massive untar, X and even a distribution upgrade. > > You can add my tested-by > Your system still hangs with patches 1 and 2 only? ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-11 22:28 ` David Rientjes @ 2011-05-11 22:34 ` James Bottomley 2011-05-12 11:13 ` Pekka Enberg 2011-05-12 18:04 ` Andrea Arcangeli 0 siblings, 2 replies; 77+ messages in thread From: James Bottomley @ 2011-05-11 22:34 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote: > On Wed, 11 May 2011, James Bottomley wrote: > > > OK, I confirm that I can't seem to break this one. No hangs visible, > > even when loading up the system with firefox, evolution, the usual > > massive untar, X and even a distribution upgrade. > > > > You can add my tested-by > > > > Your system still hangs with patches 1 and 2 only? Yes, but only once in all the testing. With patches 1 and 2 the hang is much harder to reproduce, but it still seems to be present if I hit it hard enough. James ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-11 22:34 ` James Bottomley @ 2011-05-12 11:13 ` Pekka Enberg 2011-05-12 13:19 ` Mel Gorman 2011-05-12 14:04 ` James Bottomley 2011-05-12 18:04 ` Andrea Arcangeli 1 sibling, 2 replies; 77+ messages in thread From: Pekka Enberg @ 2011-05-12 11:13 UTC (permalink / raw) To: James Bottomley Cc: David Rientjes, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On 5/12/11 1:34 AM, James Bottomley wrote: > On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote: >> On Wed, 11 May 2011, James Bottomley wrote: >> >>> OK, I confirm that I can't seem to break this one. No hangs visible, >>> even when loading up the system with firefox, evolution, the usual >>> massive untar, X and even a distribution upgrade. >>> >>> You can add my tested-by >>> >> Your system still hangs with patches 1 and 2 only? > Yes, but only once in all the testing. With patches 1 and 2 the hang is > much harder to reproduce, but it still seems to be present if I hit it > hard enough. Patches 1-2 look reasonable to me. I'm not completely convinced of patch 3, though. Why are we seeing these problems now? This has been in mainline for a long time already. Shouldn't we fix kswapd? Pekka ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-12 11:13 ` Pekka Enberg @ 2011-05-12 13:19 ` Mel Gorman 2011-05-12 14:04 ` James Bottomley 1 sibling, 0 replies; 77+ messages in thread From: Mel Gorman @ 2011-05-12 13:19 UTC (permalink / raw) To: Pekka Enberg Cc: James Bottomley, David Rientjes, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 02:13:44PM +0300, Pekka Enberg wrote: > On 5/12/11 1:34 AM, James Bottomley wrote: > >On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote: > >>On Wed, 11 May 2011, James Bottomley wrote: > >> > >>>OK, I confirm that I can't seem to break this one. No hangs visible, > >>>even when loading up the system with firefox, evolution, the usual > >>>massive untar, X and even a distribution upgrade. > >>> > >>>You can add my tested-by > >>> > >>Your system still hangs with patches 1 and 2 only? > >Yes, but only once in all the testing. With patches 1 and 2 the hang is > >much harder to reproduce, but it still seems to be present if I hit it > >hard enough. > > Patches 1-2 look reasonable to me. I'm not completely convinced of > patch 3, though. Why are we seeing these problems now? I'm not certain and testing so far as only being able to point to changing from SLAB to SLUB between 2.6.37 and 2.6.38. This probably boils down to distributions changing their allocator from slab to slub as recommended by Kconfig and SLUB being tested heavily on desktop workloads in a variety of settings for the first time. It's worth noting that only a few users have been able to reproduce this. I don't see the severe hangs for example during tests meaning it might also be down to newer hardware. 
What may be required to reproduce this is many CPUs (4 on the test machines) with relatively low memory for a 4-CPU machine (2G) and a slower disk than people might have tested with up until now. There are other new considerations as well that weren't much of a factor when SLUB came along. The first reproduction case involved ext4, for example, which does delayed block allocation. It's possible there is some problem whereby all the dirty pages to be written to disk need blocks to be allocated and GFP_NOFS is not being used properly. Instead of failing the high-order allocation, we then block instead, hanging direct reclaimers and kswapd. The filesystem people looked at this bug but didn't mention if something like this was a possibility. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-12 11:13 ` Pekka Enberg 2011-05-12 13:19 ` Mel Gorman @ 2011-05-12 14:04 ` James Bottomley 2011-05-12 15:53 ` James Bottomley 1 sibling, 1 reply; 77+ messages in thread From: James Bottomley @ 2011-05-12 14:04 UTC (permalink / raw) To: Pekka Enberg Cc: David Rientjes, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-12 at 14:13 +0300, Pekka Enberg wrote: > On 5/12/11 1:34 AM, James Bottomley wrote: > > On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote: > >> On Wed, 11 May 2011, James Bottomley wrote: > >> > >>> OK, I confirm that I can't seem to break this one. No hangs visible, > >>> even when loading up the system with firefox, evolution, the usual > >>> massive untar, X and even a distribution upgrade. > >>> > >>> You can add my tested-by > >>> > >> Your system still hangs with patches 1 and 2 only? > > Yes, but only once in all the testing. With patches 1 and 2 the hang is > > much harder to reproduce, but it still seems to be present if I hit it > > hard enough. > > Patches 1-2 look reasonable to me. I'm not completely convinced of patch > 3, though. Why are we seeing these problems now? This has been in > mainline for a long time already. Shouldn't we fix kswapd? So I'm open to this. The hang occurs when kswapd races around in shrink_slab and never exits. It looks like there's a massive number of wakeups triggering this, but we haven't been able to diagnose it further. Turning on PREEMPT gets rid of the hang, so I could try to reproduce with PREEMPT and turn on tracing. The problem so far has been that the number of events is so huge that the trace buffer only captures a few microseconds of output. James ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-12 14:04 ` James Bottomley @ 2011-05-12 15:53 ` James Bottomley 2011-05-13 11:25 ` Mel Gorman 0 siblings, 1 reply; 77+ messages in thread From: James Bottomley @ 2011-05-12 15:53 UTC (permalink / raw) To: Pekka Enberg Cc: David Rientjes, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 1736 bytes --] On Thu, 2011-05-12 at 09:04 -0500, James Bottomley wrote: > On Thu, 2011-05-12 at 14:13 +0300, Pekka Enberg wrote: > > On 5/12/11 1:34 AM, James Bottomley wrote: > > > On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote: > > >> On Wed, 11 May 2011, James Bottomley wrote: > > >> > > >>> OK, I confirm that I can't seem to break this one. No hangs visible, > > >>> even when loading up the system with firefox, evolution, the usual > > >>> massive untar, X and even a distribution upgrade. > > >>> > > >>> You can add my tested-by > > >>> > > >> Your system still hangs with patches 1 and 2 only? > > > Yes, but only once in all the testing. With patches 1 and 2 the hang is > > > much harder to reproduce, but it still seems to be present if I hit it > > > hard enough. > > > > Patches 1-2 look reasonable to me. I'm not completely convinced of patch > > 3, though. Why are we seeing these problems now? This has been in > > mainline for a long time already. Shouldn't we fix kswapd? > > So I'm open to this. The hang occurs when kswapd races around in > shrink_slab and never exits. It looks like there's a massive number of > wakeups triggering this, but we haven't been able to diagnose it > further. turning on PREEMPT gets rid of the hang, so I could try to > reproduce with PREEMPT and turn on tracing. 
The problem so far has been > that the number of events is so huge that the trace buffer only captures > a few microseconds of output. OK, here's the trace from a PREEMPT kernel (2.6.38.6) when kswapd hits 99% and stays there. I've only enabled the vmscan tracepoints to try and get a longer run. It mostly looks like kswapd waking itself, but there might be more in there that mm trained eyes can see. James [-- Attachment #2: tmp.trace.gz --] [-- Type: application/x-gzip, Size: 175858 bytes --]
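[Editorial note: for anyone repeating the experiment, restricting ftrace to the vmscan tracepoint group looks roughly like the following. This is a sketch, not commands taken from the thread; it assumes debugfs is mounted at /sys/kernel/debug, a root shell, and an arbitrary buffer size and capture window.]

```shell
cd /sys/kernel/debug/tracing

echo 0 > tracing_on            # stop tracing while configuring
echo > trace                   # clear any stale buffer contents
echo 8192 > buffer_size_kb     # enlarge per-CPU buffers for a longer capture
echo 1 > events/vmscan/enable  # enable only the vmscan tracepoint group

echo 1 > tracing_on            # start; now reproduce the kswapd spin
sleep 30
echo 0 > tracing_on
gzip < trace > /tmp/kswapd-vmscan.trace.gz
```

Limiting the enabled events this way is the standard approach when a high event rate would otherwise wrap the ring buffer in microseconds.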
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-12 15:53 ` James Bottomley @ 2011-05-13 11:25 ` Mel Gorman 0 siblings, 0 replies; 77+ messages in thread From: Mel Gorman @ 2011-05-13 11:25 UTC (permalink / raw) To: James Bottomley Cc: Pekka Enberg, David Rientjes, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 10:53:44AM -0500, James Bottomley wrote: > On Thu, 2011-05-12 at 09:04 -0500, James Bottomley wrote: > > On Thu, 2011-05-12 at 14:13 +0300, Pekka Enberg wrote: > > > On 5/12/11 1:34 AM, James Bottomley wrote: > > > > On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote: > > > >> On Wed, 11 May 2011, James Bottomley wrote: > > > >> > > > >>> OK, I confirm that I can't seem to break this one. No hangs visible, > > > >>> even when loading up the system with firefox, evolution, the usual > > > >>> massive untar, X and even a distribution upgrade. > > > >>> > > > >>> You can add my tested-by > > > >>> > > > >> Your system still hangs with patches 1 and 2 only? > > > > Yes, but only once in all the testing. With patches 1 and 2 the hang is > > > > much harder to reproduce, but it still seems to be present if I hit it > > > > hard enough. > > > > > > Patches 1-2 look reasonable to me. I'm not completely convinced of patch > > > 3, though. Why are we seeing these problems now? This has been in > > > mainline for a long time already. Shouldn't we fix kswapd? > > > > So I'm open to this. The hang occurs when kswapd races around in > > shrink_slab and never exits. It looks like there's a massive number of > > wakeups triggering this, but we haven't been able to diagnose it > > further. turning on PREEMPT gets rid of the hang, so I could try to > > reproduce with PREEMPT and turn on tracing. 
The problem so far has been > > that the number of events is so huge that the trace buffer only captures > > a few microseconds of output. > > OK, here's the trace from a PREEMPT kernel (2.6.38.6) when kswapd hits > 99% and stays there. I've only enabled the vmscan tracepoints to try > and get a longer run. It mostly looks like kswapd waking itself, but > there might be more in there that mm trained eyes can see. > For 2.6.38.6, commit [2876592f: mm: vmscan: stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT] may also be needed if CONFIG_COMPACTION is set. -- Mel Gorman SUSE Labs
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-11 22:34 ` James Bottomley 2011-05-12 11:13 ` Pekka Enberg @ 2011-05-12 18:04 ` Andrea Arcangeli 2011-05-13 11:24 ` Mel Gorman 1 sibling, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2011-05-12 18:04 UTC (permalink / raw) To: James Bottomley Cc: David Rientjes, Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 Hi James! On Wed, May 11, 2011 at 05:34:27PM -0500, James Bottomley wrote: > Yes, but only once in all the testing. With patches 1 and 2 the hang is Weird, patch 2 makes the large order allocation without ~__GFP_WAIT, so even COMPACTION=y/n shouldn't matter anymore. Am I misreading something, Mel? Removing ~__GFP_WAIT from patch 2 (and adding ~__GFP_REPEAT as a correctness improvement) and setting COMPACTION=y also should work ok. Removing ~__GFP_WAIT from patch 2 and setting COMPACTION=n is expected not to work well. But compaction should only make the difference if you remove ~__GFP_WAIT from patch 2.
* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations 2011-05-12 18:04 ` Andrea Arcangeli @ 2011-05-13 11:24 ` Mel Gorman 0 siblings, 0 replies; 77+ messages in thread From: Mel Gorman @ 2011-05-13 11:24 UTC (permalink / raw) To: Andrea Arcangeli Cc: James Bottomley, David Rientjes, Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 12, 2011 at 08:04:57PM +0200, Andrea Arcangeli wrote: > Hi James! > > On Wed, May 11, 2011 at 05:34:27PM -0500, James Bottomley wrote: > > Yes, but only once in all the testing. With patches 1 and 2 the hang is > > Weird patch 2 makes the large order allocation without ~__GFP_WAIT, so > even COMPACTION=y/n shouldn't matter anymore. Am I misreading > something Mel? > > Removing ~__GFP_WAIT from patch 2 (and adding ~__GFP_REPEAT as a > correctness improvement) and setting COMPACTION=y also should work ok. > should_continue_reclaim could still be looping unless __GFP_REPEAT is cleared if CONFIG_COMPACTION is set. > Removing ~__GFP_WAIT from patch 2 and setting COMPACTION=n is expected > not to work well. > > But compaction should only make the difference if you remove > ~__GFP_WAIT from patch 2. -- Mel Gorman SUSE Labs
end of thread, other threads:[~2011-05-17 19:25 UTC | newest] Thread overview: 77+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman 2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman 2011-05-11 20:38 ` David Rientjes 2011-05-11 15:29 ` [PATCH 2/3] mm: slub: Do not take expensive steps " Mel Gorman 2011-05-11 20:38 ` David Rientjes 2011-05-11 21:10 ` Mel Gorman 2011-05-12 17:25 ` Andrea Arcangeli 2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman 2011-05-11 20:38 ` David Rientjes 2011-05-11 20:53 ` James Bottomley 2011-05-11 21:09 ` Mel Gorman 2011-05-11 22:27 ` David Rientjes 2011-05-13 10:14 ` Mel Gorman 2011-05-12 17:36 ` Andrea Arcangeli 2011-05-16 21:03 ` David Rientjes 2011-05-17 9:48 ` Mel Gorman 2011-05-17 19:25 ` David Rientjes 2011-05-12 14:43 ` Christoph Lameter 2011-05-12 15:15 ` James Bottomley 2011-05-12 15:27 ` Christoph Lameter 2011-05-12 15:43 ` James Bottomley 2011-05-12 15:46 ` Dave Jones 2011-05-12 16:00 ` James Bottomley 2011-05-12 16:08 ` Dave Jones 2011-05-12 16:27 ` Christoph Lameter 2011-05-12 16:30 ` James Bottomley 2011-05-12 16:48 ` Christoph Lameter 2011-05-12 17:46 ` Andrea Arcangeli 2011-05-12 18:00 ` Christoph Lameter 2011-05-12 18:18 ` Andrea Arcangeli 2011-05-12 17:06 ` Pekka Enberg 2011-05-12 17:11 ` Pekka Enberg 2011-05-12 17:38 ` Christoph Lameter 2011-05-12 18:00 ` Andrea Arcangeli 2011-05-13 9:49 ` Mel Gorman 2011-05-15 16:39 ` Andrea Arcangeli 2011-05-16 8:42 ` Mel Gorman 2011-05-12 17:51 ` Andrea Arcangeli 2011-05-12 18:03 ` Christoph Lameter 2011-05-12 18:09 ` Andrea Arcangeli 2011-05-12 18:16 ` Christoph Lameter 2011-05-12 18:36 ` James Bottomley 2011-05-12 17:40 ` Andrea Arcangeli 2011-05-12 15:55 ` Pekka Enberg 2011-05-12 18:37 ` James Bottomley 2011-05-12 18:46 ` Christoph Lameter 2011-05-12 
19:21 ` James Bottomley 2011-05-12 19:44 ` James Bottomley 2011-05-12 20:04 ` James Bottomley 2011-05-12 20:29 ` Johannes Weiner 2011-05-12 20:31 ` Johannes Weiner 2011-05-12 20:31 ` James Bottomley 2011-05-12 22:04 ` James Bottomley 2011-05-12 22:15 ` Johannes Weiner 2011-05-12 22:58 ` Minchan Kim 2011-05-13 5:39 ` Minchan Kim 2011-05-13 0:47 ` James Bottomley 2011-05-13 4:12 ` James Bottomley 2011-05-13 10:55 ` Mel Gorman 2011-05-13 14:16 ` James Bottomley 2011-05-13 10:30 ` Mel Gorman 2011-05-13 6:16 ` Pekka Enberg 2011-05-13 10:05 ` Mel Gorman 2011-05-12 16:01 ` Christoph Lameter 2011-05-12 16:10 ` Eric Dumazet 2011-05-12 17:37 ` Andrew Morton 2011-05-12 15:45 ` Dave Jones 2011-05-11 21:39 ` [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations James Bottomley 2011-05-11 22:28 ` David Rientjes 2011-05-11 22:34 ` James Bottomley 2011-05-12 11:13 ` Pekka Enberg 2011-05-12 13:19 ` Mel Gorman 2011-05-12 14:04 ` James Bottomley 2011-05-12 15:53 ` James Bottomley 2011-05-13 11:25 ` Mel Gorman 2011-05-12 18:04 ` Andrea Arcangeli 2011-05-13 11:24 ` Mel Gorman