diff for duplicates of <20110519091906.GT5279@suse.de>

diff --git a/a/1.txt b/N1/1.txt
index c9fdf3e..d9ecb3b 100644
--- a/a/1.txt
+++ b/N1/1.txt
@@ -192,7 +192,3 @@ spending a significant percentage of it in shrink_slab().
 -- 
 Mel Gorman
 SUSE Labs
---
-To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
-the body of a message to majordomo@vger.kernel.org
-More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/a/content_digest b/N1/content_digest
index f3d9b1d..6f6ce39 100644
--- a/a/content_digest
+++ b/N1/content_digest
@@ -219,10 +219,6 @@
  "\n"
  "-- \n"
  "Mel Gorman\n"
- "SUSE Labs\n"
- "--\n"
- "To unsubscribe from this list: send the line \"unsubscribe linux-ext4\" in\n"
- "the body of a message to majordomo@vger.kernel.org\n"
- More majordomo info at http://vger.kernel.org/majordomo-info.html
+ SUSE Labs
 
-a9de08dfc789d9c48bce715e2e02b190cd077370007e293b57a3f83a26074bd0
+8380d4ffd974849fa57ed02447f551f53cee44655fcce9268011ead4f9b00f2b
diff --git a/a/1.txt b/N2/1.txt
index c9fdf3e..1e1d01a 100644
--- a/a/1.txt
+++ b/N2/1.txt
@@ -15,59 +15,59 @@ On Thu, May 19, 2011 at 07:42:29AM +0900, Minchan Kim wrote:
 > >> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
 > >> > > > Acked-by: Rik van Riel <riel@redhat.com>
 > >> > > > ---
-> >> > > > mm/vmscan.c | 4 ++++
-> >> > > > 1 files changed, 4 insertions(+), 0 deletions(-)
+> >> > > > mm/vmscan.c | 4 ++++
+> >> > > > 1 files changed, 4 insertions(+), 0 deletions(-)
 > >> > > >
 > >> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
 > >> > > > index af24d1e..4d24828 100644
 > >> > > > --- a/mm/vmscan.c
 > >> > > > +++ b/mm/vmscan.c
 > >> > > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
-> >> > > > unsigned long balanced = 0;
-> >> > > > bool all_zones_ok = true;
+> >> > > > unsigned long balanced = 0;
+> >> > > > bool all_zones_ok = true;
 > >> > > >
-> >> > > > + /* If kswapd has been running too long, just sleep */
-> >> > > > + if (need_resched())
-> >> > > > + return false;
+> >> > > > + /* If kswapd has been running too long, just sleep */
+> >> > > > + if (need_resched())
+> >> > > > + return false;
 > >> > > > +
-> >> > > > /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
-> >> > > > if (remaining)
-> >> > > > return true;
+> >> > > > /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
+> >> > > > if (remaining)
+> >> > > > return true;
 > >> > >
 > >> > > I'm a bit worried by this one.
 > >> > >
 > >> > > Do we really fully understand why kswapd is continuously running like
-> >> > > this? The changelog makes me think "no" ;)
+> >> > > this? The changelog makes me think "no" ;)
 > >> > >
 > >> > > Given that the page-allocating process is madly reclaiming pages in
 > >> > > direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
 > >> > > different CPU, we should pretty promptly get into a situation where
-> >> > > kswapd can suspend itself. But that obviously isn't happening. So
+> >> > > kswapd can suspend itself. But that obviously isn't happening. So
 > >> > > what *is* going on?
 > >> >
 > >> > The triggering workload is a massive untar using a file on the same
 > >> > filesystem, so that's a continuous stream of pages read into the cache
-> >> > for the input and a stream of dirty pages out for the writes. We
+> >> > for the input and a stream of dirty pages out for the writes. We
 > >> > thought it might have been out of control shrinkers, so we already
-> >> > debugged that and found it wasn't. It just seems to be an imbalance in
+> >> > debugged that and found it wasn't. It just seems to be an imbalance in
 > >> > the zones that the shrinkers can't fix which causes
 > >> > sleeping_prematurely() to return true almost indefinitely.
 > >>
-> >> Is the untar disk-bound? The untar has presumably hit the writeback
-> >> dirty_ratio? So its rate of page allocation is approximately equal to
+> >> Is the untar disk-bound? The untar has presumably hit the writeback
+> >> dirty_ratio? So its rate of page allocation is approximately equal to
 > >> the write speed of the disks?
 > >>
 > >
 > > A reasonable assumption but it gets messy.
 > >
 > >> If so, the VM is consuming 100% of a CPU to reclaim pages at a mere
-> >> tens-of-megabytes-per-second. If so, there's something seriously wrong
+> >> tens-of-megabytes-per-second. If so, there's something seriously wrong
 > >> here - under favorable conditions one would expect reclaim to free up
 > >> 100,000 pages/sec, maybe more.
 > >>
 > >> If the untar is not disk-bound and the required page reclaim rate is
 > >> equal to the rate at which a CPU can read, decompress and write to
-> >> pagecache then, err, maybe possible. But it still smells of
+> >> pagecache then, err, maybe possible. But it still smells of
 > >> inefficient reclaim.
 > >>
 > >
@@ -75,61 +75,61 @@ On Thu, May 19, 2011 at 07:42:29AM +0900, Minchan Kim wrote:
 > > how much exactly. Reproducing this locally would have been nice but
 > > the following conditions are likely happening on the problem machine.
 > >
-> > SLUB is using high-orders for its slabs, kswapd and reclaimers are
-> > reclaiming at a faster rate than required for just the data. SLUB
-> > is using order-2 allocs for inodes so every 18 files created by
-> > untar, we need an order-2 page. For ext4_io_end, we need order-3
-> > allocs and we are allocating these due to delayed block allocation.
+> > SLUB is using high-orders for its slabs, kswapd and reclaimers are
+> > reclaiming at a faster rate than required for just the data. SLUB
+> > is using order-2 allocs for inodes so every 18 files created by
+> > untar, we need an order-2 page. For ext4_io_end, we need order-3
+> > allocs and we are allocating these due to delayed block allocation.
 > >
-> > So for example: 50 files, each less than 1 page in size needs 50
-> > order-0 pages, 3 order-2 page and 2 order-3 pages
+> > So for example: 50 files, each less than 1 page in size needs 50
+> > order-0 pages, 3 order-2 page and 2 order-3 pages
 > >
-> > To satisfy the high order pages, we are reclaiming at least 28
-> > pages. For compaction, we are migrating these so we are allocating
-> > a further 28 pages and then copying putting further pressure on
-> > the system. We may do this multiple times as order-0 allocations
-> > could be breaking up the pages again. Without compaction, we are
-> > only reclaiming but can get stalled for significant periods of
-> > time if dirty or writeback pages are encountered in the contiguous
-> > blocks and can reclaim too many pages quite easily.
+> > To satisfy the high order pages, we are reclaiming at least 28
+> > pages. For compaction, we are migrating these so we are allocating
+> > a further 28 pages and then copying putting further pressure on
+> > the system. We may do this multiple times as order-0 allocations
+> > could be breaking up the pages again. Without compaction, we are
+> > only reclaiming but can get stalled for significant periods of
+> > time if dirty or writeback pages are encountered in the contiguous
+> > blocks and can reclaim too many pages quite easily.
 > >
 > > So the rate of allocation required to write out data is higher than
 > > just the data rate. The reclaim rate could be just fine but the number
 > > of pages we need to reclaim to allocate slab objects can be screwy.
 > >
 > >> > > Secondly, taking an up-to-100ms sleep in response to a need_resched()
-> >> > > seems pretty savage and I suspect it risks undesirable side-effects. A
-> >> > > plain old cond_resched() would be more cautious. But presumably
+> >> > > seems pretty savage and I suspect it risks undesirable side-effects. A
+> >> > > plain old cond_resched() would be more cautious. But presumably
 > >> > > kswapd() is already running cond_resched() pretty frequently, so why
 > >> > > didn't that work?
 > >> >
 > >> > So the specific problem with cond_resched() is that kswapd is still
 > >> > runnable, so even if there's other work the system can be getting on
-> >> > with, it quickly comes back to looping madly in kswapd. If we return
+> >> > with, it quickly comes back to looping madly in kswapd. If we return
 > >> > false from sleeping_prematurely(), we stop kswapd until its woken up to
-> >> > do more work. This manifests, even on non sandybridge systems that
+> >> > do more work. This manifests, even on non sandybridge systems that
 > >> > don't hang as a lot of time burned in kswapd.
 > >> >
 > >> > I think the sandybridge bug I see on the laptop is that cond_resched()
-> >> > is somehow ineffective: kswapd is usually hogging one CPU and there are
+> >> > is somehow ineffective: kswapd is usually hogging one CPU and there are
 > >> > runnable processes but they seem to cluster on other CPUs, leaving
 > >> > kswapd to spin at close to 100% system time.
 > >> >
 > >> > When the problem was first described, we tried sprinkling more
 > >> > cond_rescheds() in the shrinker loop and it didn't work.
 > >>
-> >> Seems to me that kswapd for some reason is doing too much work. Or,
-> >> more specifically is doing its work very inefficiently. Making kswapd
+> >> Seems to me that kswapd for some reason is doing too much work. Or,
+> >> more specifically is doing its work very inefficiently. Making kswapd
 > >> take arbitrary naps when it's misbehaving didn't fix that misbehaviour!
 > >>
 > >
 > > It is likely to be doing work inefficiently in one of two ways
 > >
-> > 1. We are reclaiming far more pages than required by the data
-> > for slab objects
+> > 1. We are reclaiming far more pages than required by the data
+> > for slab objects
 > >
-> > 2. The rate we are reclaiming is fast enough that dirty pages are
-> > reaching the end of the LRU quickly
+> > 2. The rate we are reclaiming is fast enough that dirty pages are
+> > reaching the end of the LRU quickly
 > >
 > > The latter part is also important. I doubt we are getting stalled in
 > > writepage as this is new data being written to disk to blocks aren't
@@ -150,7 +150,7 @@ On Thu, May 19, 2011 at 07:42:29AM +0900, Minchan Kim wrote:
 > >
 > >> It would be interesting to watch kswapd's page reclaim inefficiency
 > >> when this is happening: /proc/vmstat:pgscan_kswapd_* versus
-> >> /proc/vmstat:kswapd_steal. If that ration is high then kswapd is
+> >> /proc/vmstat:kswapd_steal. If that ration is high then kswapd is
 > >> scanning many pages and not reclaiming them.
 > >>
 > >> But given the prominence of shrink_slab in the traces, perhaps that
@@ -192,7 +192,10 @@ spending a significant percentage of it in shrink_slab().
 -- 
 Mel Gorman
 SUSE Labs
+
 --
-To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
-the body of a message to majordomo@vger.kernel.org
-More majordomo info at http://vger.kernel.org/majordomo-info.html
+To unsubscribe, send a message with 'unsubscribe linux-mm' in
+the body to majordomo@kvack.org. For more info on Linux MM,
+see: http://www.linux-mm.org/ .
+Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
+Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
diff --git a/a/content_digest b/N2/content_digest
index f3d9b1d..e8063c9 100644
--- a/a/content_digest
+++ b/N2/content_digest
@@ -43,59 +43,59 @@
  "> >> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>\n"
  "> >> > > > Acked-by: Rik van Riel <riel@redhat.com>\n"
  "> >> > > > ---\n"
- "> >> > > > \302\240mm/vmscan.c | \302\240 \302\2404 ++++\n"
- "> >> > > > \302\2401 files changed, 4 insertions(+), 0 deletions(-)\n"
+ "> >> > > > mm/vmscan.c | 4 ++++\n"
+ "> >> > > > 1 files changed, 4 insertions(+), 0 deletions(-)\n"
  "> >> > > >\n"
  "> >> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c\n"
  "> >> > > > index af24d1e..4d24828 100644\n"
  "> >> > > > --- a/mm/vmscan.c\n"
  "> >> > > > +++ b/mm/vmscan.c\n"
  "> >> > > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,\n"
- "> >> > > > \302\240 \302\240 \302\240 \302\240 unsigned long balanced = 0;\n"
- "> >> > > > \302\240 \302\240 \302\240 \302\240 bool all_zones_ok = true;\n"
+ "> >> > > > unsigned long balanced = 0;\n"
+ "> >> > > > bool all_zones_ok = true;\n"
  "> >> > > >\n"
- "> >> > > > + \302\240 \302\240 \302\240 /* If kswapd has been running too long, just sleep */\n"
- "> >> > > > + \302\240 \302\240 \302\240 if (need_resched())\n"
- "> >> > > > + \302\240 \302\240 \302\240 \302\240 \302\240 \302\240 \302\240 return false;\n"
+ "> >> > > > + /* If kswapd has been running too long, just sleep */\n"
+ "> >> > > > + if (need_resched())\n"
+ "> >> > > > + return false;\n"
  "> >> > > > +\n"
- "> >> > > > \302\240 \302\240 \302\240 \302\240 /* If a direct reclaimer woke kswapd within HZ/10, it's premature */\n"
- "> >> > > > \302\240 \302\240 \302\240 \302\240 if (remaining)\n"
- "> >> > > > \302\240 \302\240 \302\240 \302\240 \302\240 \302\240 \302\240 \302\240 return true;\n"
+ "> >> > > > /* If a direct reclaimer woke kswapd within HZ/10, it's premature */\n"
+ "> >> > > > if (remaining)\n"
+ "> >> > > > return true;\n"
  "> >> > >\n"
  "> >> > > I'm a bit worried by this one.\n"
  "> >> > >\n"
  "> >> > > Do we really fully understand why kswapd is continuously running like\n"
- "> >> > > this? \302\240The changelog makes me think \"no\" ;)\n"
+ "> >> > > this? The changelog makes me think \"no\" ;)\n"
  "> >> > >\n"
  "> >> > > Given that the page-allocating process is madly reclaiming pages in\n"
  "> >> > > direct reclaim (yes?) and that kswapd is madly reclaiming pages on a\n"
  "> >> > > different CPU, we should pretty promptly get into a situation where\n"
- "> >> > > kswapd can suspend itself. \302\240But that obviously isn't happening. \302\240So\n"
+ "> >> > > kswapd can suspend itself. But that obviously isn't happening. So\n"
  "> >> > > what *is* going on?\n"
  "> >> >\n"
  "> >> > The triggering workload is a massive untar using a file on the same\n"
  "> >> > filesystem, so that's a continuous stream of pages read into the cache\n"
- "> >> > for the input and a stream of dirty pages out for the writes. \302\240We\n"
+ "> >> > for the input and a stream of dirty pages out for the writes. We\n"
  "> >> > thought it might have been out of control shrinkers, so we already\n"
- "> >> > debugged that and found it wasn't. \302\240It just seems to be an imbalance in\n"
+ "> >> > debugged that and found it wasn't. It just seems to be an imbalance in\n"
  "> >> > the zones that the shrinkers can't fix which causes\n"
  "> >> > sleeping_prematurely() to return true almost indefinitely.\n"
  "> >>\n"
- "> >> Is the untar disk-bound? \302\240The untar has presumably hit the writeback\n"
- "> >> dirty_ratio? \302\240So its rate of page allocation is approximately equal to\n"
+ "> >> Is the untar disk-bound? The untar has presumably hit the writeback\n"
+ "> >> dirty_ratio? So its rate of page allocation is approximately equal to\n"
  "> >> the write speed of the disks?\n"
  "> >>\n"
  "> >\n"
  "> > A reasonable assumption but it gets messy.\n"
  "> >\n"
  "> >> If so, the VM is consuming 100% of a CPU to reclaim pages at a mere\n"
- "> >> tens-of-megabytes-per-second. \302\240If so, there's something seriously wrong\n"
+ "> >> tens-of-megabytes-per-second. If so, there's something seriously wrong\n"
  "> >> here - under favorable conditions one would expect reclaim to free up\n"
  "> >> 100,000 pages/sec, maybe more.\n"
  "> >>\n"
  "> >> If the untar is not disk-bound and the required page reclaim rate is\n"
  "> >> equal to the rate at which a CPU can read, decompress and write to\n"
- "> >> pagecache then, err, maybe possible. \302\240But it still smells of\n"
+ "> >> pagecache then, err, maybe possible. But it still smells of\n"
  "> >> inefficient reclaim.\n"
  "> >>\n"
  "> >\n"
@@ -103,61 +103,61 @@
  "> > how much exactly. Reproducing this locally would have been nice but\n"
  "> > the following conditions are likely happening on the problem machine.\n"
  "> >\n"
- "> > \302\240 SLUB is using high-orders for its slabs, kswapd and reclaimers are\n"
- "> > \302\240 reclaiming at a faster rate than required for just the data. SLUB\n"
- "> > \302\240 is using order-2 allocs for inodes so every 18 files created by\n"
- "> > \302\240 untar, we need an order-2 page. For ext4_io_end, we need order-3\n"
- "> > \302\240 allocs and we are allocating these due to delayed block allocation.\n"
+ "> > SLUB is using high-orders for its slabs, kswapd and reclaimers are\n"
+ "> > reclaiming at a faster rate than required for just the data. SLUB\n"
+ "> > is using order-2 allocs for inodes so every 18 files created by\n"
+ "> > untar, we need an order-2 page. For ext4_io_end, we need order-3\n"
+ "> > allocs and we are allocating these due to delayed block allocation.\n"
  "> >\n"
- "> > \302\240 So for example: 50 files, each less than 1 page in size needs 50\n"
- "> > \302\240 order-0 pages, 3 order-2 page and 2 order-3 pages\n"
+ "> > So for example: 50 files, each less than 1 page in size needs 50\n"
+ "> > order-0 pages, 3 order-2 page and 2 order-3 pages\n"
  "> >\n"
- "> > \302\240 To satisfy the high order pages, we are reclaiming at least 28\n"
- "> > \302\240 pages. For compaction, we are migrating these so we are allocating\n"
- "> > \302\240 a further 28 pages and then copying putting further pressure on\n"
- "> > \302\240 the system. We may do this multiple times as order-0 allocations\n"
- "> > \302\240 could be breaking up the pages again. Without compaction, we are\n"
- "> > \302\240 only reclaiming but can get stalled for significant periods of\n"
- "> > \302\240 time if dirty or writeback pages are encountered in the contiguous\n"
- "> > \302\240 blocks and can reclaim too many pages quite easily.\n"
+ "> > To satisfy the high order pages, we are reclaiming at least 28\n"
+ "> > pages. For compaction, we are migrating these so we are allocating\n"
+ "> > a further 28 pages and then copying putting further pressure on\n"
+ "> > the system. We may do this multiple times as order-0 allocations\n"
+ "> > could be breaking up the pages again. Without compaction, we are\n"
+ "> > only reclaiming but can get stalled for significant periods of\n"
+ "> > time if dirty or writeback pages are encountered in the contiguous\n"
+ "> > blocks and can reclaim too many pages quite easily.\n"
  "> >\n"
  "> > So the rate of allocation required to write out data is higher than\n"
  "> > just the data rate. The reclaim rate could be just fine but the number\n"
  "> > of pages we need to reclaim to allocate slab objects can be screwy.\n"
  "> >\n"
  "> >> > > Secondly, taking an up-to-100ms sleep in response to a need_resched()\n"
- "> >> > > seems pretty savage and I suspect it risks undesirable side-effects. \302\240A\n"
- "> >> > > plain old cond_resched() would be more cautious. \302\240But presumably\n"
+ "> >> > > seems pretty savage and I suspect it risks undesirable side-effects. A\n"
+ "> >> > > plain old cond_resched() would be more cautious. But presumably\n"
  "> >> > > kswapd() is already running cond_resched() pretty frequently, so why\n"
  "> >> > > didn't that work?\n"
  "> >> >\n"
  "> >> > So the specific problem with cond_resched() is that kswapd is still\n"
  "> >> > runnable, so even if there's other work the system can be getting on\n"
- "> >> > with, it quickly comes back to looping madly in kswapd. \302\240If we return\n"
+ "> >> > with, it quickly comes back to looping madly in kswapd. If we return\n"
  "> >> > false from sleeping_prematurely(), we stop kswapd until its woken up to\n"
- "> >> > do more work. \302\240This manifests, even on non sandybridge systems that\n"
+ "> >> > do more work. This manifests, even on non sandybridge systems that\n"
  "> >> > don't hang as a lot of time burned in kswapd.\n"
  "> >> >\n"
  "> >> > I think the sandybridge bug I see on the laptop is that cond_resched()\n"
- "> >> > is somehow ineffective: \302\240kswapd is usually hogging one CPU and there are\n"
+ "> >> > is somehow ineffective: kswapd is usually hogging one CPU and there are\n"
  "> >> > runnable processes but they seem to cluster on other CPUs, leaving\n"
  "> >> > kswapd to spin at close to 100% system time.\n"
  "> >> >\n"
  "> >> > When the problem was first described, we tried sprinkling more\n"
  "> >> > cond_rescheds() in the shrinker loop and it didn't work.\n"
  "> >>\n"
- "> >> Seems to me that kswapd for some reason is doing too much work. \302\240Or,\n"
- "> >> more specifically is doing its work very inefficiently. \302\240Making kswapd\n"
+ "> >> Seems to me that kswapd for some reason is doing too much work. Or,\n"
+ "> >> more specifically is doing its work very inefficiently. Making kswapd\n"
  "> >> take arbitrary naps when it's misbehaving didn't fix that misbehaviour!\n"
  "> >>\n"
  "> >\n"
  "> > It is likely to be doing work inefficiently in one of two ways\n"
  "> >\n"
- "> > \302\2401. We are reclaiming far more pages than required by the data\n"
- "> > \302\240 \302\240 for slab objects\n"
+ "> > 1. We are reclaiming far more pages than required by the data\n"
+ "> > for slab objects\n"
  "> >\n"
- "> > \302\2402. The rate we are reclaiming is fast enough that dirty pages are\n"
- "> > \302\240 \302\240 reaching the end of the LRU quickly\n"
+ "> > 2. The rate we are reclaiming is fast enough that dirty pages are\n"
+ "> > reaching the end of the LRU quickly\n"
  "> >\n"
  "> > The latter part is also important. I doubt we are getting stalled in\n"
  "> > writepage as this is new data being written to disk to blocks aren't\n"
@@ -178,7 +178,7 @@
  "> >\n"
  "> >> It would be interesting to watch kswapd's page reclaim inefficiency\n"
  "> >> when this is happening: /proc/vmstat:pgscan_kswapd_* versus\n"
- "> >> /proc/vmstat:kswapd_steal. \302\240If that ration is high then kswapd is\n"
+ "> >> /proc/vmstat:kswapd_steal. If that ration is high then kswapd is\n"
  "> >> scanning many pages and not reclaiming them.\n"
  "> >>\n"
  "> >> But given the prominence of shrink_slab in the traces, perhaps that\n"
@@ -220,9 +220,12 @@
  "-- \n"
  "Mel Gorman\n"
  "SUSE Labs\n"
+ "\n"
  "--\n"
- "To unsubscribe from this list: send the line \"unsubscribe linux-ext4\" in\n"
- "the body of a message to majordomo@vger.kernel.org\n"
- More majordomo info at http://vger.kernel.org/majordomo-info.html
+ "To unsubscribe, send a message with 'unsubscribe linux-mm' in\n"
+ "the body to majordomo@kvack.org. For more info on Linux MM,\n"
+ "see: http://www.linux-mm.org/ .\n"
+ "Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/\n"
+ "Don't email: <a href=mailto:\"dont@kvack.org\"> email@kvack.org </a>"
 
-a9de08dfc789d9c48bce715e2e02b190cd077370007e293b57a3f83a26074bd0
+6da0c666fa98c498193a92dfb91878d785686671aae92bcb08ba69dd56b6e3b6
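
Note on the page arithmetic in the quoted message: the figures "50 order-0 pages, 3 order-2 page and 2 order-3 pages" and "at least 28 pages" follow from an order-n allocation spanning 2^n contiguous base pages. The sketch below only re-derives those numbers. The function names are hypothetical, and the count of two order-3 blocks is taken from the message as given, not computed.

```python
def pages_per_block(order):
    """Number of base pages in one order-`order` allocation (2**order)."""
    return 1 << order


def pages_needed(blocks_by_order):
    """Total base pages for a mapping of {order: number_of_blocks}."""
    return sum(n * pages_per_block(order) for order, n in blocks_by_order.items())


def inode_slabs(files, inodes_per_slab=18):
    """Order-2 slabs needed if one slab holds 18 inodes (figure from the message)."""
    return -(-files // inodes_per_slab)  # ceiling division


files = 50
# Two order-3 blocks for ext4_io_end is the message's own figure (assumed here).
high_order = {2: inode_slabs(files), 3: 2}
everything = dict(high_order)
everything[0] = files  # one order-0 page per small file

if __name__ == "__main__":
    print("order-2 slabs for inodes:", inode_slabs(files))
    print("pages behind high-order allocations:", pages_needed(high_order))
    print("pages including file data:", pages_needed(everything))
```

With compaction, the message adds that the same 28 pages are allocated again as migration targets, so the pressure from the high-order allocations alone is roughly doubled before any repeat passes.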