From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <npiggin@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
epasch@de.ibm.com, SCHILLIG@de.ibm.com,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
christof.schmitt@de.ibm.com, thoss@de.ibm.com, hare@suse.de,
gregkh@novell.com
Subject: Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
Date: Wed, 17 Feb 2010 10:55:08 +0100 [thread overview]
Message-ID: <4B7BBCFC.4090101@linux.vnet.ibm.com> (raw)
In-Reply-To: <4B7ACC1E.9080205@linux.vnet.ibm.com>
Christian Ehrhardt wrote:
> Mel Gorman wrote:
>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
> [...]
>>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>>
>>> 23 @@ -965,7 +965,7 @@
>>> 24 set_page_private(page, migratetype);
>>> 25 list = &page->lru;
>>> 26 }
>>> 27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>> 28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>> 29 spin_unlock(&zone->lock);
>>> 30 return i;
>>> 31 }
>>>
> [...]
>> It "fixes" it only by not calling direct reclaim when it should :(
>
> yeah as we both realized -1 was not right so it was more a crazy workaround :-)
>
> Anyway after that being a dead end again I dug even deeper into the details of direct_reclaim - I think we can agree that out of the counters we already know the race between try_to_free making progress and get_page not getting a page causing the congestion_wait is source of the issue.
>
> So what I tried to bring some more light into all that was extending my perf counters to track a few more details in direct_reclaim.
> Two patches are attached and apply after the other three already available in that thread.
> The intention is
> a) to track the time
> a1) spent in try_to_free_pages
> a2) consumed after try_to_free_pages until get_page_from_freelist
> a3) spent in get_pages_from_freelist
> b1) after seeing that order!=0 -> drain_all_pages I wondered if that might differ even all calls look like they have zero
> b2) tracking the average amount of pages freed by try_to_free_pages for fast path and slow path (progres&!page)
>
> Naming convention (otherwise it might get confusing here)
> Good case - the scenario e.g. with e084b and 5f8dcc21 reverted resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
> Bad case - the scenario e.g. on a clean 2.6.32
> Fast path - direct reclaim calls that did not run into progress&!page
> Slow path - direct reclaim calls that ran into progress&!page ending up in a long congestion_wait and therefore called "slow" path
>
> Mini summary of what we had before in huge tables:
> fast path slow path
> GOOD CASE ~98% ~1-3%
> BAD CASE ~70% ~30%
> -> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)
>
> Out of the numbers I got the following things might help us to create a new approach to a solution.
> The timings showed that that the so called slow case is actually much faster passing though direct_reclaim in bad case.
>
> GOOD CASE duration
> a1 Fast-avg-duration_pre_ttf_2_post_ttf 164099
> a2 Fast-avg-duration_post_ttf_2_pre_get_page 459
> a3 Fast-avg-duration_pre_get_page_2_post_get_page 346
> a1 Slow-avg-duration_pre_ttf_2_post_ttf 127621
> a2 Slow-avg-duration_post_ttf_2_pre_get_page 1957
> a3 Slow-avg-duration_pre_get_page_2_post_get_page 256
> BAD CASE duration deviation to good case in %
> a1 Fast-avg-duration_pre_ttf_2_post_ttf 122921 -25.09%
> a2 Fast-avg-duration_post_ttf_2_pre_get_page 521 13.53%
> a3 Fast-avg-duration_pre_get_page_2_post_get_page 244 -29.55%
> a1 Slow-avg-duration_pre_ttf_2_post_ttf 109740 -14.01%
> a2 Slow-avg-duration_post_ttf_2_pre_get_page 250 -87.18%
> a3 Slow-avg-duration_pre_get_page_2_post_get_page 117 -54.16%
>
> That means that in the bad case the execution is much faster. Especially in the case that eventually runs into the slow path try_to_free is 14% faster, more important the time between try_to_free and get_pages is 87%! faster => less than a fifth and finally get_page is 54% faster, but that is probably just failing in an almost empty list which is fast^^.
>
> As I checked order which always was zero the time is not spent in drain_all_pages and the only other thing left might be cond_resched ?!
> Everything else are a few assignments so it can't be much else.
> But why would e.g. not running into schedule in cond_resched cause get_pages to not find anything - I don't know and I would expect it should be the other way around - the faster you get from free to get the more pages should be left.
THe reason here is probably the the fact that in the bad case a lot of
processes are waiting on congestion_wait and are therefore not runnnable
and that way not scheduled via cond_resched.
I'll test this theory today or tomorrow with cond_resched in
direct_reclaim commented out and expect almost no difference.
> I thought the progress try_to_free_pages is doing might be important as well so I took numbers for that too.
> From those I see that the good case as well as the bad case has an average of 62 pages freed in fast path.
> But in slow path they start to differ - while the good case that is running only seldom in that path anyway frees an average of 1.46 pages (that might be the thing causing it not getting a page eventually) in the bad case it makes a progress of avg 37 pages even in slow path.
>
> PAGES-FREED fast path slow path
> GOOD CASE ~62 ~1.46
> BAD CASE ~62 ~37
5f8dcc21 introduced per migrate type pcp lists, is it possible that we
run in a scenario where try_to_free frees a lot of pages via, but of the
wrong migrate type?
And afterwards get_page
At least try_to_free_page and it's called shrink_ functions is not
migrate type aware while get_page and its subsequent buffered_rmqueue
and rmqueue_bulk are - btw here comes patch e084b.
I only see buffered_rmqueue chose a specific pcp list based on migrate
type, and a fallback to migrate_reserve - is that enough fallback, what
if the reserve is empty too but a few other types would not and those
other types are the ones filled by try_to_free?
I'll try to get a per migrate type #pages statistic after direct_reclaim
reaches !page - maybe that can confirm some parts of my theory.
> Thinking of it as asking "how few pages do I have to free until I fall from fast to slow path" the kernels behave different it looks wrong but interesting.
> The good case only drops to slow path (!page) in case of ~1.46 pages freed while the bad case seems to enter that much earlier with even 37 pages freed.
>
> As order is always 0 and get_page afaik about getting just "one" page I wonder where these 37 pages disappeared especially as in bad case it is much faster getting to get_pages after freeing those ~37 pages.
>
> Comments and ideas welcome!
>
>
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
next prev parent reply other threads:[~2010-02-17 9:55 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-12-07 14:39 Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set" Christian Ehrhardt
2009-12-07 15:09 ` Mel Gorman
2009-12-08 17:59 ` Christian Ehrhardt
2009-12-10 14:36 ` Christian Ehrhardt
2009-12-11 11:20 ` Mel Gorman
2009-12-11 14:47 ` Christian Ehrhardt
2009-12-18 13:38 ` Christian Ehrhardt
2009-12-18 17:42 ` Mel Gorman
2010-01-14 12:30 ` Christian Ehrhardt
2010-01-19 11:33 ` Mel Gorman
2010-02-05 15:51 ` Christian Ehrhardt
2010-02-05 17:49 ` Mel Gorman
2010-02-08 14:01 ` Christian Ehrhardt
2010-02-08 15:21 ` Mel Gorman
2010-02-08 16:55 ` Mel Gorman
2010-02-09 6:23 ` Christian Ehrhardt
2010-02-09 15:52 ` Christian Ehrhardt
2010-02-09 17:57 ` Mel Gorman
2010-02-11 16:11 ` Christian Ehrhardt
2010-02-12 10:05 ` Nick Piggin
2010-02-15 6:59 ` Nick Piggin
2010-02-15 15:46 ` Christian Ehrhardt
2010-02-16 11:25 ` Mel Gorman
2010-02-16 16:47 ` Christian Ehrhardt
2010-02-17 9:55 ` Christian Ehrhardt [this message]
2010-02-17 10:03 ` Christian Ehrhardt
2010-02-18 11:43 ` Mel Gorman
2010-02-18 16:09 ` Christian Ehrhardt
2010-02-19 11:19 ` Christian Ehrhardt
2010-02-19 15:19 ` Mel Gorman
2010-02-22 15:42 ` Christian Ehrhardt
2010-02-25 15:13 ` Christian Ehrhardt
2010-02-26 11:18 ` Nick Piggin
2010-03-02 6:52 ` Nick Piggin
2010-03-02 10:04 ` Mel Gorman
2010-03-02 10:36 ` Nick Piggin
2010-03-02 11:01 ` Mel Gorman
2010-03-02 11:18 ` Nick Piggin
2010-03-02 11:24 ` Mel Gorman
2010-03-03 6:51 ` Christian Ehrhardt
2010-02-08 15:02 ` Christian Ehrhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4B7BBCFC.4090101@linux.vnet.ibm.com \
--to=ehrhardt@linux.vnet.ibm.com \
--cc=SCHILLIG@de.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=christof.schmitt@de.ibm.com \
--cc=epasch@de.ibm.com \
--cc=gregkh@novell.com \
--cc=hare@suse.de \
--cc=heiko.carstens@de.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mel@csn.ul.ie \
--cc=npiggin@suse.de \
--cc=schwidefsky@de.ibm.com \
--cc=thoss@de.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).