From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <npiggin@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	epasch@de.ibm.com, SCHILLIG@de.ibm.com,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	christof.schmitt@de.ibm.com, thoss@de.ibm.com, hare@suse.de,
	gregkh@novell.com
Subject: Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
Date: Tue, 16 Feb 2010 17:47:26 +0100	[thread overview]
Message-ID: <4B7ACC1E.9080205@linux.vnet.ibm.com> (raw)
In-Reply-To: <20100216112517.GE1194@csn.ul.ie>

[-- Attachment #1: Type: text/plain, Size: 5686 bytes --]

Mel Gorman wrote:
> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
[...]
>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>
>>  23 @@ -965,7 +965,7 @@
>>  24                 set_page_private(page, migratetype);
>>  25                 list = &page->lru;
>>  26         }
>>  27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>  28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>  29         spin_unlock(&zone->lock);
>>  30         return i;
>>  31  }
>>
[...]
> It "fixes" it only by not calling direct reclaim when it should :(

Yeah, as we both realized, -1 was not right, so it was more of a crazy workaround than a fix :-)

Anyway, with that being a dead end again, I dug even deeper into the details of direct_reclaim. I think we can agree that, from the counters we already have, the race between try_to_free_pages making progress and get_page_from_freelist still not getting a page - which then causes the congestion_wait - is the source of the issue.

To bring some more light into all that, I extended my perf counters to track a few more details in direct_reclaim.
Two patches are attached; they apply on top of the three already posted in this thread.
The intention is
a) to track the time (timestamps taken via get_clock(); see the conversion sketch right after this list)
 a1) spent in try_to_free_pages
 a2) consumed after try_to_free_pages until get_page_from_freelist
 a3) spent in get_page_from_freelist
b1) to check the order argument - after seeing that order!=0 leads to drain_all_pages I wondered whether order might differ, even though all calls look like they pass zero
b2) to track the average number of pages freed by try_to_free_pages for the fast path and the slow path (progress&!page)
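
For reference, the timestamps come from the s390 TOD clock via get_clock(), which ticks 4096 times per microsecond. The helper in the attached patch converts TOD deltas to nanoseconds (ns = tod * 1000 / 4096 = (tod * 125) >> 9), so all durations below are in ns:

	/* convert a TOD clock range to nanoseconds:
	 * 4096 TOD units per microsecond
	 * => ns = tod * 1000 / 4096 = (tod * 125) >> 9
	 */
	static inline unsigned long long tod_duration_ns(unsigned long long from,
							 unsigned long long to)
	{
		return ((to * 125) >> 9) - ((from * 125) >> 9);
	}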

Naming convention (otherwise it might get confusing here):
Good case - the scenario with e.g. e084b and 5f8dcc21 reverted, resulting in high throughput and a low ratio of direct_reclaim calls running into progress&!page
Bad case - the scenario on e.g. a clean 2.6.32
Fast path - direct reclaim calls that did not run into progress&!page
Slow path - direct reclaim calls that ran into progress&!page and therefore ended up in a long congestion_wait, hence the name "slow" path

Mini summary of what we had before in huge tables:
            fast path   slow path
GOOD CASE      ~98%       ~1-3%
BAD CASE       ~70%        ~30%
-> leading to a throughput impact of e.g. 600 MB/s with 16 iozone threads (worse with even more threads)

Out of the numbers I got, the following things might help us to create a new approach to a solution.
The timings show that the so-called slow path actually passes through direct_reclaim much faster in the bad case.

GOOD CASE                                        duration [ns]
a1 Fast-avg-duration_pre_ttf_2_post_ttf            164099
a2 Fast-avg-duration_post_ttf_2_pre_get_page          459
a3 Fast-avg-duration_pre_get_page_2_post_get_page     346
a1 Slow-avg-duration_pre_ttf_2_post_ttf            127621
a2 Slow-avg-duration_post_ttf_2_pre_get_page         1957
a3 Slow-avg-duration_pre_get_page_2_post_get_page     256
BAD CASE                                         duration [ns]   deviation from good case in %
a1 Fast-avg-duration_pre_ttf_2_post_ttf            122921   -25.09%
a2 Fast-avg-duration_post_ttf_2_pre_get_page          521   13.53%
a3 Fast-avg-duration_pre_get_page_2_post_get_page     244   -29.55%
a1 Slow-avg-duration_pre_ttf_2_post_ttf            109740   -14.01%
a2 Slow-avg-duration_post_ttf_2_pre_get_page          250   -87.18%
a3 Slow-avg-duration_pre_get_page_2_post_get_page     117   -54.16%

That means that in the bad case the execution is much faster. Especially in the calls that eventually run into the slow path: try_to_free_pages is 14% faster, and, more importantly, the time between try_to_free_pages and get_page_from_freelist is 87% (!) faster => less than a fifth of the good case. Finally, get_page_from_freelist is 54% faster, but that is probably just failing quickly on an almost empty list ^^.

As I checked, order was always zero, so the time is not spent in drain_all_pages, and the only other thing left in that span might be cond_resched ?!
Everything else there is a few assignments, so it can't be much else.
But why would e.g. not running into schedule() from cond_resched() cause get_page_from_freelist to not find anything? I don't know, and I would expect it to be the other way around - the faster you get from freeing to getting, the more pages should be left.
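
For reference, the code between the t2 and t3 timestamps of the patched __alloc_pages_direct_reclaim looks roughly like this (2.6.32, abbreviated) - with order always 0 the only call in that span that can take noticeable time is cond_resched():

	t2 = get_clock();

	p->reclaim_state = NULL;
	lockdep_clear_current_reclaim_state();
	p->flags &= ~PF_MEMALLOC;

	cond_resched();			/* may schedule away */

	if (order != 0)			/* always 0 in these runs ...    */
		drain_all_pages();	/* ... so this is never executed */

	t3 = get_clock();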

I thought the progress try_to_free_pages is making might be important as well, so I took numbers for that too.
From those I see that the good case as well as the bad case frees an average of 62 pages in the fast path.
But in the slow path they start to differ: the good case - which runs into that path only seldom anyway - frees an average of 1.46 pages (that might be what eventually causes it to not get a page), while the bad case makes progress of 37 pages on average even in the slow path.

PAGES-FREED  fast path   slow path
GOOD CASE      ~62       ~1.46
BAD CASE       ~62       ~37
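
(For completeness: the averages above are simply the sum counters divided by the per-path call counts, roughly like this - the call counts come from the counters added by the earlier patches in this thread:)

	/* slow path = calls that ran into progress && !page */
	slow_calls = perf_count_failed_pages_direct_reclaim_but_progress;
	fast_calls = perf_count_pages_direct_reclaim - slow_calls;

	avg_pages_freed_fast = perf_count_direct_reclaim_fast_sum_progress / fast_calls;
	avg_pages_freed_slow = perf_count_direct_reclaim_slow_sum_progress / slow_calls;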

Thinking of it as asking "how few pages do I have to free before I fall from the fast to the slow path", the two kernels behave differently - it looks wrong, but interesting.
The good case only drops to the slow path (!page) when as few as ~1.46 pages were freed, while the bad case seems to enter it much earlier, even with 37 pages freed.

As order is always 0 and get_page_from_freelist afaik only needs "one" page, I wonder where those 37 pages disappeared to - especially as in the bad case it gets from freeing those ~37 pages to get_page_from_freelist much faster.

Comments and ideas welcome!

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 

[-- Attachment #2: perf-count-direct-reclaim-durations.diff --]
[-- Type: text/x-patch, Size: 6199 bytes --]

Index: linux-2.6-git/kernel/sysctl.c
===================================================================
--- linux-2.6-git.orig/kernel/sysctl.c	2010-02-16 11:06:29.000000000 +0100
+++ linux-2.6-git/kernel/sysctl.c	2010-02-16 13:38:48.000000000 +0100
@@ -264,9 +264,64 @@
 extern unsigned long perf_count_failed_pages_direct_reclaim_but_progress;
 extern unsigned long perf_count_call_congestion_wait_from_alloc_pages_high_priority;
 extern unsigned long perf_count_call_congestion_wait_from_alloc_pages_slowpath;
+
+extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page;
+extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page;
+extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_ttf;
 static struct ctl_table perf_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf",
+		.data           = &perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page",
+		.data           = &perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page",
+		.data           = &perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf",
+		.data           = &perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page",
+		.data           = &perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page",
+		.data           = &perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname       = "perf_count_congestion_wait",
 		.data           = &perf_count_congestion_wait,
 		.mode           = 0666,
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2010-02-16 11:06:29.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2010-02-16 13:38:24.000000000 +0100
@@ -51,6 +51,7 @@
 #include <trace/events/kmem.h>
 
 #include <asm/tlbflush.h>
+#include <asm/timex.h>
 #include <asm/div64.h>
 #include "internal.h"
 
@@ -1655,6 +1656,17 @@
 unsigned long perf_count_failed_pages_direct_reclaim = 0;
 unsigned long perf_count_failed_pages_direct_reclaim_but_progress = 0;
 
+unsigned long long perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf = 0;
+unsigned long long perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page = 0;
+unsigned long long perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page = 0;
+unsigned long long perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf = 0;
+unsigned long long perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page = 0;
+unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page = 0;
+
+static inline unsigned long long tod_duration_ns(unsigned long long from, unsigned long long to) {
+	return ((to*125)>>9)-((from*125)>>9);
+}
+
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1665,6 +1677,7 @@
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	unsigned long long t1,t2,t3,t4;
 
 	cond_resched();
 
@@ -1675,8 +1688,12 @@
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
+	t1 = get_clock();
+
 	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
 
+	t2 = get_clock();
+
 	p->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
 	p->flags &= ~PF_MEMALLOC;
@@ -1686,17 +1703,28 @@
 	if (order != 0)
 		drain_all_pages();
 
+	t3 = get_clock();
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+	t4 = get_clock();
 
 	perf_count_pages_direct_reclaim++;
 	if (!page)
 		perf_count_failed_pages_direct_reclaim++;
-	if (!page && (*did_some_progress))
+	if (!page && (*did_some_progress)) {
 		perf_count_failed_pages_direct_reclaim_but_progress++;
+		perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf += tod_duration_ns(t1,t2);
+		perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page += tod_duration_ns(t2,t3);
+		perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page += tod_duration_ns(t3,t4);
+	}
+	else {
+		perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf += tod_duration_ns(t1,t2);
+		perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page += tod_duration_ns(t2,t3);
+		perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page += tod_duration_ns(t3,t4);
+	}
 
 	return page;
 }

[-- Attachment #3: perf-count-direct-reclaim-orderandprogress.diff --]
[-- Type: text/x-patch, Size: 4380 bytes --]

Index: linux-2.6-git/kernel/sysctl.c
===================================================================
--- linux-2.6-git.orig/kernel/sysctl.c	2010-02-16 13:38:48.000000000 +0100
+++ linux-2.6-git/kernel/sysctl.c	2010-02-16 15:07:21.000000000 +0100
@@ -267,13 +267,49 @@
 
 extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf;
 extern unsigned long long perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page;
-extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page;
 extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf;
 extern unsigned long long perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page;
-extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page;
+extern unsigned long perf_count_direct_reclaim_fast_sum_progress;
+extern unsigned long perf_count_direct_reclaim_slow_sum_progress;
+extern unsigned long perf_count_direct_reclaim_fast_sum_order;
+extern unsigned long perf_count_direct_reclaim_slow_sum_order;
 static struct ctl_table perf_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_sum_progress",
+		.data           = &perf_count_direct_reclaim_slow_sum_progress,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_sum_order",
+		.data           = &perf_count_direct_reclaim_slow_sum_order,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_sum_progress",
+		.data           = &perf_count_direct_reclaim_fast_sum_progress,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_sum_order",
+		.data           = &perf_count_direct_reclaim_fast_sum_order,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname       = "perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf",
 		.data           = &perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf,
 		.mode           = 0666,
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2010-02-16 14:58:28.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2010-02-16 15:06:15.000000000 +0100
@@ -1663,6 +1663,11 @@
 unsigned long long perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page = 0;
 unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page = 0;
 
+unsigned long perf_count_direct_reclaim_fast_sum_progress = 0;
+unsigned long perf_count_direct_reclaim_slow_sum_progress = 0;
+unsigned long perf_count_direct_reclaim_fast_sum_order = 0;
+unsigned long perf_count_direct_reclaim_slow_sum_order = 0;
+
 static inline unsigned long long tod_duration_ns(unsigned long long from, unsigned long long to) {
 	return ((to*125)>>9)-((from*125)>>9);
 }
@@ -1719,11 +1724,17 @@
 		perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf += tod_duration_ns(t1,t2);
 		perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page += tod_duration_ns(t2,t3);
 		perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page += tod_duration_ns(t3,t4);
+
+		perf_count_direct_reclaim_slow_sum_progress += *did_some_progress;
+		perf_count_direct_reclaim_slow_sum_order += order;
 	}
 	else {
 		perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf += tod_duration_ns(t1,t2);
 		perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page += tod_duration_ns(t2,t3);
 		perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page += tod_duration_ns(t3,t4);
+
+		perf_count_direct_reclaim_fast_sum_progress += *did_some_progress;
+		perf_count_direct_reclaim_fast_sum_order += order;
 	}
 
 	return page;
