From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
To: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>,
Andrew Morton <akpm@linux-foundation.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
epasch@de.ibm.com, SCHILLIG@de.ibm.com,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
christof.schmitt@de.ibm.com, thoss@de.ibm.com, hare@suse.de,
gregkh@novell.com
Subject: Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
Date: Mon, 15 Feb 2010 16:46:53 +0100 [thread overview]
Message-ID: <4B796C6D.80800@linux.vnet.ibm.com> (raw)
In-Reply-To: <20100212100519.GA29085@laptop>
Nick Piggin wrote:
> On Thu, Feb 11, 2010 at 05:11:24PM +0100, Christian Ehrhardt wrote:
>>> 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
>>> number of parameters being passed is making some unexpected large
>>> difference
>> BINGO - this definitely hit something.
>> This experimental patch does two things - on one hand it closes the race we had:
>>
>> 4 THREAD READ 8 THREAD READ 16 THREAD READ %ofcalls
>> perf_count_congestion_wait 13 27 52
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath 13 27 52 99.52%
>> perf_count_pages_direct_reclaim 30867 56811 131552
>> perf_count_failed_pages_direct_reclaim 14 24 52
>> perf_count_failed_pages_direct_reclaim_but_progress 14 21 52 0.04% !!!
>>
>> On the other hand we see that the number of direct_reclaim calls increased by ~x4.
>>
>> I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders which worked on x4 number of pages at once.
>
> But the order parameter was always passed as constant 0 by the caller?
Uh - I didn't see that in the first look - you're right.
So the x4 in direct_reclaim calls are not caused by e.g. missing order information. Thanks for spotting this.
>
>> This leaves me with two ideas what the real issue could be:
>> 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances
>
> It must be something to do with code generation AFAIKS. I'm surprised
> the function isn't inlined, giving exactly the same result regardless
> of the patch.
after checking asm I can tell that rmqueue_bulk is inlined.
Therefore the test to inline it explicitly as requested in another mail can be considered negligible.
It actually boils down to something different in Mels patch - see below.
> Unlikely to be a correctness issue with code generation, but I'm
> really surprised that a small difference in performance could have
> such a big (and apparently repeatable) effect on behaviour like this.
>
> What's the assembly look like?
>
Line 214763
This is the preparation of the __mod_zone_page_state call from rmqueue_bulk which is inlined into
get_page_from_freelist.
!ORDER CHANGED FOR READABILITY!
FAST SLOW
lgr %r2,%r11 #parm 1 from r11 lgr %r2, %r11 #parm 1 from r11 - same
lghi %r3,0 #parm 2 = 0 const lghi %r3,0 #parm 2 = 0 - same
lghi %r4,-1 #parm 3 = -1 const lcr %r4,%r12 #parm 3 as complement of r12?
lgfr %r4,%r4 #parm 3 - clear upper 32 bit?
brasl #call brasl #call
=> This is most probably this part of Mel's patch:
23 @@ -965,7 +965,7 @@
24 set_page_private(page, migratetype);
25 list = &page->lru;
26 }
27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -1);
29 spin_unlock(&zone->lock);
30 return i;
31 }
-> doesn't look too important, but to be sure we can build an executable with
just this change, but still passing order to rmqueue_bulk - see below.
Line 214800
Differend end of the function. In Slow path there is a check to %r4 and
dependent two different jump targets inside get_page_from_freelist, while in
the fast version there is only an unconditional jump (both after a
_raw_spin_lock_wait).
FAST SLOW
brasl # call _raw_spin_lock_wait brasl # _raw_spin_lock_wait
j 1e6294 get_page_from_freelist lg %r4,288(%r15) # from stack
ltgr %r4,%r4 # compare
jne 1e62a2 get_page_from_freelist
lhi %r12,0 # new call gets r12 @ 0 (GOT?)
j 1e6340 get_page_from_freelist
The rest is the same. So what is left is that the slow variant has a new branch
back in to get_page_from_freelist with r12 set to zero. This jump wents directly
after the also new lcr call seen in the first difference and might therfore be
related to that change.
Now I first thought it might not be rmqueue_bulk that misses optimization, but
__mod_zone_page_state - but that one looks the same in both cases.
Therefore I took a shot for 2.6.32 with just that snippet above applied (__mod_zone_page_state call without order which should be fine as we know it is 0 in all cases).
And it is interesting to see that this snippet alone is enough to fix throughput and the direct_reclaim -> progress&!page ratio.
The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
23 @@ -965,7 +965,7 @@
24 set_page_private(page, migratetype);
25 list = &page->lru;
26 }
27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
29 spin_unlock(&zone->lock);
30 return i;
31 }
Now that one has still the issue of the very huge throughput loss and the bad ratio of driect_reclaims leading into congestion_wait.
Now as far as I can see that - oh so important - "-i" or "-1" ends up in zone_page_state_add as variable x:
static inline void zone_page_state_add(long x, struct zone *zone,
enum zone_stat_item item)
{
atomic_long_add(x, &zone->vm_stat[item]);
atomic_long_add(x, &vm_stat[item]);
}
I guess the intention there is to update the zone with the number of pages freed - and I also guess that -1 as constant might be wrong there.
That would also explain some weird output I saw like this after boot:
h37lp01:~ # cat /proc/meminfo
MemTotal: 251216 kB
MemFree: 483460 kB bigger than total
BUT, and that is now again open for discussion - "__mod_zone_page_state(zone, NR_FREE_PAGES, -1)" even being wrong - fixed the issue in terms of throughput and the congestion_waits as good as reverting e084b+5f8dcc21!
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
next prev parent reply other threads:[~2010-02-15 15:47 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-12-07 14:39 Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set" Christian Ehrhardt
2009-12-07 15:09 ` Mel Gorman
2009-12-08 17:59 ` Christian Ehrhardt
2009-12-10 14:36 ` Christian Ehrhardt
2009-12-11 11:20 ` Mel Gorman
2009-12-11 14:47 ` Christian Ehrhardt
2009-12-18 13:38 ` Christian Ehrhardt
2009-12-18 17:42 ` Mel Gorman
2010-01-14 12:30 ` Christian Ehrhardt
2010-01-19 11:33 ` Mel Gorman
2010-02-05 15:51 ` Christian Ehrhardt
2010-02-05 17:49 ` Mel Gorman
2010-02-08 14:01 ` Christian Ehrhardt
2010-02-08 15:21 ` Mel Gorman
2010-02-08 16:55 ` Mel Gorman
2010-02-09 6:23 ` Christian Ehrhardt
2010-02-09 15:52 ` Christian Ehrhardt
2010-02-09 17:57 ` Mel Gorman
2010-02-11 16:11 ` Christian Ehrhardt
2010-02-12 10:05 ` Nick Piggin
2010-02-15 6:59 ` Nick Piggin
2010-02-15 15:46 ` Christian Ehrhardt [this message]
2010-02-16 11:25 ` Mel Gorman
2010-02-16 16:47 ` Christian Ehrhardt
2010-02-17 9:55 ` Christian Ehrhardt
2010-02-17 10:03 ` Christian Ehrhardt
2010-02-18 11:43 ` Mel Gorman
2010-02-18 16:09 ` Christian Ehrhardt
2010-02-19 11:19 ` Christian Ehrhardt
2010-02-19 15:19 ` Mel Gorman
2010-02-22 15:42 ` Christian Ehrhardt
2010-02-25 15:13 ` Christian Ehrhardt
2010-02-26 11:18 ` Nick Piggin
2010-03-02 6:52 ` Nick Piggin
2010-03-02 10:04 ` Mel Gorman
2010-03-02 10:36 ` Nick Piggin
2010-03-02 11:01 ` Mel Gorman
2010-03-02 11:18 ` Nick Piggin
2010-03-02 11:24 ` Mel Gorman
2010-03-03 6:51 ` Christian Ehrhardt
2010-02-08 15:02 ` Christian Ehrhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4B796C6D.80800@linux.vnet.ibm.com \
--to=ehrhardt@linux.vnet.ibm.com \
--cc=SCHILLIG@de.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=christof.schmitt@de.ibm.com \
--cc=epasch@de.ibm.com \
--cc=gregkh@novell.com \
--cc=hare@suse.de \
--cc=heiko.carstens@de.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mel@csn.ul.ie \
--cc=npiggin@suse.de \
--cc=schwidefsky@de.ibm.com \
--cc=thoss@de.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.