From: Shaohua Li <shaohua.li@intel.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
"cl@linux.com" <cl@linux.com>,
Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>
Subject: Re: zone state overhead
Date: Wed, 13 Oct 2010 10:41:36 +0800
Message-ID: <20101013024136.GA16665@sli10-conroe.sh.intel.com>
In-Reply-To: <20101012162526.GG30667@csn.ul.ie>

On Wed, Oct 13, 2010 at 12:25:26AM +0800, Mel Gorman wrote:
> > > > > > In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> > > > > > according to perf when memory pressure is high. The workload does something
> > > > > > like:
> > > > > > for i in `seq 1 $nr_cpu`
> > > > > > do
> > > > > > create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> > > > > > $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> > > > > > done
> > > > > > This simply reads a sparse file for each CPU. Apparently
> > > > > > zone->percpu_drift_mark is too big, and I guess zone_page_state_snapshot() causes
> > > > > > a lot of cache bouncing on ->vm_stat_diff[]. Below is the zoneinfo for reference.
> > > > >
> > > > > Would it be possible for you to post the oprofile report? I'm in the
> > > > > early stages of trying to reproduce this locally based on your test
> > > > > description. The first machine I tried showed that zone_nr_free_pages
> > > > > was consuming 0.26% of profile time with the vast bulk occupied by
> > > > > do_mpage_readpage. See as follows
> > > > >
> > > > > 1599339 53.3463 vmlinux-2.6.36-rc7-pcpudrift do_mpage_readpage
> > > > > 131713 4.3933 vmlinux-2.6.36-rc7-pcpudrift __isolate_lru_page
> > > > > 103958 3.4675 vmlinux-2.6.36-rc7-pcpudrift free_pcppages_bulk
> > > > > 85024 2.8360 vmlinux-2.6.36-rc7-pcpudrift __rmqueue
> > > > > 78697 2.6250 vmlinux-2.6.36-rc7-pcpudrift native_flush_tlb_others
> > > > > 75678 2.5243 vmlinux-2.6.36-rc7-pcpudrift unlock_page
> > > > > 68741 2.2929 vmlinux-2.6.36-rc7-pcpudrift get_page_from_freelist
> > > > > 56043 1.8693 vmlinux-2.6.36-rc7-pcpudrift __alloc_pages_nodemask
> > > > > 55863 1.8633 vmlinux-2.6.36-rc7-pcpudrift ____pagevec_lru_add
> > > > > 46044 1.5358 vmlinux-2.6.36-rc7-pcpudrift radix_tree_delete
> > > > > 44543 1.4857 vmlinux-2.6.36-rc7-pcpudrift shrink_page_list
> > > > > 33636 1.1219 vmlinux-2.6.36-rc7-pcpudrift zone_watermark_ok
> > > > > .....
> > > > > 7855 0.2620 vmlinux-2.6.36-rc7-pcpudrift zone_nr_free_pages
> > > > >
> > > > > The machine I am testing on is a non-NUMA 4-core single-socket machine with totally
> > > > > different characteristics, but I want to be sure I'm going in more or less the
> > > > > right direction with the reproduction case before trying to find a larger
> > > > > machine.
> > > >
> > > > Here it is. This is a 4-socket Nehalem machine.
> > > > 268160.00 57.2% _raw_spin_lock /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > 40302.00 8.6% zone_nr_free_pages /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > 36827.00 7.9% do_mpage_readpage /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > 28011.00 6.0% _raw_spin_lock_irq /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > 22973.00 4.9% flush_tlb_others_ipi /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > 10713.00 2.3% smp_invalidate_interrupt /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > >
> > > <SNIP>
> > >
> > Basically a similar test. I'm using Fengguang's test, please check the attached
> > file. I didn't enable lock stat or debug. The difference is that my test runs on a
> > 4-socket system. On a 1-socket system, I don't see the issue either.
> >
>
> Ok, finding a large enough machine was key here true enough. I don't
> have access to Nehalem boxes but the same problem showed up on a large
> ppc64 machine (8 socket, interestingly enough a 3 socket did not have any
> significant problem). Based on that, I reproduced the problem and came up
> with the patch below.
>
> Christoph, can you look at this please? I know you had concerns about adjusting
> thresholds as being an expensive operation but the patch limits how often it
> occurs and it seems better than reducing thresholds for the full lifetime of
> the system just to avoid counter drift. What I did find with the patch is that
> the overhead of __mod_zone_page_state() increases because of the temporarily
> reduced threshold. It goes from 0.0403% of profile time to 0.0967% on one
> machine and from 0.0677% to 0.43% on another. As this is just while kswapd
> is awake, it seems within an acceptable margin but it is a caution against
> simply reducing the existing thresholds. What is more relevant is that the time
> to complete the benchmark is increased due to the reduction of the thresholds.
> This is a tradeoff between being fast and safe but I'm open to
> suggestions on how high a safe threshold might be.
>
> Shaohua, can you test keeping an eye out for any additional function
> that is now taking a lot more CPU time?

Seems OK so far on the 4-socket system. On this system each node has 8G of
memory, so the threshold is 5 under memory pressure. I haven't tested this
on smaller machines yet, for example ones where each node has only 4G of memory.

Thanks,
Shaohua
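
For readers following the thread, the trade-off between counter drift and
update overhead can be illustrated with a small user-space model. This is a
rough sketch only, not kernel code: the class and function names are
hypothetical, the starting free-page count and pages per CPU are made up, and
125 is assumed here as a stand-in for a large normal threshold. The 64 CPUs
and the reduced threshold of 5 are the figures mentioned in this thread.

```python
class PerCpuCounter:
    """Toy model of a zone counter with per-CPU deltas (not kernel code)."""

    def __init__(self, nr_cpus, threshold, start):
        self.threshold = threshold
        self.global_count = start      # stands in for the shared zone counter
        self.diff = [0] * nr_cpus      # stands in for per-CPU vm_stat_diff[]
        self.folds = 0                 # number of updates to the shared counter

    def mod(self, cpu, delta):
        # Accumulate locally; touch the shared counter only when the local
        # delta exceeds the threshold in either direction.
        d = self.diff[cpu] + delta
        if abs(d) > self.threshold:
            self.global_count += d
            self.folds += 1
            d = 0
        self.diff[cpu] = d

    def read(self):
        # Cheap read: shared counter only, may be off by up to max_drift().
        return self.global_count

    def snapshot(self):
        # Accurate read: walks every CPU's delta, which is what makes the
        # snapshot-style read expensive on a machine with many CPUs.
        return self.global_count + sum(self.diff)

    def max_drift(self):
        return self.threshold * len(self.diff)


def simulate(nr_cpus, threshold, pages_per_cpu, start):
    # Each CPU allocates pages one at a time from the free-page counter.
    counter = PerCpuCounter(nr_cpus, threshold, start)
    for cpu in range(nr_cpus):
        for _ in range(pages_per_cpu):
            counter.mod(cpu, -1)
    return counter


for threshold in (125, 5):
    c = simulate(nr_cpus=64, threshold=threshold, pages_per_cpu=100, start=10000)
    print(threshold, c.read(), c.snapshot(), c.folds, c.max_drift())
```

In this model a large threshold leaves the cheap read far from the true value
while never touching the shared counter, whereas a small threshold keeps the
cheap read within threshold * nr_cpus of the truth at the cost of many more
shared-counter updates, which matches the increase in
__mod_zone_page_state() overhead described above.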