prefetch on ppc64

All of lore.kernel.org
 help / color / mirror / Atom feed

* prefetch on ppc64
@ 2005-03-30  3:40 Serge E. Hallyn
  2005-03-30  3:55 ` Paul Mackerras
  0 siblings, 1 reply; 5+ messages in thread
From: Serge E. Hallyn @ 2005-03-30  3:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: linuxppc64-dev

[-- Attachment #1: Type: text/plain, Size: 1921 bytes --]

Hi,

While investigating the inordinate performance impact one of my patches
seemed to be having, we tracked it down to two hlist_for_each_entry
loops, and finally to the prefetch instruction in the loop.

The machine I'm testing on has 4 power5 1.5Ghz cpus and 16G ram.  I was
mostly using dbench (v3.03) in runs of 50 and 100 on an ext2 system.
Kernel was 2.6.11-rc5.

I've not had much of a chance to test on x86, but the few tests I've run
have shown that prefetch does improve performance there.  From what I've
seen this seems to be a ppc (perhaps ppc64) specific symptom.

Following are two sets of interesting results on the ppc64 system.  The
first is on a stock 2.6.11-rc5 kernel.  The actual stock kernel gave the
following results for 100 runs of dbench:
# elements: 100, mean 862.580380, variance 5.973441, std dev 2.444062

When I patched fs/dcache.c to replace the three hlist_for_each_entry{,_rcu}
rules with manual loops, as shown in the attached file dcache-nohlist.patch,
I got:
# elements: 50, mean 881.804980, variance 10.695022, std dev 3.270325

The next set of results is based on 2.6.11-rc5 with the LSM stacking
patches (from www.sf.net/projects/lsm-stacker).  I was understandably
alarmed to find the original patched version gave me:
# elements: 100, mean 797.654870, variance 7.503588, std dev 2.739268

The code which I determined to be responsible contained two
list_for_each_entry loops,  Replacing one with a manual loop gave me
# elements: 50, mean 835.859980, variance 81.901719, std dev 9.049957
and replacing the second gave me
# elements: 50, mean 846.541060, variance 17.095401, std dev 4.134658

Finally I followed Paul McKenney's suggestion and just commented out the
ppc definition of prefetch altogether, which gave me:

# elements: 50, mean 860.823880, variance 47.567428, std dev 6.896914

I am currently testing this same patch against a non-stacking kernel.

thanks,
-serge

[-- Attachment #2: dcache-nohlist.patch --]
[-- Type: text/plain, Size: 1052 bytes --]

Index: linux-2.6.11-rc5-nostack/fs/dcache.c
===================================================================
--- linux-2.6.11-rc5-nostack.orig/fs/dcache.c	2005-03-11 15:19:58.000000000 -0600
+++ linux-2.6.11-rc5-nostack/fs/dcache.c	2005-03-26 01:35:29.000000000 -0600
@@ -656,7 +656,7 @@
 	do {
 		found = 0;
 		spin_lock(&dcache_lock);
-		hlist_for_each(lp, head) {
+		for (lp=head->first; lp; lp = lp->next) {
 			struct dentry *this = hlist_entry(lp, struct dentry, d_hash);
 			if (!list_empty(&this->d_lru)) {
 				dentry_stat.nr_unused--;
@@ -1047,7 +1047,9 @@

 	rcu_read_lock();

-	hlist_for_each_rcu(node, head) {
+	for (node=head->first; node;
+			({ node = node->next; smp_read_barrier_depends(); }))
+	{
 		struct dentry *dentry; 
 		struct qstr *qstr;

@@ -1123,7 +1125,7 @@

 	spin_lock(&dcache_lock);
 	base = d_hash(dparent, dentry->d_name.hash);
-	hlist_for_each(lhp,base) { 
+	for (lhp=base->first; lhp; lhp = lhp->next) {
 		/* hlist_for_each_rcu() not required for d_hash list
 		 * as it is parsed under dcache_lock
 		 */

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: prefetch on ppc64
  2005-03-30  3:40 prefetch on ppc64 Serge E. Hallyn
@ 2005-03-30  3:55 ` Paul Mackerras
  2005-03-30  5:38   ` Antonio Vargas
  2005-03-30 14:33   ` Serge E. Hallyn
  0 siblings, 2 replies; 5+ messages in thread
From: Paul Mackerras @ 2005-03-30  3:55 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: linux-kernel, linuxppc64-dev

Serge E. Hallyn writes:

> While investigating the inordinate performance impact one of my patches
> seemed to be having, we tracked it down to two hlist_for_each_entry
> loops, and finally to the prefetch instruction in the loop.

I would be interested to know what results you get if you leave the
loops using hlist_for_each_entry but change prefetch() and prefetchw()
to do the dcbt or dcbtst instruction only if the address is non-zero,
like this:

static inline void prefetch(const void *x)
{
	if (x)
	        __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
}

static inline void prefetchw(const void *x)
{
	if (x)
	        __asm__ __volatile__ ("dcbtst 0,%0" : : "r" (x));
}

It seems that doing a prefetch on a NULL pointer, while it doesn't
cause a fault, does waste time looking for a translation of the zero
address.

Paul.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: prefetch on ppc64
  2005-03-30  3:55 ` Paul Mackerras
@ 2005-03-30  5:38   ` Antonio Vargas
  2005-03-30  6:00     ` Paul Mackerras
  2005-03-30 14:33   ` Serge E. Hallyn
  1 sibling, 1 reply; 5+ messages in thread
From: Antonio Vargas @ 2005-03-30  5:38 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Serge E. Hallyn, linux-kernel, linuxppc64-dev

On Wed, 30 Mar 2005 13:55:25 +1000, Paul Mackerras <paulus@samba.org> wrote:
> Serge E. Hallyn writes:
> 
> > While investigating the inordinate performance impact one of my patches
> > seemed to be having, we tracked it down to two hlist_for_each_entry
> > loops, and finally to the prefetch instruction in the loop.
> 
> I would be interested to know what results you get if you leave the
> loops using hlist_for_each_entry but change prefetch() and prefetchw()
> to do the dcbt or dcbtst instruction only if the address is non-zero,
> like this:
> 
> static inline void prefetch(const void *x)
> {
>         if (x)
>                 __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
> }
> 
> static inline void prefetchw(const void *x)
> {
>         if (x)
>                 __asm__ __volatile__ ("dcbtst 0,%0" : : "r" (x));
> }
> 
> It seems that doing a prefetch on a NULL pointer, while it doesn't
> cause a fault, does waste time looking for a translation of the zero
> address.
> 
> Paul.

Don't know exactly about power5, but G5 processor is described on IBM
docs as doing automatic whole-page prefetch read-ahead when detecting
linear accesses.

-- 
Greetz, Antonio Vargas aka winden of network

http://wind.codepixel.com/

Las cosas no son lo que parecen, excepto cuando parecen lo que si son.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: prefetch on ppc64
  2005-03-30  5:38   ` Antonio Vargas
@ 2005-03-30  6:00     ` Paul Mackerras
  0 siblings, 0 replies; 5+ messages in thread
From: Paul Mackerras @ 2005-03-30  6:00 UTC (permalink / raw)
  To: Antonio Vargas; +Cc: Serge E. Hallyn, linux-kernel, linuxppc64-dev

Antonio Vargas writes:

> Don't know exactly about power5, but G5 processor is described on IBM
> docs as doing automatic whole-page prefetch read-ahead when detecting
> linear accesses.

Sure, but linked lists would rarely be laid out linearly in memory.

Paul.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: prefetch on ppc64
  2005-03-30  3:55 ` Paul Mackerras
  2005-03-30  5:38   ` Antonio Vargas
@ 2005-03-30 14:33   ` Serge E. Hallyn
  1 sibling, 0 replies; 5+ messages in thread
From: Serge E. Hallyn @ 2005-03-30 14:33 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-kernel, linuxppc64-dev

Quoting Paul Mackerras (paulus@samba.org):
> Serge E. Hallyn writes:
> 
> > While investigating the inordinate performance impact one of my patches
> > seemed to be having, we tracked it down to two hlist_for_each_entry
> > loops, and finally to the prefetch instruction in the loop.
> 
> I would be interested to know what results you get if you leave the
> loops using hlist_for_each_entry but change prefetch() and prefetchw()
> to do the dcbt or dcbtst instruction only if the address is non-zero,
> like this:
> 
> static inline void prefetch(const void *x)
> {
> 	if (x)
> 	        __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
> }
> 
> static inline void prefetchw(const void *x)
> {
> 	if (x)
> 	        __asm__ __volatile__ ("dcbtst 0,%0" : : "r" (x));
> }
> 
> It seems that doing a prefetch on a NULL pointer, while it doesn't
> cause a fault, does waste time looking for a translation of the zero
> address.

Hi,

Olof Johansson had suggested that earlier, except that his patch used

if (unlikely(!x))
	return;

Performance was quite good, but not as good as having prefetch completely
disabled.  I got

# elements: 50, mean 851.263680, variance 24.561146, std dev 4.955920

compared to 860.823880 stdev 6.896914 with prefetch disabled.

thanks,
-serge


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2005-03-30 14:34 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-03-30  3:40 prefetch on ppc64 Serge E. Hallyn
2005-03-30  3:55 ` Paul Mackerras
2005-03-30  5:38   ` Antonio Vargas
2005-03-30  6:00     ` Paul Mackerras
2005-03-30 14:33   ` Serge E. Hallyn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.