public inbox for linux-kernel@vger.kernel.org
* Re: Fw: potential /dev/urandom scalability improvement
       [not found] <20040325141923.7080c6f0.akpm@osdl.org>
@ 2004-03-25 22:47 ` Matt Mackall
  2004-03-26  1:45   ` David Mosberger
  0 siblings, 1 reply; 22+ messages in thread
From: Matt Mackall @ 2004-03-25 22:47 UTC (permalink / raw)
  To: David Mosberger, Andrew Morton; +Cc: linux-kernel

On Thu, Mar 25, 2004 at 02:19:23PM -0800, Andrew Morton wrote:
> 
> Matt, could you please review David's changes?
> 
> Begin forwarded message:
> 
> Date: Wed, 24 Mar 2004 22:06:57 -0800
> From: David Mosberger <davidm@napali.hpl.hp.com>
> To: akpm@osdl.org
> Cc: linux-kernel@vger.kernel.org, linux-ia64@vger.kernel.org
> Subject: potential /dev/urandom scalability improvement
> 
> 
> Hi Andrew,
> 
> I'm addressing this patch to you because you seem to have been the
> person who most recently made some performance improvements to the
> random driver.

That was probably me, actually.

> Oh, and somebody who actually understands this code may want to
> double-check the patch for correctness.

Seems perfectly sane.

However, I've got a few pending patches that touch the same areas and
do some more critical cleanup that I've been sitting on since the
2.6.0 freeze. So perhaps I should start pushing those again and we can
queue this behind them. David, if you get a chance, grab the latest
copy of my linux-tiny tree from

 http://www.selenic.com/tiny/2.6.5-rc2-tiny1-broken-out.tar.bz2
 http://www.selenic.com/tiny/2.6.5-rc2-tiny1.patch.bz2

and see how I've tweaked the pool structure and the locking and how
your bits fit with it.

> +#ifdef ARCH_HAS_PREFETCH
> +	for (cp = (char *) r->pool; cp <= (char *) (r->pool + wordmask); cp += PREFETCH_STRIDE)
> +		prefetch(cp);
> +#endif

Can we avoid adding this ifdef in some fashion? What does the compiler
generate here when prefetch is a no-op? This seems to call for a
prefetch_range(start, len) function/macro in any case.

-- 
Matt Mackall : http://www.selenic.com : Linux development and consulting

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-25 22:47 ` Matt Mackall
@ 2004-03-26  1:45   ` David Mosberger
  2004-03-26  2:00     ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: David Mosberger @ 2004-03-26  1:45 UTC (permalink / raw)
  To: Matt Mackall; +Cc: David Mosberger, Andrew Morton, linux-kernel

>>>>> On Thu, 25 Mar 2004 16:47:27 -0600, Matt Mackall <mpm@selenic.com> said:

  >> I'm addressing this patch to you because you seem to have been
  >> the person who most recently made some performance improvements
  >> to the random driver.

  Matt> That was probably me, actually.

Sorry, that's what I get for trusting the BK log.

  Matt> However, I've got a few pending patches that touch the same
  Matt> areas and do some more critical cleanup that I've been sitting
  Matt> on since the 2.6.0 freeze. So perhaps I should start pushing
  Matt> those again and we can queue this behind them. David, if you
  Matt> get a chance, grab the latest copy of my linux-tiny tree from

  Matt>
  Matt> http://www.selenic.com/tiny/2.6.5-rc2-tiny1-broken-out.tar.bz2
  Matt> http://www.selenic.com/tiny/2.6.5-rc2-tiny1.patch.bz2

  Matt> and see how I've tweaked the pool structure and the locking
  Matt> and how your bits fit with it.

Not much left of the original bits!

  >> +#ifdef ARCH_HAS_PREFETCH

  Matt> Can we avoid adding this ifdef in some fashion? What does the
  Matt> compiler generate here when prefetch is a no-op? This seems to
  Matt> call for a prefetch_range(start, len) function/macro in any
  Matt> case.

Sounds reasonable, but I would prefer to do this in separate steps.

I tried your changes and performance was virtually unchanged.  The
patch below is updated to go on top of your patch and gives about the
same performance as I reported yesterday.  For now, I defined an
inline prefetch_range().  If and when all architectures get updated to
define this directly, we can simply remove prefetch_range() from the
driver.

Thanks

	--david

--
--- drivers/char/random.c	2004-03-25 17:41:54.432358997 -0800
+++ drivers/char/random.c-davidm	2004-03-25 17:34:59.040063215 -0800
@@ -421,14 +421,20 @@
  **********************************************************************/
 
 struct entropy_store {
+	/* mostly-read data: */
 	const char *name;
+	struct poolinfo *poolinfo;
+	__u32 *pool;
+	/*
+	 * read-write data (colocate with lock such that when we get
+	 * the lock, we get the other data for "free"; may cause some
+	 * extra bus-transactions, though):
+	 */
+	spinlock_t lock ____cacheline_aligned;
 	unsigned add_ptr;
 	int entropy_count;
 	int input_rotate;
 	int reserved;
-	struct poolinfo *poolinfo;
-	spinlock_t lock;
-	__u32 *pool;
 };
 
 static __u32 input_pool_data[INPUT_POOL_WORDS];
@@ -456,6 +462,16 @@
 	.pool = nonblocking_pool_data
 };
 
+static inline void
+prefetch_range (void *addr, size_t len)
+{
+#ifdef ARCH_HAS_PREFETCH
+	char *cp, *end = (char *) addr + len;
+	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
+		prefetch(cp);
+#endif
+}
+
 /*
  * This function adds words into the entropy "pool".  It does not
  * update the entropy estimate.  The caller should call
@@ -472,17 +488,30 @@
 	static __u32 const twist_table[8] = {
 		         0, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
 		0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
-	unsigned i;
-	int new_rotate;
-	int wordmask = r->poolinfo->poolwords - 1;
-	__u32 w;
-	unsigned long flags;
+	unsigned long i, add_ptr, tap1, tap2, tap3, tap4, tap5;
+	int new_rotate, input_rotate;
+	unsigned long flags, wordmask = r->poolinfo->poolwords - 1;
+	__u32 w, next_w, *pool = r->pool;
+
+	/* Taps are constant, so we can load them without holding r->lock.  */
+	tap1 = r->poolinfo->tap1;
+	tap2 = r->poolinfo->tap2;
+	tap3 = r->poolinfo->tap3;
+	tap4 = r->poolinfo->tap4;
+	tap5 = r->poolinfo->tap5;
+	next_w = *in++;
 
 	spin_lock_irqsave(&r->lock, flags);
 
+	prefetch_range(pool, 4 * (wordmask + 1));
+	input_rotate = r->input_rotate;
+	add_ptr = r->add_ptr;
+
 	while (nwords--) {
-		w = rol32(*in++, r->input_rotate);
-		i = r->add_ptr = (r->add_ptr - 1) & wordmask;
+		w = rol32(next_w, input_rotate);
+		if (nwords > 0)
+			next_w = *in++;
+		i = add_ptr = (add_ptr - 1) & wordmask;
 
 		/*
 		 * Normally, we add 7 bits of rotation to the pool.
@@ -490,21 +519,24 @@
 		 * rotation, so that successive passes spread the
 		 * input bits across the pool evenly.
 		 */
-		new_rotate = r->input_rotate + 14;
+		new_rotate = input_rotate + 14;
 		if (i)
-			new_rotate = r->input_rotate + 7;
-		r->input_rotate = new_rotate & 31;
+			new_rotate = input_rotate + 7;
+		input_rotate = new_rotate & 31;
 
 		/* XOR in the various taps */
-		w ^= r->pool[(i + r->poolinfo->tap1) & wordmask];
-		w ^= r->pool[(i + r->poolinfo->tap2) & wordmask];
-		w ^= r->pool[(i + r->poolinfo->tap3) & wordmask];
-		w ^= r->pool[(i + r->poolinfo->tap4) & wordmask];
-		w ^= r->pool[(i + r->poolinfo->tap5) & wordmask];
-		w ^= r->pool[i];
-		r->pool[i] = (w >> 3) ^ twist_table[w & 7];
+		w ^= pool[(i + tap1) & wordmask];
+		w ^= pool[(i + tap2) & wordmask];
+		w ^= pool[(i + tap3) & wordmask];
+		w ^= pool[(i + tap4) & wordmask];
+		w ^= pool[(i + tap5) & wordmask];
+		w ^= pool[i];
+		pool[i] = (w >> 3) ^ twist_table[w & 7];
 	}
 
+	r->input_rotate = input_rotate;
+	r->add_ptr = add_ptr;
+
 	spin_unlock_irqrestore(&r->lock, flags);
 }
 
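Outside the diff, the hoisted-locals version of the mixing loop can be sketched as a standalone toy. This is only a sketch: the pool size is shrunk to 32 words and the tap positions are hard-coded for illustration (in the driver they come from poolinfo), but it shows the same load-ahead and locals-hoisting structure as the patch.

```c
#include <stdint.h>
#include <assert.h>

#define POOLWORDS 32            /* toy pool size; must be a power of two */

/* 32-bit rotate left; guard n == 0 to avoid an undefined shift by 32. */
static uint32_t rol32(uint32_t w, int n)
{
        return n ? (w << n) | (w >> (32 - n)) : w;
}

static const uint32_t twist_table[8] = {
                 0, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
        0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };

/* Mix nwords of input into the pool.  As in the patch, the taps and
 * rotate state are held in locals across the loop, and the next input
 * word is loaded before the current one is consumed. */
static void mix_words(uint32_t *pool, const uint32_t *in, int nwords)
{
        unsigned long wordmask = POOLWORDS - 1;
        unsigned long i, add_ptr = 0;
        /* illustrative taps (those of the 32-word pool's polynomial) */
        unsigned long tap1 = 26, tap2 = 20, tap3 = 14, tap4 = 7, tap5 = 1;
        int input_rotate = 0, new_rotate;
        uint32_t w, next_w = *in++;

        while (nwords--) {
                w = rol32(next_w, input_rotate);
                if (nwords > 0)
                        next_w = *in++;         /* load-ahead */
                i = add_ptr = (add_ptr - 1) & wordmask;

                new_rotate = input_rotate + 14;
                if (i)
                        new_rotate = input_rotate + 7;
                input_rotate = new_rotate & 31;

                /* XOR in the taps, then twist the result back in */
                w ^= pool[(i + tap1) & wordmask];
                w ^= pool[(i + tap2) & wordmask];
                w ^= pool[(i + tap3) & wordmask];
                w ^= pool[(i + tap4) & wordmask];
                w ^= pool[(i + tap5) & wordmask];
                w ^= pool[i];
                pool[i] = (w >> 3) ^ twist_table[w & 7];
        }
}
```

Mixing a single word 1 into a zeroed pool writes (1 >> 3) ^ twist_table[1] into the last slot, which makes the function easy to spot-check.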


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26  1:45   ` David Mosberger
@ 2004-03-26  2:00     ` Andrew Morton
  2004-03-26  2:10       ` David Mosberger
                         ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Andrew Morton @ 2004-03-26  2:00 UTC (permalink / raw)
  To: davidm; +Cc: davidm, mpm, linux-kernel

David Mosberger <davidm@napali.hpl.hp.com> wrote:
>
> The
>  patch below is updated to go on top of your patch and gives about the
>  same performance as I reported yesterday.  For now, I defined an
>  inline prefetch_range().  If and when all architectures get updated to
>  define this directly, we can simply remove prefetch_range() from the
>  driver.

We may as well stick prefetch_range() in prefetch.h.

And Matt's patch series is not a thing I want to take on board at present,
so let's stick with the straight scalability patch for now.

I moved the prefetch_range() call to outside the spinlock.  Does that make
sense?

 25-akpm/drivers/char/random.c    |   51 ++++++++++++++++++++++++++-------------
 25-akpm/include/linux/prefetch.h |   11 ++++++++
 2 files changed, 46 insertions(+), 16 deletions(-)

diff -puN drivers/char/random.c~urandom-scalability-fix drivers/char/random.c
--- 25/drivers/char/random.c~urandom-scalability-fix	2004-03-25 17:53:57.498675480 -0800
+++ 25-akpm/drivers/char/random.c	2004-03-25 17:57:39.795881168 -0800
@@ -490,12 +490,15 @@ static inline __u32 int_ln_12bits(__u32 
  **********************************************************************/
 
 struct entropy_store {
+	/* mostly-read data: */
+	struct poolinfo poolinfo;
+	__u32		*pool;
+
+	/* read-write data: */
+	spinlock_t lock ____cacheline_aligned;
 	unsigned	add_ptr;
 	int		entropy_count;
 	int		input_rotate;
-	struct poolinfo poolinfo;
-	__u32		*pool;
-	spinlock_t lock;
 };
 
 /*
@@ -571,38 +574,54 @@ static void add_entropy_words(struct ent
 	static __u32 const twist_table[8] = {
 		         0, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
 		0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
-	unsigned i;
-	int new_rotate;
+	unsigned long i, add_ptr, tap1, tap2, tap3, tap4, tap5;
+	int new_rotate, input_rotate;
 	int wordmask = r->poolinfo.poolwords - 1;
-	__u32 w;
+	__u32 w, next_w;
 	unsigned long flags;
 
+	/* Taps are constant, so we can load them without holding r->lock.  */
+	tap1 = r->poolinfo.tap1;
+	tap2 = r->poolinfo.tap2;
+	tap3 = r->poolinfo.tap3;
+	tap4 = r->poolinfo.tap4;
+	tap5 = r->poolinfo.tap5;
+	next_w = *in++;
+
+	prefetch_range(r->pool, wordmask);
 	spin_lock_irqsave(&r->lock, flags);
+	input_rotate = r->input_rotate;
+	add_ptr = r->add_ptr;
 
 	while (nwords--) {
-		w = rotate_left(r->input_rotate, *in++);
-		i = r->add_ptr = (r->add_ptr - 1) & wordmask;
+		w = rotate_left(input_rotate, next_w);
+		if (nwords > 0)
+			next_w = *in++;
+		i = add_ptr = (add_ptr - 1) & wordmask;
 		/*
 		 * Normally, we add 7 bits of rotation to the pool.
 		 * At the beginning of the pool, add an extra 7 bits
 		 * rotation, so that successive passes spread the
 		 * input bits across the pool evenly.
 		 */
-		new_rotate = r->input_rotate + 14;
+		new_rotate = input_rotate + 14;
 		if (i)
-			new_rotate = r->input_rotate + 7;
-		r->input_rotate = new_rotate & 31;
+			new_rotate = input_rotate + 7;
+		input_rotate = new_rotate & 31;
 
 		/* XOR in the various taps */
-		w ^= r->pool[(i + r->poolinfo.tap1) & wordmask];
-		w ^= r->pool[(i + r->poolinfo.tap2) & wordmask];
-		w ^= r->pool[(i + r->poolinfo.tap3) & wordmask];
-		w ^= r->pool[(i + r->poolinfo.tap4) & wordmask];
-		w ^= r->pool[(i + r->poolinfo.tap5) & wordmask];
+		w ^= r->pool[(i + tap1) & wordmask];
+		w ^= r->pool[(i + tap2) & wordmask];
+		w ^= r->pool[(i + tap3) & wordmask];
+		w ^= r->pool[(i + tap4) & wordmask];
+		w ^= r->pool[(i + tap5) & wordmask];
 		w ^= r->pool[i];
 		r->pool[i] = (w >> 3) ^ twist_table[w & 7];
 	}
 
+	r->input_rotate = input_rotate;
+	r->add_ptr = add_ptr;
+
 	spin_unlock_irqrestore(&r->lock, flags);
 }
 
diff -puN include/linux/prefetch.h~urandom-scalability-fix include/linux/prefetch.h
--- 25/include/linux/prefetch.h~urandom-scalability-fix	2004-03-25 17:54:27.279148160 -0800
+++ 25-akpm/include/linux/prefetch.h	2004-03-25 17:55:34.456935584 -0800
@@ -54,4 +54,15 @@ static inline void prefetchw(const void 
 #define PREFETCH_STRIDE (4*L1_CACHE_BYTES)
 #endif
 
+static inline void prefetch_range(void *addr, size_t len)
+{
+#ifdef ARCH_HAS_PREFETCH
+	char *cp;
+	char *end = addr + len;
+
+	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
+		prefetch(cp);
+#endif
+}
+
 #endif

_



* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26  2:00     ` Andrew Morton
@ 2004-03-26  2:10       ` David Mosberger
  2004-03-26  4:07       ` Matt Mackall
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26  2:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidm, davidm, mpm, linux-kernel

>>>>> On Thu, 25 Mar 2004 18:00:14 -0800, Andrew Morton <akpm@osdl.org> said:

  Andrew> We may as well stick prefetch_range() in prefetch.h.

Fair enough.

  Andrew> And Matt's patch series is not a thing I want to take on
  Andrew> board at present, so let's stick with the straight
  Andrew> scalability patch for now.

  Andrew> I moved the prefetch_range() call to outside the spinlock.
  Andrew> Does that make sense?

The other CPUs will dirty that data, so prefetching it before you hold
the lock is almost certainly bad for performance.  (Well, to be
precise, it really depends on the number of lines dirtied.)

	--david


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26  2:00     ` Andrew Morton
  2004-03-26  2:10       ` David Mosberger
@ 2004-03-26  4:07       ` Matt Mackall
  2004-03-26  4:19       ` Matt Mackall
  2004-03-26 11:06       ` Dave Jones
  3 siblings, 0 replies; 22+ messages in thread
From: Matt Mackall @ 2004-03-26  4:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidm, davidm, linux-kernel

On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
> David Mosberger <davidm@napali.hpl.hp.com> wrote:
> >
> > The
> >  patch below is updated to go on top of your patch and gives about the
> >  same performance as I reported yesterday.  For now, I defined an
> >  inline prefetch_range().  If and when all architectures get updated to
> >  define this directly, we can simply remove prefetch_range() from the
> >  driver.
> 
> We may as well stick prefetch_range() in prefetch.h.
> 
> And Matt's patch series is not a thing I want to take on board at present,
> so let's stick with the straight scalability patch for now.

Sigh, I'll trim it back to some bits I think are critical.
 
> I moved the prefetch_range() call to outside the spinlock.  Does that make
> sense?

I don't think that's actually a win. If there's contention, threads
racing to the lock will grab the same cache lines and all but one
thread's cache will end up invalidated by the time the lock is
released.

-- 
Matt Mackall : http://www.selenic.com : Linux development and consulting


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26  2:00     ` Andrew Morton
  2004-03-26  2:10       ` David Mosberger
  2004-03-26  4:07       ` Matt Mackall
@ 2004-03-26  4:19       ` Matt Mackall
  2004-03-26  4:51         ` David Mosberger
  2004-03-26 11:06       ` Dave Jones
  3 siblings, 1 reply; 22+ messages in thread
From: Matt Mackall @ 2004-03-26  4:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidm, davidm, linux-kernel

On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
> 
>  25-akpm/drivers/char/random.c    |   51 ++++++++++++++++++++++++++-------------
>  25-akpm/include/linux/prefetch.h |   11 ++++++++
>  2 files changed, 46 insertions(+), 16 deletions(-)
> 
> diff -puN drivers/char/random.c~urandom-scalability-fix drivers/char/random.c
> +++ 25-akpm/drivers/char/random.c	2004-03-25 17:57:39.795881168 -0800
> @@ -490,12 +490,15 @@ static inline __u32 int_ln_12bits(__u32 
>   **********************************************************************/
>  
>  struct entropy_store {
> +	/* mostly-read data: */
> +	struct poolinfo poolinfo;
> +	__u32		*pool;
> +
> +	/* read-write data: */
> +	spinlock_t lock ____cacheline_aligned;
>  	unsigned	add_ptr;
>  	int		entropy_count;
>  	int		input_rotate;
> -	struct poolinfo poolinfo;
> -	__u32		*pool;
> -	spinlock_t lock;
>  };


Also, I think in general we'd prefer to stick the aligned bit at the
front of the structure rather than at the middle, as we'll avoid extra
padding. The size of cachelines is getting rather obscene on some
modern processors.

-- 
Matt Mackall : http://www.selenic.com : Linux development and consulting


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26  4:19       ` Matt Mackall
@ 2004-03-26  4:51         ` David Mosberger
  2004-03-26  5:15           ` Matt Mackall
  0 siblings, 1 reply; 22+ messages in thread
From: David Mosberger @ 2004-03-26  4:51 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andrew Morton, davidm, davidm, linux-kernel

>>>>> On Thu, 25 Mar 2004 22:19:26 -0600, Matt Mackall <mpm@selenic.com> said:

>  struct entropy_store {
> +	/* mostly-read data: */
> +	struct poolinfo poolinfo;
> +	__u32		*pool;
> +
> +	/* read-write data: */
> +	spinlock_t lock ____cacheline_aligned;
>  	unsigned	add_ptr;
>  	int		entropy_count;
>  	int		input_rotate;
> -	struct poolinfo poolinfo;
> -	__u32		*pool;
> -	spinlock_t lock;
>  };

  Matt> Also, I think in general we'd prefer to stick the aligned bit at the
  Matt> front of the structure rather than at the middle, as we'll avoid extra
  Matt> padding. The size of cachelines is getting rather obscene on some
  Matt> modern processors.

Not sharing the cacheline between the mostly-read data and the
read-write data is the _point_ of this change.  If you reverse the
order, the "poolinfo" and "pool" members will also get invalidated
whenever someone updates the write-intensive data.

	--david


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26  4:51         ` David Mosberger
@ 2004-03-26  5:15           ` Matt Mackall
  2004-03-26  5:24             ` David Mosberger
  0 siblings, 1 reply; 22+ messages in thread
From: Matt Mackall @ 2004-03-26  5:15 UTC (permalink / raw)
  To: davidm; +Cc: Andrew Morton, davidm, linux-kernel

On Thu, Mar 25, 2004 at 08:51:54PM -0800, David Mosberger wrote:
> >>>>> On Thu, 25 Mar 2004 22:19:26 -0600, Matt Mackall <mpm@selenic.com> said:
> 
> >  struct entropy_store {
> > +	/* mostly-read data: */
> > +	struct poolinfo poolinfo;
> > +	__u32		*pool;
> > +
> > +	/* read-write data: */
> > +	spinlock_t lock ____cacheline_aligned;
> >  	unsigned	add_ptr;
> >  	int		entropy_count;
> >  	int		input_rotate;
> > -	struct poolinfo poolinfo;
> > -	__u32		*pool;
> > -	spinlock_t lock;
> >  };
> 
>   Matt> Also, I think in general we'd prefer to stick the aligned bit at the
>   Matt> front of the structure rather than at the middle, as we'll avoid extra
>   Matt> padding. The size of cachelines is getting rather obscene on some
>   Matt> modern processors.
> 
> Not sharing the cacheline between the mostly-read data and the
> read-write data is the _point_ of this change.  If you reverse the
> order, the "poolinfo" and "pool" members will also get invalidated
> whenever someone updates the write-intensive data.

Ok, my previous observation made no sense; I should really be taking a
nap right now. Hopefully this next one will make more sense: it ought
to be ____cacheline_aligned_in_smp, as the zero-byte spinlock struct
still forces alignment.

-- 
Matt Mackall : http://www.selenic.com : Linux development and consulting


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26  5:15           ` Matt Mackall
@ 2004-03-26  5:24             ` David Mosberger
  0 siblings, 0 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26  5:24 UTC (permalink / raw)
  To: Matt Mackall; +Cc: davidm, Andrew Morton, linux-kernel

>>>>> On Thu, 25 Mar 2004 23:15:32 -0600, Matt Mackall <mpm@selenic.com> said:

  Matt> Ok, previous observation made no sense; I should really be taking a
  Matt> nap right now. Hopefully this next one will make more sense: it ought
  Matt> to be ____cacheline_aligned_in_smp as the zero-byte spinlock struct
  Matt> still forces alignment.

Makes tons of sense, agreed.

	--david


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26  2:00     ` Andrew Morton
                         ` (2 preceding siblings ...)
  2004-03-26  4:19       ` Matt Mackall
@ 2004-03-26 11:06       ` Dave Jones
  2004-03-26 18:08         ` David Mosberger
  3 siblings, 1 reply; 22+ messages in thread
From: Dave Jones @ 2004-03-26 11:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidm, davidm, mpm, linux-kernel

On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:

 > +static inline void prefetch_range(void *addr, size_t len)
 > +{
 > +#ifdef ARCH_HAS_PREFETCH
 > +	char *cp;
 > +	char *end = addr + len;
 > +
 > +	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
 > +		prefetch(cp);
 > +#endif
 > +}
 > +
 >  #endif

I think this may be dangerous on some CPUs: it may prefetch past
the end of the buffer. I.e., if PREFETCH_STRIDE were 32 and len
were 65, we'd end up prefetching 65->97. As well as being
wasteful of cachelines, this can crash if there's e.g.
nothing mapped after the next page boundary.

		Dave



* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 11:06       ` Dave Jones
@ 2004-03-26 18:08         ` David Mosberger
  2004-03-26 18:23           ` Dave Jones
  2004-03-26 18:49           ` Andrew Morton
  0 siblings, 2 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26 18:08 UTC (permalink / raw)
  To: Dave Jones; +Cc: Andrew Morton, davidm, davidm, mpm, linux-kernel

>>>>> On Fri, 26 Mar 2004 11:06:19 +0000, Dave Jones <davej@redhat.com> said:

  Dave> On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
  >> +static inline void prefetch_range(void *addr, size_t len)
  >> +{
  >> +#ifdef ARCH_HAS_PREFETCH
  >> +	char *cp;
  >> +	char *end = addr + len;
  >> +
  >> +	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
  >> +		prefetch(cp);
  >> +#endif
  >> +}
  >> +
  >>  #endif

  Dave> I think this may be dangerous on some CPUs: it may prefetch
  Dave> past the end of the buffer. I.e., if PREFETCH_STRIDE were 32
  Dave> and len were 65, we'd end up prefetching 65->97. As well as
  Dave> being wasteful of cachelines, this can crash if there's e.g.
  Dave> nothing mapped after the next page boundary.

Huh?  It only ever prefetches addresses that are _within_ the
specified buffer.  Of course it will prefetch entire cachelines, but I
hope you're not worried about cachelines crossing page boundaries! ;-))

	--david


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 18:08         ` David Mosberger
@ 2004-03-26 18:23           ` Dave Jones
  2004-03-26 21:31             ` David Mosberger
  2004-03-26 18:49           ` Andrew Morton
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Jones @ 2004-03-26 18:23 UTC (permalink / raw)
  To: davidm; +Cc: Andrew Morton, davidm, mpm, linux-kernel

On Fri, Mar 26, 2004 at 10:08:39AM -0800, David Mosberger wrote:

 > > I think this may be dangerous on some CPUs: it may prefetch
 > > past the end of the buffer. I.e., if PREFETCH_STRIDE were 32 and
 > > len were 65, we'd end up prefetching 65->97. As well as being
 > > wasteful of cachelines, this can crash if there's e.g.
 > > nothing mapped after the next page boundary.
 > 
 > Huh?  It only ever prefetches addresses that are _within_ the
 > specified buffer.  Of course it will prefetch entire cachelines, but I
 > hope you're not worried about cachelines crossing page boundaries! ;-))

The only proposed user of this may take care of this requirement, but
I'm more concerned about someone who isn't aware of the requirement
using this helper routine. At the least it deserves a comment, IMO.

		Dave



* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 18:08         ` David Mosberger
  2004-03-26 18:23           ` Dave Jones
@ 2004-03-26 18:49           ` Andrew Morton
  2004-03-26 20:25             ` David Mosberger
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2004-03-26 18:49 UTC (permalink / raw)
  To: davidm; +Cc: davidm, davej, mpm, linux-kernel

David Mosberger <davidm@napali.hpl.hp.com> wrote:
>
> >>>>> On Fri, 26 Mar 2004 11:06:19 +0000, Dave Jones <davej@redhat.com> said:
> 
>   Dave> On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
>   >> +static inline void prefetch_range(void *addr, size_t len)
>   >> +{
>   >> +#ifdef ARCH_HAS_PREFETCH
>   >> +	char *cp;
>   >> +	char *end = addr + len;
>   >> +
>   >> +	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
>   >> +		prefetch(cp);
>   >> +#endif
>   >> +}
>   >> +
>   >>  #endif
> 
>   Dave> I think this may be dangerous on some CPUs: it may prefetch
>   Dave> past the end of the buffer. I.e., if PREFETCH_STRIDE were 32
>   Dave> and len were 65, we'd end up prefetching 65->97. As well as
>   Dave> being wasteful of cachelines, this can crash if there's e.g.
>   Dave> nothing mapped after the next page boundary.
> 
> Huh?  It only ever prefetches addresses that are _within_ the
> specified buffer.  Of course it will prefetch entire cachelines, but I
> hope you're not worried about cachelines crossing page boundaries! ;-))
> 

But the start address which is fed into prefetch_range() may not be
cacheline-aligned.  So if appropriately abused, a prefetch_range() could
wander off the end of the user's buffer and into a new page.

I think this gets it right, but I probably screwed something up.

static inline void prefetch_range(void *addr, size_t len)
{
#ifdef ARCH_HAS_PREFETCH
	char *cp;
	unsigned long end;

	end = ((unsigned long)addr + len + PREFETCH_STRIDE - 1);
	end &= ~(PREFETCH_STRIDE - 1);

	for (cp = addr; cp < (char *)end; cp += PREFETCH_STRIDE)
		prefetch(cp);
#endif
}
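The two loops can be compared with a small harness that records the addresses each would issue (a sketch: addresses are plain integers rather than real pointers, and a 32-byte stride is assumed):

```c
#include <stddef.h>
#include <assert.h>

#define STRIDE 32               /* assumed PREFETCH_STRIDE for the example */

/* Addresses the straight loop (cp < addr + len) issues. */
static int simple_range(size_t addr, size_t len, size_t out[], int max)
{
        size_t cp;
        int n = 0;

        for (cp = addr; cp < addr + len && n < max; cp += STRIDE)
                out[n++] = cp;
        return n;
}

/* Addresses the round-the-end-up variant above issues. */
static int rounded_range(size_t addr, size_t len, size_t out[], int max)
{
        size_t end = (addr + len + STRIDE - 1) & ~(size_t)(STRIDE - 1);
        size_t cp;
        int n = 0;

        for (cp = addr; cp < end && n < max; cp += STRIDE)
                out[n++] = cp;
        return n;
}
```

For an unaligned call such as (addr=4090, len=10), the straight loop issues only 4090, which lies inside [addr, addr+len); the rounded variant additionally issues 4122, which lies past the requested range.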



* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 18:49           ` Andrew Morton
@ 2004-03-26 20:25             ` David Mosberger
  2004-03-26 20:33               ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: David Mosberger @ 2004-03-26 20:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidm, davidm, davej, mpm, linux-kernel

>>>>> On Fri, 26 Mar 2004 10:49:04 -0800, Andrew Morton <akpm@osdl.org> said:

  Andrew> But the start address which is fed into prefetch_range() may
  Andrew> not be cacheline-aligned.  So if appropriately abused, a
  Andrew> prefetch_range() could wander off the end of the user's
  Andrew> buffer and into a new page.

  Andrew> I think this gets it right, but I probably screwed something
  Andrew> up.

Please, let's not make this more complicated than it is.  The
cacheline alignment doesn't matter at all.  Provided prefetch_range()
is given a range of guaranteed-valid memory, it will be fine.  It
never touches anything outside the specified range.

	--david


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 20:25             ` David Mosberger
@ 2004-03-26 20:33               ` Andrew Morton
  2004-03-26 20:45                 ` Arjan van de Ven
  2004-03-26 21:12                 ` David Mosberger
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2004-03-26 20:33 UTC (permalink / raw)
  To: davidm; +Cc: davidm, davej, mpm, linux-kernel

David Mosberger <davidm@napali.hpl.hp.com> wrote:
>
> >>>>> On Fri, 26 Mar 2004 10:49:04 -0800, Andrew Morton <akpm@osdl.org> said:
> 
>   Andrew> But the start address which is fed into prefetch_range() may
>   Andrew> not be cacheline-aligned.  So if appropriately abused, a
>   Andrew> prefetch_range() could wander off the end of the user's
>   Andrew> buffer and into a new page.
> 
>   Andrew> I think this gets it right, but I probably screwed something
>   Andrew> up.
> 
> Please, let's not make this more complicated than it is.  The
> cacheline alignment doesn't matter at all.  Provided prefetch_range()
> is given a range of guaranteed to be valid memory, then it will be
> fine.  It never touches anything outside the specified range.

If someone does, say,

	prefetch_range(some_pointer, sizeof(*some_pointer));

then it is possible that prefetch_range() could

a) execute a prefetch at addresses which are not PREFETCH_STRIDE-aligned
   and, as a consequence,

b) prefetch data from the next page, outside the range of the user's
   (addr,len).



* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 20:33               ` Andrew Morton
@ 2004-03-26 20:45                 ` Arjan van de Ven
  2004-03-26 21:17                   ` Andrew Morton
  2004-03-26 21:12                 ` David Mosberger
  1 sibling, 1 reply; 22+ messages in thread
From: Arjan van de Ven @ 2004-03-26 20:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidm, davidm, davej, mpm, linux-kernel



> a) execute a prefetch at addresses which are not PREFETCH_STRIDE-aligned
>    and, as a consequence,
> 
> b) prefetch data from the next page, outside the range of the user's
>    (addr,len).

well, if you assume that cachelines (and the prefetch stride) are
proper divisors of PAGE_SIZE, is that still true?



* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 20:33               ` Andrew Morton
  2004-03-26 20:45                 ` Arjan van de Ven
@ 2004-03-26 21:12                 ` David Mosberger
  1 sibling, 0 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26 21:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidm, davidm, davej, mpm, linux-kernel

>>>>> On Fri, 26 Mar 2004 12:33:03 -0800, Andrew Morton <akpm@osdl.org> said:

  Andrew> If someone does, say,

  Andrew> 	prefetch_range(some_pointer, sizeof(*some_pointer));

  Andrew> then it is possible that prefetch_range() could

  Andrew> a) execute a prefetch at addresses which are not
  Andrew> PREFETCH_STRIDE-aligned and, as a consequence,

  Andrew> b) prefetch data from the next page, outside the range of
  Andrew> the user's (addr,len).

This is getting silly.  Cache-lines _never_ cross page-boundaries.

	--david


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 20:45                 ` Arjan van de Ven
@ 2004-03-26 21:17                   ` Andrew Morton
  2004-03-27  7:44                     ` Arjan van de Ven
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2004-03-26 21:17 UTC (permalink / raw)
  To: arjanv; +Cc: davidm, davidm, davej, mpm, linux-kernel

Arjan van de Ven <arjanv@redhat.com> wrote:
>
> > a) execute a prefetch at addresses which are not PREFETCH_STRIDE-aligned
>  >    and, as a consequence,
>  > 
>  > b) prefetch data from the next page, outside the range of the user's
>  >    (addr,len).
> 
>  well if you assume that cachelines (and prefetch stride) are proper
>  divisors of PAGE_SIZE is that still true ?

Probably not ;)

If someone does

	prefetch_range(4096, 1);

on 4k pagesize, what should we do?

Issuing a single

	prefetch 4090

sounds reasonable.

My version instead arranges for it to perform

	prefetch (4096 - 32)

which seems neater.


* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 18:23           ` Dave Jones
@ 2004-03-26 21:31             ` David Mosberger
  0 siblings, 0 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26 21:31 UTC (permalink / raw)
  To: Dave Jones; +Cc: davidm, Andrew Morton, davidm, mpm, linux-kernel

>>>>> On Fri, 26 Mar 2004 18:23:26 +0000, Dave Jones <davej@redhat.com> said:

  Dave> The only proposed user of this may take care of this
  Dave> requirement, but I'm more concerned someone not aware of this
  Dave> requirement using this helper routine. At the least it
  Dave> deserves a comment IMO.

A comment sounds like a fine idea.  But while at it, the equivalent
comment for prefetch() and prefetchw() seems just as appropriate and
perhaps even more important.

	--david

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fw: potential /dev/urandom scalability improvement
       [not found]           ` <1E4IT-7f3-21@gated-at.bofh.it>
@ 2004-03-27  1:29             ` Andi Kleen
  2004-03-27 15:48               ` Matt Mackall
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2004-03-27  1:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton <akpm@osdl.org> writes:
>
> I think this gets it right, but I probably screwed something up.
>
> static inline void prefetch_range(void *addr, size_t len)
> {
> #ifdef ARCH_HAS_PREFETCH
> 	char *cp;
> 	unsigned long end;
>
> 	end = ((unsigned long)addr + len + PREFETCH_STRIDE - 1);
> 	end &= ~(PREFETCH_STRIDE - 1);
>
> 	for (cp = addr; cp < (char *)end; cp += PREFETCH_STRIDE)
> 		prefetch(cp);
> #endif
> }

The memory/bus controller usually only has a limited queue of
outstanding transactions and for a big buffer you will likely overflow
it. Also usually on modern CPUs it is enough to do prefetch for 2-3
cache lines at the beginning, then an automatic hardware prefetcher
will kick in and take care of the rest.

-Andi


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-26 21:17                   ` Andrew Morton
@ 2004-03-27  7:44                     ` Arjan van de Ven
  0 siblings, 0 replies; 22+ messages in thread
From: Arjan van de Ven @ 2004-03-27  7:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidm, davidm, davej, mpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 727 bytes --]

On Fri, Mar 26, 2004 at 01:17:36PM -0800, Andrew Morton wrote:
> If someone does
> 
> 	prefetch_range(4090, 20);
> 
> on 4k pagesize, what should we do?

well if the data structure you prefetch crosses a page boundary, you know
both pages are safe to prefetch.

> Issuing a single
> 
> 	prefetch 4090
> 
> sounds reasonable.
> 
> In that case I'm arranging for it to perform
> 
> 	prefetch (4096 - 32)
> 
> in that case, which seems neater.

well prefetch is defined as "get the cacheline the address is located in";
i.e. the cpu does the rounding to cacheline size for you.

(btw the prefetch stride isn't the cacheline size! it's how far you're supposed
to prefetch ahead to get any useful gain, so for alignment and the like it's
useless)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fw: potential /dev/urandom scalability improvement
  2004-03-27  1:29             ` Fw: potential /dev/urandom scalability improvement Andi Kleen
@ 2004-03-27 15:48               ` Matt Mackall
  0 siblings, 0 replies; 22+ messages in thread
From: Matt Mackall @ 2004-03-27 15:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel

On Sat, Mar 27, 2004 at 02:29:12AM +0100, Andi Kleen wrote:
> Andrew Morton <akpm@osdl.org> writes:
> >
> > I think this gets it right, but I probably screwed something up.
> >
> > static inline void prefetch_range(void *addr, size_t len)
> > {
> > #ifdef ARCH_HAS_PREFETCH
> > 	char *cp;
> > 	unsigned long end;
> >
> > 	end = ((unsigned long)addr + len + PREFETCH_STRIDE - 1);
> > 	end &= ~(PREFETCH_STRIDE - 1);
> >
> > 	for (cp = addr; cp < (char *)end; cp += PREFETCH_STRIDE)
> > 		prefetch(cp);
> > #endif
> > }
> 
> The memory/bus controller usually only has a limited queue of
> outstanding transactions and for a big buffer you will likely overflow
> it. Also usually on modern CPUs it is enough to do prefetch for 2-3
> cache lines at the beginning, then an automatic hardware prefetcher
> will kick in and take care of the rest.

Presumably the automatic prefetcher doesn't deal well with
non-sequential access patterns. But I suspect the right thing to do is
put the knowledge of what's sensible for prefetch priming inside an
arch-specific prefetch_range.

-- 
Matt Mackall : http://www.selenic.com : Linux development and consulting

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2004-03-27 15:48 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1DLZM-8aK-67@gated-at.bofh.it>
     [not found] ` <1DLZM-8aK-65@gated-at.bofh.it>
     [not found]   ` <1DOE1-20o-17@gated-at.bofh.it>
     [not found]     ` <1DOXn-2k7-5@gated-at.bofh.it>
     [not found]       ` <1DXxI-Z7-39@gated-at.bofh.it>
     [not found]         ` <1E467-6KK-17@gated-at.bofh.it>
     [not found]           ` <1E4IT-7f3-21@gated-at.bofh.it>
2004-03-27  1:29             ` Fw: potential /dev/urandom scalability improvement Andi Kleen
2004-03-27 15:48               ` Matt Mackall
     [not found] <20040325141923.7080c6f0.akpm@osdl.org>
2004-03-25 22:47 ` Matt Mackall
2004-03-26  1:45   ` David Mosberger
2004-03-26  2:00     ` Andrew Morton
2004-03-26  2:10       ` David Mosberger
2004-03-26  4:07       ` Matt Mackall
2004-03-26  4:19       ` Matt Mackall
2004-03-26  4:51         ` David Mosberger
2004-03-26  5:15           ` Matt Mackall
2004-03-26  5:24             ` David Mosberger
2004-03-26 11:06       ` Dave Jones
2004-03-26 18:08         ` David Mosberger
2004-03-26 18:23           ` Dave Jones
2004-03-26 21:31             ` David Mosberger
2004-03-26 18:49           ` Andrew Morton
2004-03-26 20:25             ` David Mosberger
2004-03-26 20:33               ` Andrew Morton
2004-03-26 20:45                 ` Arjan van de Ven
2004-03-26 21:17                   ` Andrew Morton
2004-03-27  7:44                     ` Arjan van de Ven
2004-03-26 21:12                 ` David Mosberger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox