* Re: Fw: potential /dev/urandom scalability improvement
[not found] <20040325141923.7080c6f0.akpm@osdl.org>
@ 2004-03-25 22:47 ` Matt Mackall
2004-03-26 1:45 ` David Mosberger
0 siblings, 1 reply; 22+ messages in thread
From: Matt Mackall @ 2004-03-25 22:47 UTC (permalink / raw)
To: David Mosberger, Andrew Morton; +Cc: linux-kernel
On Thu, Mar 25, 2004 at 02:19:23PM -0800, Andrew Morton wrote:
>
> Matt, could you please review David's changes?
>
> Begin forwarded message:
>
> Date: Wed, 24 Mar 2004 22:06:57 -0800
> From: David Mosberger <davidm@napali.hpl.hp.com>
> To: akpm@osdl.org
> Cc: linux-kernel@vger.kernel.org, linux-ia64@vger.kernel.org
> Subject: potential /dev/urandom scalability improvement
>
>
> Hi Andrew,
>
> I'm addressing this patch to you because you seem to have been the
> person who most recently made some performance improvements to the
> random driver.
That was probably me, actually.
> Oh, and somebody who actually understands this code may want to
> double-check the patch for correctness.
Seems perfectly sane.
However, I've got a few pending patches that touch the same areas and
do some more critical cleanup that I've been sitting on since the
2.6.0 freeze. So perhaps I should start pushing those again and we can
queue this behind them. David, if you get a chance, grab the latest
copy of my linux-tiny tree from
http://www.selenic.com/tiny/2.6.5-rc2-tiny1-broken-out.tar.bz2
http://www.selenic.com/tiny/2.6.5-rc2-tiny1.patch.bz2
and see how I've tweaked the pool structure and the locking and how
your bits fit with it.
> +#ifdef ARCH_HAS_PREFETCH
> + for (cp = (char *) r->pool; cp <= (char *) (r->pool + wordmask); cp += PREFETCH_STRIDE)
> + prefetch(cp);
> +#endif
Can we avoid adding this ifdef in some fashion? What does the compiler
generate here when prefetch is a no-op? This seems to call for a
prefetch_range(start, len) function/macro in any case.
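Such a prefetch_range(start, len) helper might look like the following minimal sketch, where the PREFETCH_STRIDE value and the no-op prefetch() are stand-ins for the kernel's per-architecture definitions, and the counter exists purely as instrumentation for the sketch:

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_STRIDE 128     /* stand-in for the per-arch value */

static int prefetches_issued;   /* instrumentation for this sketch only */

static void prefetch(const void *addr)
{
    (void)addr;                 /* no-op here; real arches emit a prefetch insn */
    prefetches_issued++;
}

/* Touch one line per PREFETCH_STRIDE bytes, never past addr + len. */
static void prefetch_range(void *addr, size_t len)
{
    char *cp, *end = (char *)addr + len;

    for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
        prefetch(cp);
}
```

When prefetch() compiles to nothing, the whole loop should be optimized away, which is the question raised above.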
--
Matt Mackall : http://www.selenic.com : Linux development and consulting
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-25 22:47 ` Matt Mackall
@ 2004-03-26 1:45 ` David Mosberger
2004-03-26 2:00 ` Andrew Morton
0 siblings, 1 reply; 22+ messages in thread
From: David Mosberger @ 2004-03-26 1:45 UTC (permalink / raw)
To: Matt Mackall; +Cc: David Mosberger, Andrew Morton, linux-kernel
>>>>> On Thu, 25 Mar 2004 16:47:27 -0600, Matt Mackall <mpm@selenic.com> said:
>> I'm addressing this patch to you because you seem to have been
>> the person who most recently made some performance improvements
>> to the random driver.
Matt> That was probably me, actually.
Sorry, that's what I get for trusting the BK log.
Matt> However, I've got a few pending patches that touch the same
Matt> areas and do some more critical cleanup that I've been sitting
Matt> on since the 2.6.0 freeze. So perhaps I should start pushing
Matt> those again and we can queue this behind them. David, if you
Matt> get a chance, grab the latest copy of my linux-tiny tree from
Matt>
Matt> http://www.selenic.com/tiny/2.6.5-rc2-tiny1-broken-out.tar.bz2
Matt> http://www.selenic.com/tiny/2.6.5-rc2-tiny1.patch.bz2
Matt> and see how I've tweaked the pool structure and the locking
Matt> and how your bits fit with it.
Not much left of the original bits!
>> +#ifdef ARCH_HAS_PREFETCH
Matt> Can we avoid adding this ifdef in some fashion? What does the
Matt> compiler generate here when prefetch is a no-op? This seems to
Matt> call for a prefetch_range(start, len) function/macro in any
Matt> case.
Sounds reasonable, but I would prefer to do this in separate steps.
I tried your changes and performance was virtually unchanged. The
patch below is updated to go on top of your patch and gives about the
same performance as I reported yesterday. For now, I defined an
inline prefetch_range(). If and when all architectures get updated to
define this directly, we can simply remove prefetch_range() from the
driver.
Thanks
--david
--
--- drivers/char/random.c 2004-03-25 17:41:54.432358997 -0800
+++ drivers/char/random.c-davidm 2004-03-25 17:34:59.040063215 -0800
@@ -421,14 +421,20 @@
**********************************************************************/
struct entropy_store {
+ /* mostly-read data: */
const char *name;
+ struct poolinfo *poolinfo;
+ __u32 *pool;
+ /*
+ * read-write data (colocate with lock such that when we get
+ * the lock, we get the other data for "free"; may cause some
+ * extra bus-transactions, though):
+ */
+ spinlock_t lock ____cacheline_aligned;
unsigned add_ptr;
int entropy_count;
int input_rotate;
int reserved;
- struct poolinfo *poolinfo;
- spinlock_t lock;
- __u32 *pool;
};
static __u32 input_pool_data[INPUT_POOL_WORDS];
@@ -456,6 +462,16 @@
.pool = nonblocking_pool_data
};
+static inline void
+prefetch_range (void *addr, size_t len)
+{
+#ifdef ARCH_HAS_PREFETCH
+ char *cp, *end = (char *) addr + len;
+ for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
+ prefetch(cp);
+#endif
+}
+
/*
* This function adds words into the entropy "pool". It does not
* update the entropy estimate. The caller should call
@@ -472,17 +488,30 @@
static __u32 const twist_table[8] = {
0, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
- unsigned i;
- int new_rotate;
- int wordmask = r->poolinfo->poolwords - 1;
- __u32 w;
- unsigned long flags;
+ unsigned long i, add_ptr, tap1, tap2, tap3, tap4, tap5;
+ int new_rotate, input_rotate;
+ unsigned long flags, wordmask = r->poolinfo->poolwords - 1;
+ __u32 w, next_w, *pool = r->pool;
+
+ /* Taps are constant, so we can load them without holding r->lock. */
+ tap1 = r->poolinfo->tap1;
+ tap2 = r->poolinfo->tap2;
+ tap3 = r->poolinfo->tap3;
+ tap4 = r->poolinfo->tap4;
+ tap5 = r->poolinfo->tap5;
+ next_w = *in++;
spin_lock_irqsave(&r->lock, flags);
+ prefetch_range(pool, 4 * (wordmask + 1));
+ input_rotate = r->input_rotate;
+ add_ptr = r->add_ptr;
+
while (nwords--) {
- w = rol32(*in++, r->input_rotate);
- i = r->add_ptr = (r->add_ptr - 1) & wordmask;
+ w = rol32(next_w, input_rotate);
+ if (nwords > 0)
+ next_w = *in++;
+ i = add_ptr = (add_ptr - 1) & wordmask;
/*
* Normally, we add 7 bits of rotation to the pool.
@@ -490,21 +519,24 @@
* rotation, so that successive passes spread the
* input bits across the pool evenly.
*/
- new_rotate = r->input_rotate + 14;
+ new_rotate = input_rotate + 14;
if (i)
- new_rotate = r->input_rotate + 7;
- r->input_rotate = new_rotate & 31;
+ new_rotate = input_rotate + 7;
+ input_rotate = new_rotate & 31;
/* XOR in the various taps */
- w ^= r->pool[(i + r->poolinfo->tap1) & wordmask];
- w ^= r->pool[(i + r->poolinfo->tap2) & wordmask];
- w ^= r->pool[(i + r->poolinfo->tap3) & wordmask];
- w ^= r->pool[(i + r->poolinfo->tap4) & wordmask];
- w ^= r->pool[(i + r->poolinfo->tap5) & wordmask];
- w ^= r->pool[i];
- r->pool[i] = (w >> 3) ^ twist_table[w & 7];
+ w ^= pool[(i + tap1) & wordmask];
+ w ^= pool[(i + tap2) & wordmask];
+ w ^= pool[(i + tap3) & wordmask];
+ w ^= pool[(i + tap4) & wordmask];
+ w ^= pool[(i + tap5) & wordmask];
+ w ^= pool[i];
+ pool[i] = (w >> 3) ^ twist_table[w & 7];
}
+ r->input_rotate = input_rotate;
+ r->add_ptr = add_ptr;
+
spin_unlock_irqrestore(&r->lock, flags);
}
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 1:45 ` David Mosberger
@ 2004-03-26 2:00 ` Andrew Morton
2004-03-26 2:10 ` David Mosberger
` (3 more replies)
0 siblings, 4 replies; 22+ messages in thread
From: Andrew Morton @ 2004-03-26 2:00 UTC (permalink / raw)
To: davidm; +Cc: davidm, mpm, linux-kernel
David Mosberger <davidm@napali.hpl.hp.com> wrote:
>
> The
> patch below is updated to go on top of your patch and gives about the
> same performance as I reported yesterday. For now, I defined an
> inline prefetch_range(). If and when all architectures get updated to
> define this directly, we can simply remove prefetch_range() from the
> driver.
We may as well stick prefetch_range() in prefetch.h.
And Matt's patch series is not a thing I want to take on board at present,
so let's stick with the straight scalability patch for now.
I moved the prefetch_range() call to outside the spinlock. Does that make
sense?
25-akpm/drivers/char/random.c | 51 ++++++++++++++++++++++++++-------------
25-akpm/include/linux/prefetch.h | 11 ++++++++
2 files changed, 46 insertions(+), 16 deletions(-)
diff -puN drivers/char/random.c~urandom-scalability-fix drivers/char/random.c
--- 25/drivers/char/random.c~urandom-scalability-fix 2004-03-25 17:53:57.498675480 -0800
+++ 25-akpm/drivers/char/random.c 2004-03-25 17:57:39.795881168 -0800
@@ -490,12 +490,15 @@ static inline __u32 int_ln_12bits(__u32
**********************************************************************/
struct entropy_store {
+ /* mostly-read data: */
+ struct poolinfo poolinfo;
+ __u32 *pool;
+
+ /* read-write data: */
+ spinlock_t lock ____cacheline_aligned;
unsigned add_ptr;
int entropy_count;
int input_rotate;
- struct poolinfo poolinfo;
- __u32 *pool;
- spinlock_t lock;
};
/*
@@ -571,38 +574,54 @@ static void add_entropy_words(struct ent
static __u32 const twist_table[8] = {
0, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
- unsigned i;
- int new_rotate;
+ unsigned long i, add_ptr, tap1, tap2, tap3, tap4, tap5;
+ int new_rotate, input_rotate;
int wordmask = r->poolinfo.poolwords - 1;
- __u32 w;
+ __u32 w, next_w;
unsigned long flags;
+ /* Taps are constant, so we can load them without holding r->lock. */
+ tap1 = r->poolinfo.tap1;
+ tap2 = r->poolinfo.tap2;
+ tap3 = r->poolinfo.tap3;
+ tap4 = r->poolinfo.tap4;
+ tap5 = r->poolinfo.tap5;
+ next_w = *in++;
+
+ prefetch_range(r->pool, wordmask);
spin_lock_irqsave(&r->lock, flags);
+ input_rotate = r->input_rotate;
+ add_ptr = r->add_ptr;
while (nwords--) {
- w = rotate_left(r->input_rotate, *in++);
- i = r->add_ptr = (r->add_ptr - 1) & wordmask;
+ w = rotate_left(input_rotate, next_w);
+ if (nwords > 0)
+ next_w = *in++;
+ i = add_ptr = (add_ptr - 1) & wordmask;
/*
* Normally, we add 7 bits of rotation to the pool.
* At the beginning of the pool, add an extra 7 bits
* rotation, so that successive passes spread the
* input bits across the pool evenly.
*/
- new_rotate = r->input_rotate + 14;
+ new_rotate = input_rotate + 14;
if (i)
- new_rotate = r->input_rotate + 7;
- r->input_rotate = new_rotate & 31;
+ new_rotate = input_rotate + 7;
+ input_rotate = new_rotate & 31;
/* XOR in the various taps */
- w ^= r->pool[(i + r->poolinfo.tap1) & wordmask];
- w ^= r->pool[(i + r->poolinfo.tap2) & wordmask];
- w ^= r->pool[(i + r->poolinfo.tap3) & wordmask];
- w ^= r->pool[(i + r->poolinfo.tap4) & wordmask];
- w ^= r->pool[(i + r->poolinfo.tap5) & wordmask];
+ w ^= r->pool[(i + tap1) & wordmask];
+ w ^= r->pool[(i + tap2) & wordmask];
+ w ^= r->pool[(i + tap3) & wordmask];
+ w ^= r->pool[(i + tap4) & wordmask];
+ w ^= r->pool[(i + tap5) & wordmask];
w ^= r->pool[i];
r->pool[i] = (w >> 3) ^ twist_table[w & 7];
}
+ r->input_rotate = input_rotate;
+ r->add_ptr = add_ptr;
+
spin_unlock_irqrestore(&r->lock, flags);
}
diff -puN include/linux/prefetch.h~urandom-scalability-fix include/linux/prefetch.h
--- 25/include/linux/prefetch.h~urandom-scalability-fix 2004-03-25 17:54:27.279148160 -0800
+++ 25-akpm/include/linux/prefetch.h 2004-03-25 17:55:34.456935584 -0800
@@ -54,4 +54,15 @@ static inline void prefetchw(const void
#define PREFETCH_STRIDE (4*L1_CACHE_BYTES)
#endif
+static inline void prefetch_range(void *addr, size_t len)
+{
+#ifdef ARCH_HAS_PREFETCH
+ char *cp;
+ char *end = addr + len;
+
+ for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
+ prefetch(cp);
+#endif
+}
+
#endif
_
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 2:00 ` Andrew Morton
@ 2004-03-26 2:10 ` David Mosberger
2004-03-26 4:07 ` Matt Mackall
` (2 subsequent siblings)
3 siblings, 0 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26 2:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: davidm, davidm, mpm, linux-kernel
>>>>> On Thu, 25 Mar 2004 18:00:14 -0800, Andrew Morton <akpm@osdl.org> said:
Andrew> We may as well stick prefetch_range() in prefetch.h.
Fair enough.
Andrew> And Matt's patch series is not a thing I want to take on
Andrew> board at present, so let's stick with the straight
Andrew> scalability patch for now.
Andrew> I moved the prefetch_range() call to outside the spinlock.
Andrew> Does that make sense?
The other CPUs will dirty that data, so prefetching it before you hold
the lock is almost certainly bad for performance. (Well, to be
precise, it really depends on the number of lines dirtied.)
--david
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 2:00 ` Andrew Morton
2004-03-26 2:10 ` David Mosberger
@ 2004-03-26 4:07 ` Matt Mackall
2004-03-26 4:19 ` Matt Mackall
2004-03-26 11:06 ` Dave Jones
3 siblings, 0 replies; 22+ messages in thread
From: Matt Mackall @ 2004-03-26 4:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: davidm, davidm, linux-kernel
On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
> David Mosberger <davidm@napali.hpl.hp.com> wrote:
> >
> > The
> > patch below is updated to go on top of your patch and gives about the
> > same performance as I reported yesterday. For now, I defined an
> > inline prefetch_range(). If and when all architectures get updated to
> > define this directly, we can simply remove prefetch_range() from the
> > driver.
>
> We may as well stick prefetch_range() in prefetch.h.
>
> And Matt's patch series is not a thing I want to take on board at present,
> so let's stick with the straight scalability patch for now.
Sigh, I'll trim it back to some bits I think are critical.
> I moved the prefetch_range() call to outside the spinlock. Does that make
> sense?
I don't think that's actually a win. If there's contention, threads
racing to the lock will grab the same cache lines and all but one
thread's cache will end up invalidated by the time the lock is
released.
--
Matt Mackall : http://www.selenic.com : Linux development and consulting
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 2:00 ` Andrew Morton
2004-03-26 2:10 ` David Mosberger
2004-03-26 4:07 ` Matt Mackall
@ 2004-03-26 4:19 ` Matt Mackall
2004-03-26 4:51 ` David Mosberger
2004-03-26 11:06 ` Dave Jones
3 siblings, 1 reply; 22+ messages in thread
From: Matt Mackall @ 2004-03-26 4:19 UTC (permalink / raw)
To: Andrew Morton; +Cc: davidm, davidm, linux-kernel
On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
>
> 25-akpm/drivers/char/random.c | 51 ++++++++++++++++++++++++++-------------
> 25-akpm/include/linux/prefetch.h | 11 ++++++++
> 2 files changed, 46 insertions(+), 16 deletions(-)
>
> diff -puN drivers/char/random.c~urandom-scalability-fix drivers/char/random.c
> +++ 25-akpm/drivers/char/random.c 2004-03-25 17:57:39.795881168 -0800
> @@ -490,12 +490,15 @@ static inline __u32 int_ln_12bits(__u32
> **********************************************************************/
>
> struct entropy_store {
> + /* mostly-read data: */
> + struct poolinfo poolinfo;
> + __u32 *pool;
> +
> + /* read-write data: */
> + spinlock_t lock ____cacheline_aligned;
> unsigned add_ptr;
> int entropy_count;
> int input_rotate;
> - struct poolinfo poolinfo;
> - __u32 *pool;
> - spinlock_t lock;
> };
Also, I think in general we'd prefer to stick the aligned bit at the
front of the structure rather than in the middle, as we'll avoid extra
padding. The size of cachelines is getting rather obscene on some
modern processors.
--
Matt Mackall : http://www.selenic.com : Linux development and consulting
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 4:19 ` Matt Mackall
@ 2004-03-26 4:51 ` David Mosberger
2004-03-26 5:15 ` Matt Mackall
0 siblings, 1 reply; 22+ messages in thread
From: David Mosberger @ 2004-03-26 4:51 UTC (permalink / raw)
To: Matt Mackall; +Cc: Andrew Morton, davidm, davidm, linux-kernel
>>>>> On Thu, 25 Mar 2004 22:19:26 -0600, Matt Mackall <mpm@selenic.com> said:
> struct entropy_store {
> + /* mostly-read data: */
> + struct poolinfo poolinfo;
> + __u32 *pool;
> +
> + /* read-write data: */
> + spinlock_t lock ____cacheline_aligned;
> unsigned add_ptr;
> int entropy_count;
> int input_rotate;
> - struct poolinfo poolinfo;
> - __u32 *pool;
> - spinlock_t lock;
> };
Matt> Also, I think in general we'd prefer to stick the aligned bit at the
Matt> front of the structure rather than in the middle, as we'll avoid extra
Matt> padding. The size of cachelines is getting rather obscene on some
Matt> modern processors.
Not sharing the cacheline between the mostly-read data and the
read-write data is the _point_ of this change. If you reverse the
order, the "poolinfo" and "pool" members will also get invalidated
whenever someone updates the write-intensive data.
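The layout David describes can be sketched in plain C. Here spinlock_t and ____cacheline_aligned are userspace stand-ins for the kernel definitions, L1_CACHE_BYTES is an assumed 64-byte line size, and the field names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define L1_CACHE_BYTES 64       /* assumed line size for this sketch */

/* Userspace stand-ins for the kernel types/macros under discussion. */
typedef struct { volatile int slock; } spinlock_t;
#define ____cacheline_aligned __attribute__((__aligned__(L1_CACHE_BYTES)))

struct entropy_store_sketch {
    /* mostly-read data: readers can keep these lines in Shared state */
    const char *name;
    unsigned int *pool;

    /* read-write data: writers bounce only the lines from here on */
    spinlock_t lock ____cacheline_aligned;
    unsigned int add_ptr;
    int entropy_count;
    int input_rotate;
};
```

Because the alignment attribute starts a fresh line at the lock, updates to add_ptr and friends never invalidate the line holding the read-mostly pointers; putting the aligned member first would merge the two groups back onto one line.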
--david
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 4:51 ` David Mosberger
@ 2004-03-26 5:15 ` Matt Mackall
2004-03-26 5:24 ` David Mosberger
0 siblings, 1 reply; 22+ messages in thread
From: Matt Mackall @ 2004-03-26 5:15 UTC (permalink / raw)
To: davidm; +Cc: Andrew Morton, davidm, linux-kernel
On Thu, Mar 25, 2004 at 08:51:54PM -0800, David Mosberger wrote:
> >>>>> On Thu, 25 Mar 2004 22:19:26 -0600, Matt Mackall <mpm@selenic.com> said:
>
> > struct entropy_store {
> > + /* mostly-read data: */
> > + struct poolinfo poolinfo;
> > + __u32 *pool;
> > +
> > + /* read-write data: */
> > + spinlock_t lock ____cacheline_aligned;
> > unsigned add_ptr;
> > int entropy_count;
> > int input_rotate;
> > - struct poolinfo poolinfo;
> > - __u32 *pool;
> > - spinlock_t lock;
> > };
>
> Matt> Also, I think in general we'd prefer to stick the aligned bit at the
> Matt> front of the structure rather than at the middle, as we'll avoid extra
> Matt> padding. The size of cachelines is getting rather obscene on some
> Matt> modern processors.
>
> Not sharing the cacheline between the mostly-read data and the
> read-write data is the _point_ of this change. If you reverse the
> order, the "poolinfo" and "pool" members will also get invalidated
> whenever someone updates the write-intensive data.
Ok, previous observation made no sense; I should really be taking a
nap right now. Hopefully this next one will make more sense: it ought
to be ____cacheline_aligned_in_smp as the zero-byte spinlock struct
still forces alignment.
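The distinction can be sketched as follows; this is a simplified, hypothetical rendering of the kernel's cache macros, in which the _in_smp variant expands to nothing on a uniprocessor build, so the padding cost disappears there:

```c
#include <assert.h>
#include <stddef.h>

#define L1_CACHE_BYTES 64       /* assumed line size for this sketch */

/* Simplified: the alignment only costs space when CONFIG_SMP is set. */
#ifdef CONFIG_SMP
# define ____cacheline_aligned_in_smp __attribute__((__aligned__(L1_CACHE_BYTES)))
#else
# define ____cacheline_aligned_in_smp   /* expands to nothing on UP */
#endif

struct up_layout {
    const char *name;
    int counter ____cacheline_aligned_in_smp;   /* no padding on UP builds */
};
```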
--
Matt Mackall : http://www.selenic.com : Linux development and consulting
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 5:15 ` Matt Mackall
@ 2004-03-26 5:24 ` David Mosberger
0 siblings, 0 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26 5:24 UTC (permalink / raw)
To: Matt Mackall; +Cc: davidm, Andrew Morton, linux-kernel
>>>>> On Thu, 25 Mar 2004 23:15:32 -0600, Matt Mackall <mpm@selenic.com> said:
Matt> Ok, previous observation made no sense; I should really be taking a
Matt> nap right now. Hopefully this next one will make more sense: it ought
Matt> to be ____cacheline_aligned_in_smp as the zero-byte spinlock struct
Matt> still forces alignment.
Makes tons of sense, agreed.
--david
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 2:00 ` Andrew Morton
` (2 preceding siblings ...)
2004-03-26 4:19 ` Matt Mackall
@ 2004-03-26 11:06 ` Dave Jones
2004-03-26 18:08 ` David Mosberger
3 siblings, 1 reply; 22+ messages in thread
From: Dave Jones @ 2004-03-26 11:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: davidm, davidm, mpm, linux-kernel
On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
> +static inline void prefetch_range(void *addr, size_t len)
> +{
> +#ifdef ARCH_HAS_PREFETCH
> + char *cp;
> + char *end = addr + len;
> +
> + for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
> + prefetch(cp);
> +#endif
> +}
> +
> #endif
I think this may be dangerous on some CPUs: it may prefetch past
the end of the buffer. I.e., if PREFETCH_STRIDE were 32 and len
were 65, we'd end up prefetching 65->97. As well as being
wasteful of cache lines, this can crash if there's, for example,
nothing mapped after the next page boundary.
Dave
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 11:06 ` Dave Jones
@ 2004-03-26 18:08 ` David Mosberger
2004-03-26 18:23 ` Dave Jones
2004-03-26 18:49 ` Andrew Morton
0 siblings, 2 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26 18:08 UTC (permalink / raw)
To: Dave Jones; +Cc: Andrew Morton, davidm, davidm, mpm, linux-kernel
>>>>> On Fri, 26 Mar 2004 11:06:19 +0000, Dave Jones <davej@redhat.com> said:
Dave> On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
>> +static inline void prefetch_range(void *addr, size_t len)
>> +{
>> +#ifdef ARCH_HAS_PREFETCH
>> + char *cp;
>> + char *end = addr + len;
>> +
>> + for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
>> + prefetch(cp);
>> +#endif
>> +}
>> +
>> #endif
Dave> I think this may be dangerous on some CPUs, if may prefetch
Dave> past the end of the buffer. Ie, if PREFETCH_STRIDE was 32, and
Dave> len was 65, we'd end up prefetching 65->97. As well as being
Dave> wasteful to cachelines, this can crash if theres for eg
Dave> nothing mapped after the next page boundary.
Huh? It only ever prefetches addresses that are _within_ the
specified buffer. Of course it will prefetch entire cachelines, but I
hope you're not worried about cache lines crossing page boundaries! ;-))
--david
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 18:08 ` David Mosberger
@ 2004-03-26 18:23 ` Dave Jones
2004-03-26 21:31 ` David Mosberger
2004-03-26 18:49 ` Andrew Morton
1 sibling, 1 reply; 22+ messages in thread
From: Dave Jones @ 2004-03-26 18:23 UTC (permalink / raw)
To: davidm; +Cc: Andrew Morton, davidm, mpm, linux-kernel
On Fri, Mar 26, 2004 at 10:08:39AM -0800, David Mosberger wrote:
> > I think this may be dangerous on some CPUs, if may prefetch
> > past the end of the buffer. Ie, if PREFETCH_STRIDE was 32, and
> > len was 65, we'd end up prefetching 65->97. As well as being
> > wasteful to cachelines, this can crash if theres for eg
> > nothing mapped after the next page boundary.
>
> Huh? It only ever prefetches addresses that are _within_ the
> specified buffer. Of course it will prefetch entire cachelines, but I
> hope you're not worried about cachlines crossing page-boundaries! ;-))
The only proposed user of this may take care of this requirement, but
I'm more concerned about someone using this helper routine without
being aware of the requirement. At the least it deserves a comment, IMO.
Dave
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 18:08 ` David Mosberger
2004-03-26 18:23 ` Dave Jones
@ 2004-03-26 18:49 ` Andrew Morton
2004-03-26 20:25 ` David Mosberger
1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2004-03-26 18:49 UTC (permalink / raw)
To: davidm; +Cc: davidm, davej, mpm, linux-kernel
David Mosberger <davidm@napali.hpl.hp.com> wrote:
>
> >>>>> On Fri, 26 Mar 2004 11:06:19 +0000, Dave Jones <davej@redhat.com> said:
>
> Dave> On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
> >> +static inline void prefetch_range(void *addr, size_t len)
> >> +{
> >> +#ifdef ARCH_HAS_PREFETCH
> >> + char *cp;
> >> + char *end = addr + len;
> >> +
> >> + for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
> >> + prefetch(cp);
> >> +#endif
> >> +}
> >> +
> >> #endif
>
> Dave> I think this may be dangerous on some CPUs, if may prefetch
> Dave> past the end of the buffer. Ie, if PREFETCH_STRIDE was 32, and
> Dave> len was 65, we'd end up prefetching 65->97. As well as being
> Dave> wasteful to cachelines, this can crash if theres for eg
> Dave> nothing mapped after the next page boundary.
>
> Huh? It only ever prefetches addresses that are _within_ the
> specified buffer. Of course it will prefetch entire cachelines, but I
> hope you're not worried about cachlines crossing page-boundaries! ;-))
>
But the start address which is fed into prefetch_range() may not be
cacheline-aligned. So if appropriately abused, a prefetch_range() could
wander off the end of the user's buffer and into a new page.
I think this gets it right, but I probably screwed something up.
static inline void prefetch_range(void *addr, size_t len)
{
#ifdef ARCH_HAS_PREFETCH
char *cp;
unsigned long end;
end = ((unsigned long)addr + len + PREFETCH_STRIDE - 1);
end &= ~(PREFETCH_STRIDE - 1);
for (cp = addr; cp < (char *)end; cp += PREFETCH_STRIDE)
prefetch(cp);
#endif
}
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 18:49 ` Andrew Morton
@ 2004-03-26 20:25 ` David Mosberger
2004-03-26 20:33 ` Andrew Morton
0 siblings, 1 reply; 22+ messages in thread
From: David Mosberger @ 2004-03-26 20:25 UTC (permalink / raw)
To: Andrew Morton; +Cc: davidm, davidm, davej, mpm, linux-kernel
>>>>> On Fri, 26 Mar 2004 10:49:04 -0800, Andrew Morton <akpm@osdl.org> said:
Andrew> But the start address which is fed into prefetch_range() may
Andrew> not be cacheline-aligned. So if appropriately abused, a
Andrew> prefetch_range() could wander off the end of the user's
Andrew> buffer and into a new page.
Andrew> I think this gets it right, but I probably screwed something
Andrew> up.
Please, let's not make this more complicated than it is. The
cacheline alignment doesn't matter at all. Provided prefetch_range()
is given a range of guaranteed-to-be-valid memory, it will be fine.
It never touches anything outside the specified range.
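This can be checked mechanically with a small sketch built around Dave's 65-byte example; the stride value and the instrumentation are assumptions for illustration only:

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_STRIDE 32      /* assumed, matching the example above */

/* Record the last address handed to prefetch() (sketch instrumentation). */
static const char *last_prefetch;

static void prefetch(const void *addr)
{
    last_prefetch = (const char *)addr;
}

/* Same loop shape as the patch under discussion: stops strictly before end. */
static void prefetch_range(void *addr, size_t len)
{
    char *cp, *end = (char *)addr + len;

    for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
        prefetch(cp);
}
```

For a 65-byte buffer the loop touches offsets 0, 32, and 64, all inside the buffer; what gets pulled in is the whole cacheline around each address, which by definition cannot cross a page boundary.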
--david
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 20:25 ` David Mosberger
@ 2004-03-26 20:33 ` Andrew Morton
2004-03-26 20:45 ` Arjan van de Ven
2004-03-26 21:12 ` David Mosberger
0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2004-03-26 20:33 UTC (permalink / raw)
To: davidm; +Cc: davidm, davej, mpm, linux-kernel
David Mosberger <davidm@napali.hpl.hp.com> wrote:
>
> >>>>> On Fri, 26 Mar 2004 10:49:04 -0800, Andrew Morton <akpm@osdl.org> said:
>
> Andrew> But the start address which is fed into prefetch_range() may
> Andrew> not be cacheline-aligned. So if appropriately abused, a
> Andrew> prefetch_range() could wander off the end of the user's
> Andrew> buffer and into a new page.
>
> Andrew> I think this gets it right, but I probably screwed something
> Andrew> up.
>
> Please, let's not make this more complicated than it is. The
> cacheline alignment doesn't matter at all. Provided prefetch_range()
> is given a range of guaranteed to be valid memory, then it will be
> fine. It never touches anything outside the specified range.
If someone does, say,
prefetch_range(some_pointer, sizeof(*some_pointer));
then it is possible that prefetch_range() could
a) execute a prefetch at addresses which are not PREFETCH_STRIDE-aligned
and, as a consequence,
b) prefetch data from the next page, outside the range of the user's
(addr,len).
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 20:33 ` Andrew Morton
@ 2004-03-26 20:45 ` Arjan van de Ven
2004-03-26 21:17 ` Andrew Morton
2004-03-26 21:12 ` David Mosberger
1 sibling, 1 reply; 22+ messages in thread
From: Arjan van de Ven @ 2004-03-26 20:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: davidm, davidm, davej, mpm, linux-kernel
> a) execute a prefetch at addresses which are not PREFETCH_STRIDE-aligned
> and, as a consequence,
>
> b) prefetch data from the next page, outside the range of the user's
> (addr,len).
Well, if you assume that cachelines (and the prefetch stride) are proper
divisors of PAGE_SIZE, is that still true?
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 20:33 ` Andrew Morton
2004-03-26 20:45 ` Arjan van de Ven
@ 2004-03-26 21:12 ` David Mosberger
1 sibling, 0 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26 21:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: davidm, davidm, davej, mpm, linux-kernel
>>>>> On Fri, 26 Mar 2004 12:33:03 -0800, Andrew Morton <akpm@osdl.org> said:
Andrew> If someone does, say,
Andrew> prefetch_range(some_pointer, sizeof(*some_pointer));
Andrew> then it is possible that prefetch_range() could
Andrew> a) execute a prefetch at addresses which are not
Andrew> PREFETCH_STRIDE-aligned and, as a consequence,
Andrew> b) prefetch data from the next page, outside the range of
Andrew> the user's (addr,len).
This is getting silly. Cache-lines _never_ cross page-boundaries.
--david
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 20:45 ` Arjan van de Ven
@ 2004-03-26 21:17 ` Andrew Morton
2004-03-27 7:44 ` Arjan van de Ven
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2004-03-26 21:17 UTC (permalink / raw)
To: arjanv; +Cc: davidm, davidm, davej, mpm, linux-kernel
Arjan van de Ven <arjanv@redhat.com> wrote:
>
> > a) execute a prefetch at addresses which are not PREFETCH_STRIDE-aligned
> > and, as a consequence,
> >
> > b) prefetch data from the next page, outside the range of the user's
> > (addr,len).
>
> well if you assume that cachelines (and prefetch stride) are proper
> divisors of PAGE_SIZE is that still true ?
Probably not ;)
If someone does
prefetch_range(4090, 20);
on 4k pagesize, what should we do?
Issuing a single
prefetch 4090
sounds reasonable.
In that case I'm arranging for it to perform
prefetch (4096 - 32)
which seems neater.
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 18:23 ` Dave Jones
@ 2004-03-26 21:31 ` David Mosberger
0 siblings, 0 replies; 22+ messages in thread
From: David Mosberger @ 2004-03-26 21:31 UTC (permalink / raw)
To: Dave Jones; +Cc: davidm, Andrew Morton, davidm, mpm, linux-kernel
>>>>> On Fri, 26 Mar 2004 18:23:26 +0000, Dave Jones <davej@redhat.com> said:
Dave> The proposed only user of this may take care of this
Dave> requirement, but I'm more concerned someone not aware of this
Dave> requirement using this helper routine. At the least it
Dave> deserves a comment IMO.
A comment sounds like a fine idea. But while at it, the equivalent
comment for prefetch() and prefetchw() seems just as appropriate and
perhaps even more important.
--david
* Re: Fw: potential /dev/urandom scalability improvement
[not found] ` <1E4IT-7f3-21@gated-at.bofh.it>
@ 2004-03-27 1:29 ` Andi Kleen
2004-03-27 15:48 ` Matt Mackall
0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2004-03-27 1:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
Andrew Morton <akpm@osdl.org> writes:
>
> I think this gets it right, but I probably screwed something up.
>
> static inline void prefetch_range(void *addr, size_t len)
> {
> #ifdef ARCH_HAS_PREFETCH
> char *cp;
> unsigned long end;
>
> end = ((unsigned long)addr + len + PREFETCH_STRIDE - 1);
> end &= ~(PREFETCH_STRIDE - 1);
>
> for (cp = addr; cp < (char *)end; cp += PREFETCH_STRIDE)
> prefetch(cp);
> #endif
> }
The memory/bus controller usually only has a limited queue of
outstanding transactions and for a big buffer you will likely overflow
it. Also usually on modern CPUs it is enough to do prefetch for 2-3
cache lines at the beginning, then an automatic hardware prefetcher
will kick in and take care of the rest.
-Andi
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-26 21:17 ` Andrew Morton
@ 2004-03-27 7:44 ` Arjan van de Ven
0 siblings, 0 replies; 22+ messages in thread
From: Arjan van de Ven @ 2004-03-27 7:44 UTC (permalink / raw)
To: Andrew Morton; +Cc: davidm, davidm, davej, mpm, linux-kernel
On Fri, Mar 26, 2004 at 01:17:36PM -0800, Andrew Morton wrote:
> If someone does
>
> prefetch_range(4090, 20);
>
> on 4k pagesize, what should we do?
well if the data structure you prefetch crosses a page boundary, you know
both pages are safe to prefetch.
> Issuing a single
>
> prefetch 4090
>
> sounds reasonable.
>
> In that case I'm arranging for it to perform
>
> prefetch (4096 - 32)
>
> in that case, which seems neater.
well prefetch is defined as "get the cache line the address is located in";
e.g. the cpu does the rounding to cache-line size for you.
(btw the prefetch stride isn't the cache-line size! it's how far you're
supposed to prefetch ahead to get any useful gain, so for aligning and
stuff it's useless)
* Re: Fw: potential /dev/urandom scalability improvement
2004-03-27 1:29 ` Fw: potential /dev/urandom scalability improvement Andi Kleen
@ 2004-03-27 15:48 ` Matt Mackall
0 siblings, 0 replies; 22+ messages in thread
From: Matt Mackall @ 2004-03-27 15:48 UTC (permalink / raw)
To: Andi Kleen; +Cc: Andrew Morton, linux-kernel
On Sat, Mar 27, 2004 at 02:29:12AM +0100, Andi Kleen wrote:
> Andrew Morton <akpm@osdl.org> writes:
> >
> > I think this gets it right, but I probably screwed something up.
> >
> > static inline void prefetch_range(void *addr, size_t len)
> > {
> > #ifdef ARCH_HAS_PREFETCH
> > char *cp;
> > unsigned long end;
> >
> > end = ((unsigned long)addr + len + PREFETCH_STRIDE - 1);
> > end &= ~(PREFETCH_STRIDE - 1);
> >
> > for (cp = addr; cp < (char *)end; cp += PREFETCH_STRIDE)
> > prefetch(cp);
> > #endif
> > }
>
> The memory/bus controller usually only has a limited queue of
> outstanding transactions and for a big buffer you will likely overflow
> it. Also usually on modern CPUs it is enough to do prefetch for 2-3
> cache lines at the beginning, then an automatic hardware prefetcher
> will kick in and take care of the rest.
Presumably the automatic prefetcher doesn't deal well with
non-sequential access patterns. But I suspect the right thing to do is
put the knowledge of what's sensible for prefetch priming inside an
arch-specific prefetch_range.
--
Matt Mackall : http://www.selenic.com : Linux development and consulting
end of thread, other threads:[~2004-03-27 15:48 UTC | newest]
Thread overview: 22+ messages
[not found] <1DLZM-8aK-67@gated-at.bofh.it>
[not found] ` <1DLZM-8aK-65@gated-at.bofh.it>
[not found] ` <1DOE1-20o-17@gated-at.bofh.it>
[not found] ` <1DOXn-2k7-5@gated-at.bofh.it>
[not found] ` <1DXxI-Z7-39@gated-at.bofh.it>
[not found] ` <1E467-6KK-17@gated-at.bofh.it>
[not found] ` <1E4IT-7f3-21@gated-at.bofh.it>
2004-03-27 1:29 ` Fw: potential /dev/urandom scalability improvement Andi Kleen
2004-03-27 15:48 ` Matt Mackall
[not found] <20040325141923.7080c6f0.akpm@osdl.org>
2004-03-25 22:47 ` Matt Mackall
2004-03-26 1:45 ` David Mosberger
2004-03-26 2:00 ` Andrew Morton
2004-03-26 2:10 ` David Mosberger
2004-03-26 4:07 ` Matt Mackall
2004-03-26 4:19 ` Matt Mackall
2004-03-26 4:51 ` David Mosberger
2004-03-26 5:15 ` Matt Mackall
2004-03-26 5:24 ` David Mosberger
2004-03-26 11:06 ` Dave Jones
2004-03-26 18:08 ` David Mosberger
2004-03-26 18:23 ` Dave Jones
2004-03-26 21:31 ` David Mosberger
2004-03-26 18:49 ` Andrew Morton
2004-03-26 20:25 ` David Mosberger
2004-03-26 20:33 ` Andrew Morton
2004-03-26 20:45 ` Arjan van de Ven
2004-03-26 21:17 ` Andrew Morton
2004-03-27 7:44 ` Arjan van de Ven
2004-03-26 21:12 ` David Mosberger