[RFC] [PATCH] cache pollution aware __copy_from_user

public inbox for linux-kernel@vger.kernel.org

* [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-14  9:16 Hiro Yoshioka
  2005-08-14  9:41 ` Arjan van de Ven
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-14  9:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: Hiro Yoshioka

Hi,

The following patch reduces the cache pollution caused by
__copy_from_user_ll().

While running a simple iozone benchmark to look for performance
bottlenecks in the Linux kernel, I found that __copy_from_user_ll()
consumed the most CPU cycles and caused many cache misses.

The following profiles were taken with oprofile.

Top 5 CPU cycle
CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
281538   15.2083  vmlinux                  __copy_from_user_ll
81069     4.3792  vmlinux                  _spin_lock
75523     4.0796  vmlinux                  journal_add_journal_head
63674     3.4396  vmlinux                  do_get_write_access
52634     2.8432  vmlinux                  journal_put_journal_head
(pattern9-0-cpu4-0-08141700/summary.out)

Top 5 Memory Access and Cache miss
CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name                 symbol name
120801    7.4379  37017    63.4603  vmlinux                  __copy_from_user_ll
84139     5.1806  885       1.5172  vmlinux                  _spin_lock
66027     4.0654  656       1.1246  vmlinux                  journal_add_journal_head
60400     3.7189  250       0.4286  vmlinux                  __find_get_block
60032     3.6963  120       0.2057  vmlinux                  journal_dirty_metadata

__copy_from_user_ll accounted for 63.4603% of the L3 cache misses although
it made only 7.4379% of the memory accesses.

To reduce the cache misses in __copy_from_user_ll, I wrote the following
patch and confirmed that the misses went down.

Top 5 CPU cycle
CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
120717    8.3454  vmlinux                  _mmx_memcpy_nt
65955     4.5596  vmlinux                  do_get_write_access
56088     3.8775  vmlinux                  journal_put_journal_head
52550     3.6329  vmlinux                  journal_dirty_metadata
38886     2.6883  vmlinux                  journal_add_journal_head
pattern9-0-cpu4-0-08141627/summary.out

_mmx_memcpy_nt is the new function called from __copy_from_user_ll.
It used only 42.88% of the CPU cycles of the original implementation.
(120717/281538==42.88%)

Top 5 Memory Access
CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name                 symbol name
90918     6.3079  89        0.5673  vmlinux                  _mmx_memcpy_nt
83654     5.8039  177       1.1283  vmlinux                  journal_dirty_metadata
57836     4.0127  348       2.2183  vmlinux                  journal_put_journal_head
48236     3.3466  165       1.0518  vmlinux                  do_get_write_access
44546     3.0906  21        0.1339  vmlinux                  __getblk

The L3 cache misses dropped from 37017 (63.4603%) to 89 (0.5673%), which
is 0.24% of the original implementation.

The elapsed times for five runs were 229.76 (sec) for the original and
222.94 (sec) with the patch. (229.76/222.94 = 1.0306, a 3.06% gain)

iozone -CMR -i 0 -+n -+u -s 8000MB -t 4 
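
The ratios quoted above follow directly from the sample counts and the
elapsed times. A minimal C sketch of that arithmetic is shown here;
percent_of() and percent_gain() are hypothetical helper names used only
for this illustration, they are not part of the patch.

```c
#include <assert.h>

/* "after" expressed as a percentage of "before" (sample counts). */
double percent_of(double after, double before)
{
	assert(before > 0.0);
	return 100.0 * after / before;
}

/* Relative gain, in percent, when elapsed time drops from before to after. */
double percent_gain(double before, double after)
{
	assert(after > 0.0);
	return 100.0 * (before / after - 1.0);
}
```

With the numbers from the profiles, percent_of(120717, 281538) gives the
CPU cycle ratio, percent_of(89, 37017) the L3 miss ratio, and
percent_gain(229.76, 222.94) the elapsed time gain.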

What do you think?

--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4/arch/i386/lib/usercopy.c	2005-08-12 13:18:14.106916200 +0900
@@ -10,6 +10,7 @@
  #include <linux/highmem.h>
  #include <linux/blkdev.h>
  #include <linux/module.h>
+#include <asm/i387.h>
  #include <asm/uaccess.h>
  #include <asm/mmx.h>
 
@@ -511,6 +512,108 @@
 		: "memory");						\
 } while (0)
 
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware                       */
+/* hyoshiok@miraclelinux.com               */
+static unsigned long _mmx_memcpy_nt(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+        kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+		"2:  movq (%0), %%mm0\n"
+		"  movq 8(%0), %%mm1\n"
+		"  movq 16(%0), %%mm2\n"
+		"  movq 24(%0), %%mm3\n"
+		"  movntq %%mm0, (%1)\n"
+		"  movntq %%mm1, 8(%1)\n"
+		"  movntq %%mm2, 16(%1)\n"
+		"  movntq %%mm3, 24(%1)\n"
+		"  movq 32(%0), %%mm0\n"
+		"  movq 40(%0), %%mm1\n"
+		"  movq 48(%0), %%mm2\n"
+		"  movq 56(%0), %%mm3\n"
+		"  movntq %%mm0, 32(%1)\n"
+		"  movntq %%mm1, 40(%1)\n"
+		"  movntq %%mm2, 48(%1)\n"
+		"  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+		"  movq (%0), %%mm0\n"
+		"  movq 8(%0), %%mm1\n"
+		"  movq 16(%0), %%mm2\n"
+		"  movq 24(%0), %%mm3\n"
+		"  movntq %%mm0, (%1)\n"
+		"  movntq %%mm1, 8(%1)\n"
+		"  movntq %%mm2, 16(%1)\n"
+		"  movntq %%mm3, 24(%1)\n"
+		"  movq 32(%0), %%mm0\n"
+		"  movq 40(%0), %%mm1\n"
+		"  movq 48(%0), %%mm2\n"
+		"  movq 56(%0), %%mm3\n"
+		"  movntq %%mm0, 32(%1)\n"
+		"  movntq %%mm1, 40(%1)\n"
+		"  movntq %%mm2, 48(%1)\n"
+		"  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	kernel_fpu_end();
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
 
  unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
@@ -575,10 +678,14 @@
  __copy_from_user_ll(void *to, const void __user *from, unsigned long n)
 {
 	BUG_ON((long)n < 0);
-	if (movsl_is_ok(to, from, n))
+	if (n < 512) {
+	  if (movsl_is_ok(to, from, n))
 		__copy_user_zeroing(to, from, n);
-	else
+	  else
 		n = __copy_user_zeroing_intel(to, from, n);
+	}
+	else
+	  n = _mmx_memcpy_nt(to, from, n);
 	return n;
 }
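
The size dispatch and the block split used by the patch above can be
sketched in plain user-space C. This is only an illustration of the
arithmetic (the `n < 512` cutoff, `len >> 6`, the `i > 5` loop bound and
`len & 63`); the enum, the struct and both function names are hypothetical
and do not exist in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Cutoff taken from the `n < 512` test in the patch. */
#define NT_COPY_THRESHOLD 512

/* Hypothetical labels for the two paths; the patch has no such enum. */
enum copy_path {
	COPY_PATH_REGULAR,	/* __copy_user_zeroing() or __copy_user_zeroing_intel() */
	COPY_PATH_NONTEMPORAL	/* _mmx_memcpy_nt() */
};

/* How _mmx_memcpy_nt() splits a request of len bytes. */
struct nt_copy_plan {
	size_t prefetch_blocks;	/* 64-byte blocks copied by the loop with prefetchnta */
	size_t plain_blocks;	/* 64-byte blocks copied by the loop without prefetch */
	size_t tail_bytes;	/* bytes left over for __copy_user_zeroing() */
};

enum copy_path choose_copy_path(size_t n)
{
	return n < NT_COPY_THRESHOLD ? COPY_PATH_REGULAR : COPY_PATH_NONTEMPORAL;
}

struct nt_copy_plan plan_nt_copy(size_t len)
{
	struct nt_copy_plan plan;
	size_t blocks = len >> 6;	/* len / 64, as in the patch */

	/* Only requests that take the non-temporal path are planned here. */
	assert(choose_copy_path(len) == COPY_PATH_NONTEMPORAL);

	/* The first loop runs while more than 5 blocks remain (i > 5). */
	plan.prefetch_blocks = blocks > 5 ? blocks - 5 : 0;
	plan.plain_blocks = blocks - plan.prefetch_blocks;
	plan.tail_bytes = len & 63;
	return plan;
}
```

For a 1000-byte request this gives 10 prefetching blocks, 5 plain blocks
and a 40-byte tail, while a 511-byte request stays on the regular path.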

Thanks in advance,
  Hiro
--
hyoshiok at miraclelinux.com


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-14  9:16 [RFC] [PATCH] cache pollution aware __copy_from_user_ll() Hiro Yoshioka
@ 2005-08-14  9:41 ` Arjan van de Ven
  2005-08-14 10:22   ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-14  9:41 UTC (permalink / raw)
  To: hyoshiok; +Cc: linux-kernel

On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> Hi,
> 
> The following is a patch to reduce a cache pollution
> of __copy_from_user_ll().
> 
> When I run simple iozone benchmark to find a performance bottleneck of
> the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> most and it did many cache misses.


however... you copy something from userspace... aren't you going to USE
it? The non-temporal versions actually throw the data out of the
cache... so while this part might be nice, you pay BIG elsewhere....




* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-14  9:41 ` Arjan van de Ven
@ 2005-08-14 10:22   ` Hiro Yoshioka
  2005-08-14 10:35     ` Arjan van de Ven
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-14 10:22 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

Thanks for your comments.

On 8/14/05, Arjan van de Ven <arjan@infradead.org> wrote:
> On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > Hi,
> >
> > The following is a patch to reduce a cache pollution
> > of __copy_from_user_ll().
> >
> > When I run simple iozone benchmark to find a performance bottleneck of
> > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > most and it did many cache misses.
> 
> 
> however... you copy something from userspace... aren't you going to USE
> it? The non-temporal versions actually throw the data out of the
> cache... so while this part might be nice, you pay BIG elsewhere....

The oprofile data does not show any evidence that we pay BIG elsewhere.

For example, the top 5 cache misses of the original 2.6.12.4 are the following,

37017 63.4603  vmlinux    __copy_from_user_ll
1049   1.7984  vmlinux    _spin_lock_irqsave
940    1.6115  vmlinux    blk_rq_map_sg
896    1.5361  vmlinux    generic_file_buffered_write
885    1.5172  vmlinux    _spin_lock
pattern9-0-cpu4-0-08141702

The top 5 cache misses of the cache aware version are
899 5.7305  vmlinux    blk_rq_map_sg
569 3.6270  vmlinux    journal_commit_transaction
531 3.3848  vmlinux    radix_tree_delete
514 3.2764  vmlinux    journal_add_journal_head
505 3.2190  vmlinux    release_pages
...
89 0.5673 vmlinux _mmx_memcpy_nt
pattern9-0-cpu4-0-08141625

What do you think?

Regards,
  Hiro


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-14 10:22   ` Hiro Yoshioka
@ 2005-08-14 10:35     ` Arjan van de Ven
  2005-08-14 10:45       ` Christoph Hellwig
  2005-08-15  6:43       ` Hiro Yoshioka
  0 siblings, 2 replies; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-14 10:35 UTC (permalink / raw)
  To: hyoshiok; +Cc: linux-kernel

On Sun, 2005-08-14 at 19:22 +0900, Hiro Yoshioka wrote:
> Thanks for your comments.
> 
> On 8/14/05, Arjan van de Ven <arjan@infradead.org> wrote:
> > On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > > Hi,
> > >
> > > The following is a patch to reduce a cache pollution
> > > of __copy_from_user_ll().
> > >
> > > When I run simple iozone benchmark to find a performance bottleneck of
> > > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > > most and it did many cache misses.
> > 
> > 
> > however... you copy something from userspace... aren't you going to USE
> > it? The non-temporal versions actually throw the data out of the
> > cache... so while this part might be nice, you pay BIG elsewhere....
> 
> The oprofile data does not show any evidence that we pay BIG elsewhere.


the problem is that the pay elsewhere is far more spread out, but not
less. At least generally....

I can see the point of a copy_from_user_nocache() or something, for
those cases where we *know* we are not going to use the copied data in
the cpu (but say, only do DMA).
But that should be explicit, not implicit, since the general case will
be that the kernel WILL use the data. And if that's the case your change
is a loss.... (just harder to see because the cost is spread out)
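
As a rough user-space illustration of such an explicit split (not kernel
code): the caller picks either a regular copy or a copy that uses
non-temporal stores. copy_cached(), copy_nocache() and copies_match() are
hypothetical names made up for this sketch. The streaming path assumes an
x86 compiler with SSE2 intrinsics and falls back to memcpy() elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Regular copy: the destination ends up in the cache, ready for the CPU. */
void *copy_cached(void *dst, const void *src, size_t n)
{
	assert(dst != NULL && src != NULL);
	return memcpy(dst, src, n);
}

/* Copy for data the CPU will not touch again soon (e.g. handed to DMA). */
void *copy_nocache(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head;

	assert(dst != NULL && src != NULL);

	/* Streaming stores need a 16-byte aligned destination. */
	head = (16 - ((uintptr_t)d & 15)) & 15;
	if (head > n)
		head = n;
	memcpy(d, s, head);
	d += head;
	s += head;
	n -= head;

	for (; n >= 16; n -= 16, d += 16, s += 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);	/* bypasses the cache */
	}
	_mm_sfence();	/* order the streaming stores before later stores */

	memcpy(d, s, n);	/* remaining tail, fewer than 16 bytes */
	return dst;
#else
	assert(dst != NULL && src != NULL);
	return memcpy(dst, src, n);	/* no streaming stores available */
#endif
}

/* Copy a test pattern with both helpers and report whether the results agree. */
int copies_match(size_t dst_offset, size_t n)
{
	static unsigned char src[4096], a[4096 + 16], b[4096 + 16];
	size_t i;

	assert(dst_offset < 16 && n <= sizeof(src));
	for (i = 0; i < n; i++)
		src[i] = (unsigned char)(i * 7 + 1);
	copy_cached(a + dst_offset, src, n);
	copy_nocache(b + dst_offset, src, n);
	return memcmp(a + dst_offset, src, n) == 0 &&
	       memcmp(b + dst_offset, src, n) == 0;
}
```

Both helpers produce the same bytes; the only difference is which one the
caller chooses, which is the point of making the choice explicit.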



* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-14 10:35     ` Arjan van de Ven
@ 2005-08-14 10:45       ` Christoph Hellwig
  2005-08-15  6:43       ` Hiro Yoshioka
  1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2005-08-14 10:45 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: hyoshiok, linux-kernel

> the problem is that the pay elsewhere is far more spread out, but not
> less. At least generally....
> 
> I can see the point of a copy_from_user_nocache() or something, for
> those cases where we *know* we are not going to use the copied data in
> the cpu (but say, only do DMA).
> But that should be explicit, not implicit, since the general case will
> be that the kernel WILL use the data.

Most of the callers probably want the normal one, but most of the copied
data (buffered filesystem I/O) will want the non cache polluting one.

So yes, doing this explicitly makes a lot of sense.



* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-14 10:35     ` Arjan van de Ven
  2005-08-14 10:45       ` Christoph Hellwig
@ 2005-08-15  6:43       ` Hiro Yoshioka
  2005-08-15  7:16         ` Arjan van de Ven
  1 sibling, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-15  6:43 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, Hiro Yoshioka

Hi,

From: Arjan van de Ven <arjan@infradead.org>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Sun, 14 Aug 2005 12:35:43 +0200
Message-ID: <1124015743.3222.17.camel@laptopd505.fenrus.org>

> On Sun, 2005-08-14 at 19:22 +0900, Hiro Yoshioka wrote:
> > Thanks for your comments.
> > 
> > On 8/14/05, Arjan van de Ven <arjan@infradead.org> wrote:
> > > On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > > > Hi,
> > > >
> > > > The following is a patch to reduce a cache pollution
> > > > of __copy_from_user_ll().
> > > >
> > > > When I run simple iozone benchmark to find a performance bottleneck of
> > > > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > > > most and it did many cache misses.
> > > 
> > > 
> > > however... you copy something from userspace... aren't you going to USE
> > > it? The non-temporal versions actually throw the data out of the
> > > cache... so while this part might be nice, you pay BIG elsewhere....
> > 
> > The oprofile data does not show any evidence that we pay BIG elsewhere.
> 
> 
> the problem is that the pay elsewhere is far more spread out, but not
> less. At least generally....
> 
> I can see the point of a copy_from_user_nocache() or something, for
> those cases where we *know* we are not going to use the copied data in
> the cpu (but say, only do DMA).
> But that should be explicit, not implicit, since the general case will
> be that the kernel WILL use the data. And if that's the case your change
> is a loss.... (just harder to see because the cost is spread out)

I understand that iozone is not a good benchmark and does not represent
any useful application, so I ran a kernel build as a simple benchmark.

What I did is
    cd /test/f1
    tar xjf ${baseDir}/src/linux-2.6.12.4.tar.bz2
    cd linux-2.6.12.4
    cp -p ${baseDir}/src/config .config
    make oldconfig
    time make -j $CPUS

The following is the Top 5 of CPU cycle
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
7347544  72.8296  cc1                      (no symbols)
532307    5.2763  libbz2.so.1.0.2          (no symbols)
241853    2.3973  vmlinux                  buffered_rmqueue
128552    1.2742  libc-2.3.4.so            _int_malloc
107784    1.0684  vmlinux                  page_fault
...
10749     0.1065  vmlinux                  __copy_from_user_ll
pattern12-0-cpu4-0-08150920/summary.out

Since __copy_from_user_ll is not a hot spot here, we did not see any big
performance difference. (The numbers are times in seconds for 5 runs.)

original 2.6.12.4               real    user    system
No profiling                    532.27	1797.02	194.9
BSQ 0x200+0x3f                  620.15	2094.21	212.38
GLOBAL_POWER_EVENTS:100000:	586.01	1984.92	215.97

cache aware 2.6.12.4            real    user    system
No profiling                    526.65	1792.22	190.05
BSQ 0x200+0x3f                  615.51	2090.74	206.58
GLOBAL_POWER_EVENTS:100000:     587.69	1978.66	209.18

Now Top 5 of Memory Access (2.6.12.4)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name             symbol name
11439689 82.2135  33906    27.9328  cc1                  (no symbols)
277177    1.9920  347       0.2859  libc-2.3.4.so        _int_malloc
229593    1.6500  12946    10.6653  libbz2.so.1.0.2      (no symbols)
84348     0.6062  116       0.0956  libc-2.3.4.so        _int_free
83653     0.6012  438       0.3608  libc-2.3.4.so        calloc
...
8527      0.0613  1648      1.3577  vmlinux              __copy_from_user_ll

Top 5 of Cache miss
33906   27.9328 cc1                     (no symbols)
30849   25.4144 vmlinux                 buffered_rmqueue
12946   10.6653 libbz2.so.1.0.2         (no symbols)
9178    7.5611  vmlinux                 __copy_to_user_ll
2934    2.4171  oprofiled               (no symbols)
...
1648    1.3577  vmlinux                 __copy_from_user_ll
pattern12-0-cpu4-0-08150917

Cache aware 2.6.12.4, Top 5 of Memory Access
samples  %        samples  %        app name             symbol name
11448487 82.8100  32786    28.1051  cc1                  (no symbols)
276812    2.0023  256       0.2195  libc-2.3.4.so        _int_malloc
230177    1.6649  12371    10.6048  libbz2.so.1.0.2      (no symbols)
84485     0.6111  120       0.1029  libc-2.3.4.so        _int_free
84043     0.6079  473       0.4055  libc-2.3.4.so        calloc
...
18282     0.1322  9060      7.7665  vmlinux              __copy_from_user_ll

Top 5 of Cache miss
32786   28.1051 cc1                     (no symbols)
31175   26.7241 vmlinux                 buffered_rmqueue
12371   10.6048 libbz2.so.1.0.2         (no symbols)
9060    7.7665  vmlinux                 __copy_from_user_ll
2801    2.4011  oprofiled               (no symbols)
...
0            0  vmlinux                 __copy_to_user_ll
pattern12-0-cpu4-0-08151048

The cache misses of __copy_from_user_ll increased, but those of
__copy_to_user_ll decreased to 0. (oprofile could not get a sample.)

I don't know why the misses of __copy_to_user_ll decreased.

Anyway, we have not yet found a big regression in the cache aware
version of __copy_from_user_ll.

What do you think?
  Hiro


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-15  6:43       ` Hiro Yoshioka
@ 2005-08-15  7:16         ` Arjan van de Ven
  2005-08-15  8:44           ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-15  7:16 UTC (permalink / raw)
  To: hyoshiok; +Cc: linux-kernel

> Anyway we could not find the cache aware version of __copy_from_user_ll
> has a big regression yet.

that is because you spread the cache misses out from one place to all
over the place, so that no one single point sticks out anymore.

Do you agree that your copy is less optimal for the case where the
kernel will (almost) immediately use the data? 

I agree that your copy is really nice for places where the kernel will
NOT use the data in the cpu, say for big write() system calls.

My suggestion is to realize there are basically 2 different use cases,
and that in the code the first one is very common, while in your
profiles the second one is very common. Based on that I suggest to make
a special copy_from_user_nocache() API for the cases where the kernel
will not use the data (and ignore software raid5 here) and use your
excellent version for that API, while leaving the code for the cases
where the kernel WILL use the data alone. Code wise the "will use" case
is the vast majority, so only changing the few places that know they
don't use the data will be very efficient, and will give immediate big
improvement in your profile data, since those few places tend to get
used a lot in the cases you benchmark.


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-15  7:16         ` Arjan van de Ven
@ 2005-08-15  8:44           ` Hiro Yoshioka
  2005-08-15  8:53             ` Arjan van de Ven
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-15  8:44 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, Hiro Yoshioka

Hi,

I appreciate your suggestion.

On 8/15/05, Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > Anyway we could not find the cache aware version of __copy_from_user_ll
> > has a big regression yet.
> 
> 
> that is because you spread the cache misses out from one place to all
> over the place, so that no one single point sticks out anymore.
> 
> Do you agree that your copy is less optimal for the case where the
> kernel will (almost) immediately use the data?

Yes, I do.

My server has 8KB of L1 cache. (512KB of L2/2MB of L3)

If you move more than 4KB of data with __copy_from_user_ll(), the
data spills out of the L1 cache but stays in L2 (or L3).
When you move a huge amount of data (> 1MB), even the L3 cache will not
help you. (This is known as cache pollution.)
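
The 4KB and 1MB figures line up with the quoted cache sizes if a copy of n
bytes is assumed to touch about 2 * n bytes (source plus destination).
The sketch below encodes only that idealized model; it ignores
associativity and everything else living in the caches, and the names
copy_fits_in() and enum cache_level are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cache sizes quoted in the message for this particular server. */
#define L1_BYTES (8u * 1024)
#define L2_BYTES (512u * 1024)
#define L3_BYTES (2u * 1024 * 1024)

enum cache_level { FITS_L1, FITS_L2, FITS_L3, FITS_NONE };

/* Smallest cache level that can hold both buffers of an n-byte copy. */
enum cache_level copy_fits_in(size_t n)
{
	size_t footprint;

	assert(n <= (size_t)-1 / 2);	/* keep 2 * n from overflowing */
	footprint = 2 * n;		/* assumption: source plus destination */

	if (footprint <= L1_BYTES)
		return FITS_L1;
	if (footprint <= L2_BYTES)
		return FITS_L2;
	if (footprint <= L3_BYTES)
		return FITS_L3;
	return FITS_NONE;
}
```

Under this model a 4KB copy still fits in L1, anything larger spills to
L2, and a copy of more than 1MB no longer fits even in the 2MB L3.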

> I agree that your copy is really nice for places where the kernel will
> NOT use the data in the cpu, say for big write() system calls.
> 
> My suggestion is to realize there are basically 2 different use cases,
> and that in the code the first one is very common, while in your
> profiles the second one is very common. Based on that I suggest to make
> a special copy_from_user_nocache() API for the cases where the kernel
> will not use the data (and ignore software raid5 here) and use your
> excellent version for that API, while leaving the code for the cases
> where the kernel WILL use the data alone. Code wise the "will use" case
> is the vast majority, so only changing the few places that know they
> don't use the data will be very efficient, and will give immediate big
> improvement in your profile data, since those few places tend to get
> used a lot in the cases you benchmark.

copy_from_user_nocache() is fine.

But I don't know where I can use it. (I'm not so
 familiar with the linux kernel file system yet.) 

Regards,
  Hiro


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-15  8:44           ` Hiro Yoshioka
@ 2005-08-15  8:53             ` Arjan van de Ven
  2005-08-15 23:33               ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-15  8:53 UTC (permalink / raw)
  To: hyoshiok; +Cc: linux-kernel

On Mon, 2005-08-15 at 17:44 +0900, Hiro Yoshioka wrote:
> Hi,
> 
> I appreciate your suggestion.
> 
> On 8/15/05, Arjan van de Ven <arjan@infradead.org> wrote:
> > 
> > > Anyway we could not find the cache aware version of __copy_from_user_ll
> > > has a big regression yet.
> > 
> > 
> > that is because you spread the cache misses out from one place to all
> > over the place, so that no one single point sticks out anymore.
> > 
> > Do you agree that your copy is less optimal for the case where the
> > kernel will (almost) immediately use the data?
> 
> Yes, I do.
> 
> My server has 8KB of L1 cache. (512KB of L2/2MB of L3)
> 
> If you move more than 4KB of data using by __copy_from_user_ll(), the
> data will be spilled over L1 cache but in L2 (or L3)

L2 access time isn't too bad. your code evicts the data even from L2 and
L3 though (even if it was in there before)..

> When you move huge data (> 1MB), even L3 cache will not help you.
> (This is known as a cache pollution.)

yes.
> copy_from_user_nocache() is fine.
> 
> But I don't know where I can use it. (I'm not so
>  familiar with the linux kernel file system yet.) 

I suspect the few cases where it will make the most difference will be
in the VFS for the write() system call, and the AIO variants thereof.

generic_file_buffered_write() will be a good candidate to try first...




* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-15  8:53             ` Arjan van de Ven
@ 2005-08-15 23:33               ` Hiro Yoshioka
  2005-08-16  3:30                 ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-15 23:33 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, Hiro Yoshioka

On 8/15/05, Arjan van de Ven <arjan@infradead.org> wrote:
> > copy_from_user_nocache() is fine.
> >
> > But I don't know where I can use it. (I'm not so
> >  familiar with the linux kernel file system yet.)
> 
> I suspect the few cases where it will make the most difference will be
> in the VFS for the write() system call, and the AIO variants thereof.
> 
> generic_file_buffered_write() will be a good candidate to try first...

Thanks.

filemap_copy_from_user() calls __copy_from_user_inatomic(), which calls
__copy_from_user_ll().

I'll look at the code.

Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-15 23:33               ` Hiro Yoshioka
@ 2005-08-16  3:30                 ` Hiro Yoshioka
  2005-08-16  4:17                   ` Hirokazu Takahashi
  2005-08-16  5:49                   ` Arjan van de Ven
  0 siblings, 2 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-16  3:30 UTC (permalink / raw)
  To: lkml.hyoshiok; +Cc: arjan, linux-kernel, hyoshiok

From: Hiro Yoshioka <lkml.hyoshiok@gmail.com>
Date: Tue, 16 Aug 2005 08:33:59 +0900

> Thanks.
> 
> filemap_copy_from_user() calls __copy_from_user_inatomic() calls
> __copy_from_user_ll().
> 
> I'll look at the code.

The following is a quick hack that adds cache aware variants
of __copy_from_user_ll() and __copy_from_user_inatomic():

__copy_from_user_ll_nocache() and __copy_from_user_inatomic_nocache()

filemap_copy_from_user() now calls __copy_from_user_inatomic_nocache()
instead of __copy_from_user_inatomic(), which reduces the cache misses.

The first column is the cache reference (memory access) and the
third column is the 3rd level cache miss.

The following example shows that the L3 cache misses are reduced from 37410 to 107.

2.6.12.4 nocache version
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %     app name       symbol name
120442    6.4106  107    0.5620  vmlinux        __copy_user_zeroing_nocache
80049     4.2606  578    3.0357  vmlinux        journal_add_journal_head
69194     3.6829  154    0.8088  vmlinux        journal_dirty_metadata
67059     3.5692  78     0.4097  vmlinux        __find_get_block
64145     3.4141  32     0.1681  vmlinux        journal_put_journal_head
pattern9-0-cpu4-0-08161154/summary.out

The 2.6.12.4 original version is
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %     app name       symbol name
120646    7.4680  37410 62.3355  vmlinux        __copy_from_user_ll
79508     4.9215  903    1.5046  vmlinux        _spin_lock
65526     4.0561  873    1.4547  vmlinux        journal_add_journal_head
59296     3.6704  129    0.2149  vmlinux        __find_get_block
58647     3.6302  215    0.3582  vmlinux        journal_dirty_metadata

What do you think?

Hiro

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nocache/Makefile
--- linux-2.6.12.4.orig/Makefile	2005-08-12 14:37:59.000000000 +0900
+++ linux-2.6.12.4.nocache/Makefile	2005-08-16 10:22:31.000000000 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.nocache
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c	2005-08-16 10:49:59.000000000 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>
 
@@ -511,6 +512,110 @@
 		: "memory");						\
 } while (0)
 
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware                       */
+/* hyoshiok@miraclelinux.com               */
+static unsigned long 
+__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+        kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+		"2:  movq (%0), %%mm0\n"
+		"  movq 8(%0), %%mm1\n"
+		"  movq 16(%0), %%mm2\n"
+		"  movq 24(%0), %%mm3\n"
+		"  movntq %%mm0, (%1)\n"
+		"  movntq %%mm1, 8(%1)\n"
+		"  movntq %%mm2, 16(%1)\n"
+		"  movntq %%mm3, 24(%1)\n"
+		"  movq 32(%0), %%mm0\n"
+		"  movq 40(%0), %%mm1\n"
+		"  movq 48(%0), %%mm2\n"
+		"  movq 56(%0), %%mm3\n"
+		"  movntq %%mm0, 32(%1)\n"
+		"  movntq %%mm1, 40(%1)\n"
+		"  movntq %%mm2, 48(%1)\n"
+		"  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+		"  movq (%0), %%mm0\n"
+		"  movq 8(%0), %%mm1\n"
+		"  movq 16(%0), %%mm2\n"
+		"  movq 24(%0), %%mm3\n"
+		"  movntq %%mm0, (%1)\n"
+		"  movntq %%mm1, 8(%1)\n"
+		"  movntq %%mm2, 16(%1)\n"
+		"  movntq %%mm3, 24(%1)\n"
+		"  movq 32(%0), %%mm0\n"
+		"  movq 40(%0), %%mm1\n"
+		"  movq 48(%0), %%mm2\n"
+		"  movq 56(%0), %%mm3\n"
+		"  movntq %%mm0, 32(%1)\n"
+		"  movntq %%mm1, 40(%1)\n"
+		"  movntq %%mm2, 48(%1)\n"
+		"  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	kernel_fpu_end();
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
+
 
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
@@ -582,6 +687,21 @@
 	return n;
 }
 
+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_nocache(to, from, n);
+	return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.12.4.orig/include/asm/uaccess.h linux-2.6.12.4.nocache/include/asm/uaccess.h
--- linux-2.6.12.4.orig/include/asm/uaccess.h	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nocache/include/asm/uaccess.h	2005-08-16 10:44:05.000000000 +0900
@@ -413,6 +413,8 @@
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
 				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+				const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +504,38 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
        might_sleep();
        return __copy_from_user_inatomic(to, from, n);
 }
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+       might_sleep();
+       return __copy_from_user_inatomic_nocache(to, from, n);
+}
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/include/asm-i386/uaccess.h linux-2.6.12.4.nocache/include/asm-i386/uaccess.h
--- linux-2.6.12.4.orig/include/asm-i386/uaccess.h	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nocache/include/asm-i386/uaccess.h	2005-08-16 10:44:05.000000000 +0900
@@ -413,6 +413,8 @@
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
 				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+				const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +504,38 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
        might_sleep();
        return __copy_from_user_inatomic(to, from, n);
 }
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+       might_sleep();
+       return __copy_from_user_inatomic_nocache(to, from, n);
+}
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/include/linux/autoconf.h linux-2.6.12.4.nocache/include/linux/autoconf.h
--- linux-2.6.12.4.orig/include/linux/autoconf.h	2005-08-15 16:53:01.000000000 +0900
+++ linux-2.6.12.4.nocache/include/linux/autoconf.h	2005-08-16 10:32:33.000000000 +0900
@@ -1,7 +1,7 @@
 /*
  * Automatically generated C config: don't edit
- * Linux kernel version: 2.6.12.4.orig
- * Mon Aug 15 16:53:01 2005
+ * Linux kernel version: 2.6.12.4.nocache
+ * Tue Aug 16 10:32:33 2005
  */
 #define AUTOCONF_INCLUDED
 #define CONFIG_X86 1
diff -ur linux-2.6.12.4.orig/mm/filemap.c linux-2.6.12.4.nocache/mm/filemap.c
--- linux-2.6.12.4.orig/mm/filemap.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nocache/mm/filemap.c	2005-08-16 10:16:06.000000000 +0900
@@ -1727,13 +1727,13 @@
 	int left;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
 		kunmap(page);
 	}
 	return bytes - left;
@@ -1750,7 +1750,7 @@
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;
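
A user-space sketch of the idea behind this patch, for experimenting outside the kernel. It is not the kernel code: it uses SSE2 intrinsics instead of MMX inline asm, it has no fault handling, and the name copy_nocache() and the reuse of the 512-byte cut-off are assumptions made for illustration. Small copies go through memcpy(); large ones use non-temporal stores so that the destination does not displace useful cache lines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define NOCACHE_THRESHOLD 512	/* same cut-off as the patch; an assumption here */

/* Hypothetical user-space analogue of __copy_user_zeroing_nocache(). */
static void copy_nocache(void *to, const void *from, size_t len)
{
	unsigned char *d = to;
	const unsigned char *s = from;

	if (len < NOCACHE_THRESHOLD) {
		memcpy(d, s, len);	/* small copy: keep it in the cache */
		return;
	}
#if defined(__SSE2__)
	{
		/* _mm_stream_si128() needs a 16-byte aligned destination. */
		size_t head = (16 - ((uintptr_t)d & 15)) & 15;

		memcpy(d, s, head);
		d += head;
		s += head;
		len -= head;
	}
	for (; len >= 64; len -= 64, d += 64, s += 64) {
		/* Unaligned loads from the source, 64 bytes per iteration. */
		__m128i a = _mm_loadu_si128((const __m128i *)s);
		__m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
		__m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
		__m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

		/* Non-temporal stores: write around the cache. */
		_mm_stream_si128((__m128i *)d, a);
		_mm_stream_si128((__m128i *)(d + 16), b);
		_mm_stream_si128((__m128i *)(d + 32), c);
		_mm_stream_si128((__m128i *)(d + 48), e);
	}
	_mm_sfence();	/* order the streaming stores before later stores */
#endif
	memcpy(d, s, len);	/* tail, or the whole copy when SSE2 is absent */
}
```

On builds without SSE2 the sketch simply degrades to memcpy(), so it only demonstrates the control flow there.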

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16  3:30                 ` Hiro Yoshioka
@ 2005-08-16  4:17                   ` Hirokazu Takahashi
  2005-08-16  4:54                     ` Hiro Yoshioka
  2005-08-16  5:44                     ` Arjan van de Ven
  2005-08-16  5:49                   ` Arjan van de Ven
  1 sibling, 2 replies; 63+ messages in thread
From: Hirokazu Takahashi @ 2005-08-16  4:17 UTC (permalink / raw)
  To: hyoshiok; +Cc: lkml.hyoshiok, arjan, linux-kernel

Hi,

BTW, what are you going to do about the page faults which may happen
during __copy_user_zeroing_nocache()? The current process may be blocked
in the handler for a while and get its FPU registers clobbered.
kernel_fpu_begin() won't help in that case. This is another issue, though.

> > Thanks.
> > 
> > filemap_copy_from_user() calls __copy_from_user_inatomic() calls
> > __copy_from_user_ll().
> > 
> > I'll look at the code.
> 
> The following is a quick hack: a cache-aware implementation
> of __copy_from_user_ll() and __copy_from_user_inatomic(), named
> __copy_from_user_ll_nocache() and __copy_from_user_inatomic_nocache().
> 
> filemap_copy_from_user() calls __copy_from_user_inatomic_nocache()
> instead of __copy_from_user_inatomic(), which reduces cache misses.
> 
> The first column is the cache reference (memory access) sample count
> and the third column is the 3rd level cache miss sample count.
> 
> The following example shows that the L3 cache miss samples in the copy
> routine are reduced from 37410 to 107.
> 
> 2.6.12.4 nocache version
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
> samples  %        samples  %     app name       symbol name
> 120442    6.4106  107    0.5620  vmlinux        __copy_user_zeroing_nocache
> 80049     4.2606  578    3.0357  vmlinux        journal_add_journal_head
> 69194     3.6829  154    0.8088  vmlinux        journal_dirty_metadata
> 67059     3.5692  78     0.4097  vmlinux        __find_get_block
> 64145     3.4141  32     0.1681  vmlinux        journal_put_journal_head
> pattern9-0-cpu4-0-08161154/summary.out
> 
> The 2.6.12.4 original version is
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
> samples  %        samples  %     app name       symbol name
> 120646    7.4680  37410 62.3355  vmlinux        __copy_from_user_ll
> 79508     4.9215  903    1.5046  vmlinux        _spin_lock
> 65526     4.0561  873    1.4547  vmlinux        journal_add_journal_head
> 59296     3.6704  129    0.2149  vmlinux        __find_get_block
> 58647     3.6302  215    0.3582  vmlinux        journal_dirty_metadata
> 
> What do you think?
> 
> Hiro
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16  4:17                   ` Hirokazu Takahashi
@ 2005-08-16  4:54                     ` Hiro Yoshioka
  2005-08-16  5:43                       ` Arjan van de Ven
  2005-08-16  5:44                     ` Arjan van de Ven
  1 sibling, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-16  4:54 UTC (permalink / raw)
  To: taka; +Cc: lkml.hyoshiok, arjan, linux-kernel, hyoshiok

Takahashi san,

I appreciate your comments.

> Hi,
> 
> BTW, what are you going to do with the page-faults which may happen
> during __copy_user_zeroing_nocache()? The current process may be blocked
> in the handler for a while and get FPU registers polluted.
> kernel_fpu_begin() won't help the case. This is another issue, though.

My code does nothing about it.

I need a volunteer to implement it.

Regards,
  Hiro

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16  4:54                     ` Hiro Yoshioka
@ 2005-08-16  5:43                       ` Arjan van de Ven
  2005-08-16 10:16                         ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-16  5:43 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: taka, lkml.hyoshiok, linux-kernel

On Tue, 2005-08-16 at 13:54 +0900, Hiro Yoshioka wrote:
> Takahashi san,
> 
> I appreciate your comments.
> 
> > Hi,
> > 
> > BTW, what are you going to do with the page-faults which may happen
> > during __copy_user_zeroing_nocache()? The current process may be blocked
> > in the handler for a while and get FPU registers polluted.
> > kernel_fpu_begin() won't help the case. This is another issue, though.
> 
> My code does nothing about it.
> 
> I need a volunteer to implement it.

it's actually not too hard; all you need is to use SSE and not MMX, and
then just store the SSE registers you're overwriting on the stack or so...




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16  5:43                       ` Arjan van de Ven
@ 2005-08-16 10:16                         ` Hiro Yoshioka
  2005-08-16 10:19                           ` Hirokazu Takahashi
  2005-08-16 10:25                           ` Arjan van de Ven
  0 siblings, 2 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-16 10:16 UTC (permalink / raw)
  To: arjan; +Cc: taka, lkml.hyoshiok, linux-kernel, hyoshiok

From: Arjan van de Ven <arjan@infradead.org>
> > My code does nothing about it.
> > 
> > I need a volunteer to implement it.
> 
> it's actually not too hard; all you need is to use SSE and not MMX; and
> then just store sse register you're overwriting on the stack or so...

oh, really? Does the linux kernel take care of
SSE save/restore on a task switch?

Regards,
  Hiro

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16 10:16                         ` Hiro Yoshioka
@ 2005-08-16 10:19                           ` Hirokazu Takahashi
  2005-08-16 10:25                           ` Arjan van de Ven
  1 sibling, 0 replies; 63+ messages in thread
From: Hirokazu Takahashi @ 2005-08-16 10:19 UTC (permalink / raw)
  To: hyoshiok; +Cc: arjan, lkml.hyoshiok, linux-kernel

Hi

> > > My code does nothing about it.
> > > 
> > > I need a volunteer to implement it.
> > 
> > it's actually not too hard; all you need is to use SSE and not MMX; and
> > then just store sse register you're overwriting on the stack or so...
> 
> oh, really? Does the linux kernel take care of
> SSE save/restore on a task switch?

Nope!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16 10:16                         ` Hiro Yoshioka
  2005-08-16 10:19                           ` Hirokazu Takahashi
@ 2005-08-16 10:25                           ` Arjan van de Ven
  2005-08-16 10:24                             ` Hirokazu Takahashi
  1 sibling, 1 reply; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-16 10:25 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: taka, lkml.hyoshiok, linux-kernel

On Tue, 2005-08-16 at 19:16 +0900, Hiro Yoshioka wrote:
> From: Arjan van de Ven <arjan@infradead.org>
> > > My code does nothing about it.
> > > 
> > > I need a volunteer to implement it.
> > 
> > it's actually not too hard; all you need is to use SSE and not MMX; and
> > then just store sse register you're overwriting on the stack or so...
> 
> oh, really? Does the linux kernel take care of
> SSE save/restore on a task switch?

not on kernel entry afaik.
However just save the register on the stack and put it back at the
end...
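
A minimal user-space sketch of "save the register on the stack and put it back at the end", assuming GCC or Clang on x86. The helper name copy16_keep_xmm0() is hypothetical, and a plain movups store is used where a real nocache copy would use a non-temporal store. In the kernel one would additionally have to keep preemption disabled and make sure the TS bit in CR0 is clear before touching SSE registers; none of that is modelled here.

```c
#include <string.h>

/*
 * Hypothetical helper: copy 16 bytes through %xmm0 without losing
 * whatever the caller had in %xmm0.
 */
static void copy16_keep_xmm0(void *to, const void *from)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__SSE__))
	unsigned char saved[16];	/* spill slot on our own stack */

	__asm__ __volatile__(
		"movups %%xmm0, (%0)\n\t"	/* save the live register */
		"movups (%1), %%xmm0\n\t"	/* load 16 source bytes */
		"movups %%xmm0, (%2)\n\t"	/* store them to the destination */
		"movups (%0), %%xmm0\n\t"	/* put the old contents back */
		: : "r" (saved), "r" (from), "r" (to)
		: "memory");
#else
	memcpy(to, from, 16);	/* no SSE available: plain copy */
#endif
}
```

Because the register is restored before the asm block ends, no xmm clobber is declared; the save area lives in the current stack frame, so nothing is written to task_struct.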



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16 10:25                           ` Arjan van de Ven
@ 2005-08-16 10:24                             ` Hirokazu Takahashi
  0 siblings, 0 replies; 63+ messages in thread
From: Hirokazu Takahashi @ 2005-08-16 10:24 UTC (permalink / raw)
  To: arjan; +Cc: hyoshiok, lkml.hyoshiok, linux-kernel

Hi,

> > > > My code does nothing about it.
> > > > 
> > > > I need a volunteer to implement it.
> > > 
> > > it's actually not too hard; all you need is to use SSE and not MMX; and
> > > then just store sse register you're overwriting on the stack or so...
> > 
> > oh, really? Does the linux kernel take care of
> > SSE save/restore on a task switch?
> 
> not on kernel entry afaik.
> However just save the register on the stack and put it back at the
> end...

I think this has to be done in the page-fault handlers.


Thanks,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16  4:17                   ` Hirokazu Takahashi
  2005-08-16  4:54                     ` Hiro Yoshioka
@ 2005-08-16  5:44                     ` Arjan van de Ven
  1 sibling, 0 replies; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-16  5:44 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: hyoshiok, lkml.hyoshiok, linux-kernel

On Tue, 2005-08-16 at 13:17 +0900, Hirokazu Takahashi wrote:
> Hi,
> 
> BTW, what are you going to do with the page-faults which may happen
> during __copy_user_zeroing_nocache()? The current process may be blocked
> in the handler for a while and get FPU registers polluted.
> kernel_fpu_begin() won't help the case. This is another issue, though.


__copy_from_user_inatomic

.. that implies it won't sleep actually ;)




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16  3:30                 ` Hiro Yoshioka
  2005-08-16  4:17                   ` Hirokazu Takahashi
@ 2005-08-16  5:49                   ` Arjan van de Ven
       [not found]                     ` <20050817.110503.97359275.taka@valinux.co.jp>
  1 sibling, 1 reply; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-16  5:49 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: lkml.hyoshiok, linux-kernel

On Tue, 2005-08-16 at 12:30 +0900, Hiro Yoshioka wrote:

> The following example shows the L3 cache miss is reduced from 37410 to 107.

most impressive; it seems the approach to do this selectively is paying
off very well! 

The only comment/question I have is about the use of prefetchnta; that
might have cache-evicting properties as well (eg evict the cache of the
original of the copy, eg the userspace memory). Is that really the right
approach? 
In addition, my measurements show that removing the prefetch from the
main copy loop is a gain because the modern cpus have an autoprefetcher
already in the hardware.

       "1: prefetchnta (%0)\n"         /* This set is 28 bytes */
+               "   prefetchnta 64(%0)\n"
+               "   prefetchnta 128(%0)\n"
+               "   prefetchnta 192(%0)\n"
+               "   prefetchnta 256(%0)\n"
+               "2:  \n"
+               ".section .fixup, \"ax\"\n"
+               "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */
+               "   jmp 2b\n"
+               ".previous\n"

oh and prefetch(nta) is a non-faulting instruction so no need for the
fixup handling...
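
A small user-space sketch of that point, assuming GCC or Clang: __builtin_prefetch() is only a hint and is documented not to fault on an invalid address, so the loop below can prefetch 320 bytes ahead (the same distance as the patch) even when that runs past the end of the buffer, with no fixup code. The function name copy_prefetch_ahead() is made up for this example; with locality 0 the builtin typically becomes prefetchnta on x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 64-byte block copy with a prefetch hint ahead. */
static void copy_prefetch_ahead(unsigned char *to, const unsigned char *from,
				size_t len)
{
	size_t off;

	for (off = 0; off + 64 <= len; off += 64) {
		/*
		 * Hint only: the address may lie beyond the buffer.
		 * It is formed via uintptr_t so no out-of-range
		 * pointer arithmetic is performed.
		 */
		__builtin_prefetch((const void *)((uintptr_t)from + off + 320),
				   0, 0);
		memcpy(to + off, from + off, 64);
	}
	memcpy(to + off, from + off, len - off);	/* tail */
}
```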

But overall this is starting to look really interesting!

Greetings,
   Arjan van de Ven

^ permalink raw reply	[flat|nested] 63+ messages in thread

[parent not found: <20050817.110503.97359275.taka@valinux.co.jp>]

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
       [not found]                     ` <20050817.110503.97359275.taka@valinux.co.jp>
@ 2005-08-17  5:10                       ` Hiro Yoshioka
  2005-08-17 14:30                         ` Akira Tsukamoto
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-17  5:10 UTC (permalink / raw)
  To: Akira Tsukamoto; +Cc: arjan, linux-kernel

Akira,

Thanks for your suggestions.

On 8/17/05, Akira Tsukamoto <akira-t@s9.dion.ne.jp> wrote:
> Anyway, going back to the copy_user topic,
> the big remaining issues are:
>   1) saving/restoring the floating point registers (80/64 bytes) twice every
>      time, by surrounding the copy with kernel_fpu_begin()/kernel_fpu_end(),
>      is a big penalty

I don't know. If nobody uses MMX/XMM, then there is no need
to save and restore.

>   2) after a page fault, control does not always come straight back to the
>      copy function, which corrupts the FP registers

I'm trying to understand this mechanism but I don't
understand it very well.

>   3) preemption is disabled for a long time
> Please correct me if I am wrong.
> 
> I once tried to implement an FPU save inside the page-fault handler; here is my junk:
> http://www.suna-asobi.com/~akira-t/linux/k7-copy-user/K7-copy_47_with_fpusave_not_finished.patch
> I never had time to finish it. Hiro, does it help you?

Thanks. I'm reading your patch but do not understand it very well yet.

I'll ask you about it.

Regards,
  Hiro
-- 
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-17  5:10                       ` Hiro Yoshioka
@ 2005-08-17 14:30                         ` Akira Tsukamoto
  2005-08-17 15:27                           ` Akira Tsukamoto
  2005-08-18  2:37                           ` Akira Tsukamoto
  0 siblings, 2 replies; 63+ messages in thread
From: Akira Tsukamoto @ 2005-08-17 14:30 UTC (permalink / raw)
  To: arjan, linux-kernel, Hirokazu Takahashi

On Wed, 17 Aug 2005 14:10:34 +0900
Hiro Yoshioka <lkml.hyoshiok@gmail.com> mentioned:
> On 8/17/05, Akira Tsukamoto <akira-t@s9.dion.ne.jp> wrote:
> > Anyway, going back to copy_user topic,
> > big remaining issues are
> >   1)store/restore floating point register (80/64bytes) twice every time by
> >      surrounding with kernel_fpu_begin()/kernel_fpu_end() is big penalty
> 
> I don't know. If nobody uses MMX/XMM, then there is no need
> to save and restore.

I think you are confusing two different things:
 1) lazy FPU save handling for user-space tasks
 2) kernel_fpu_begin()/kernel_fpu_end() inside the kernel

> >   2)after pagefault not always come back to copy function and corrupts fp register
> 
> I'm trying to understand this mechanism but I don't
> understand very well.

My explanation was a bit ambiguous; see the code below.
Where are the FP registers saved? They are saved *inside* task_struct:

static inline void kernel_fpu_begin(void)
+	if (tsk->flags & PF_USEDFPU) {
+		asm volatile("rex64 ; fxsave %0 ; fnclex"
+			       : "=m" (tsk->thread.i387.fxsave));

static inline void save_init_fpu( struct task_struct *tsk )
+	if ( cpu_has_fxsr ) {
+		asm volatile( "fxrstor %0"
+			      : : "m" (tsk->thread.i387.fxsave) );

What happens if, during your copy function, the memory is not yet allocated,
so the access generates a page fault, the kernel goes off to reclaim memory,
and a task switch to another task takes place?

-- 
Akira Tsukamoto



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-17 14:30                         ` Akira Tsukamoto
@ 2005-08-17 15:27                           ` Akira Tsukamoto
  2005-08-18 17:53                             ` Lee Revell
  2005-08-18  2:37                           ` Akira Tsukamoto
  1 sibling, 1 reply; 63+ messages in thread
From: Akira Tsukamoto @ 2005-08-17 15:27 UTC (permalink / raw)
  To: linux-kernel

I am resubmitting this because it seems to have been lost when I posted
it the day before yesterday.

------------------------------------
Arjan van de Ven mentioned:
> The only comment/question I have is about the use of prefetchnta; that
> might have cache-evicting properties as well (eg evict the cache of the
> original of the copy, eg the userspace memory). Is that really the right
> approach? 
> In addition, my measurements show that removing the prefetch from the
> main copy loop is a gain because the modern cpus have an autoprefetcher
> already in the hardware.

My Athlon K7 machine was faster with manual prefetching,
but I did not know that it is already becoming obsolete.

It was quite a while ago, but I also made a similar copy_user function:
http://www.suna-asobi.com/~akira-t/linux/k7-copy-user/K7-copy-47.patch
I added comments on each item in the copy function. It was basically
inspired by Takahashi's faster copy function for Intel.

I also have some explanation about the speedup for pipelined cpu.
http://www.suna-asobi.com/~akira-t/linux/k7-copy-user/copy_for_highlypipelined_cpu.txt

It was originally discussed in this thread,
http://marc.theaimsgroup.com/?l=linux-kernel&m=103742983924070&w=2

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-17 15:27                           ` Akira Tsukamoto
@ 2005-08-18 17:53                             ` Lee Revell
  0 siblings, 0 replies; 63+ messages in thread
From: Lee Revell @ 2005-08-18 17:53 UTC (permalink / raw)
  To: Akira Tsukamoto; +Cc: linux-kernel

On Thu, 2005-08-18 at 00:27 +0900, Akira Tsukamoto wrote:
> My computer with an Athlon K7 was faster with manual prefetching,
> but I did not know that it is already becoming obsolete.
> 

Don't listen to people who tell you $FOO hardware is obsolete, they have
a very narrow view.  "Obsolete" is meaningless except in reference to
some specific application.  The 386 is obsolete on the desktop but still
common on the embedded market.

Lee


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-17 14:30                         ` Akira Tsukamoto
  2005-08-17 15:27                           ` Akira Tsukamoto
@ 2005-08-18  2:37                           ` Akira Tsukamoto
  1 sibling, 0 replies; 63+ messages in thread
From: Akira Tsukamoto @ 2005-08-18  2:37 UTC (permalink / raw)
  To: arjan, linux-kernel, Hirokazu Takahashi

On Wed, 17 Aug 2005 23:30:13 +0900
Akira Tsukamoto <akira-t@suna-asobi.com> mentioned:
> > I'm trying to understand this mechanism but I don't
> > understand very well.
> 
> My explanation was a bit ambiguous, see the code below. 
> Where is the fp register saved? It saves the fp register *inside* task_struct,

To clarify further, with a view to making fp_save generic:
after an exception such as a page fault, the copy function might get
nested during page allocation.
The first copy has to save the user space fp content, but the nested
copy needs to save the kernel space fp content left by the first copy
function.
So saving into task_struct is a bit of a problem, since there is only
one save area per task.

XMM_SAVE/XMM_RESTORE uses the stack for it. 
Surrounding the copy loop with XMM_SAVE/XMM_RESTORE should work.

Some might claim that saving/restoring every time might be a big overhead,
but I think it is better than having a lot of cache misses.

Isn't there some way to avoid disabling preemption for a long time?
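
The nesting problem described above can be illustrated with a small
user-space sketch. This is not kernel code: struct fp_regs, task_slot and
all function names are made up for the illustration, and a plain struct
assignment stands in for the real register save/restore. It only shows why
a single per-task save area loses the outer state when the copy nests,
while a save area on the stack does not.

```c
#include <assert.h>

/* Hypothetical stand-in for the FP/MMX register file. */
struct fp_regs { int mm[4]; };

static struct fp_regs cpu_regs;   /* the "live" registers */
static struct fp_regs task_slot;  /* single save area, one per task */

/* Variant A: save into the one per-task slot. */
static void copy_with_task_slot(int depth)
{
	assert(depth == 0 || depth == 1);
	task_slot = cpu_regs;            /* overwrites any outer save */
	cpu_regs.mm[0] = 100 + depth;    /* the copy loop clobbers registers */
	if (depth == 0)
		copy_with_task_slot(1);  /* nested copy, e.g. after a fault */
	cpu_regs = task_slot;            /* restore from the shared slot */
}

/* Variant B: save on the stack, one save area per invocation. */
static void copy_with_stack_save(int depth)
{
	struct fp_regs saved = cpu_regs;

	assert(depth == 0 || depth == 1);
	cpu_regs.mm[0] = 100 + depth;
	if (depth == 0)
		copy_with_stack_save(1);
	cpu_regs = saved;                /* restore this level's own save */
}

/* Returns what "user space" sees in mm[0] after a nested copy. */
int run_task_slot(int user_value)
{
	cpu_regs.mm[0] = user_value;
	copy_with_task_slot(0);
	return cpu_regs.mm[0];
}

int run_stack_save(int user_value)
{
	cpu_regs.mm[0] = user_value;
	copy_with_stack_save(0);
	return cpu_regs.mm[0];
}
```

With a user value of 7, run_task_slot() hands back the value clobbered by
the outer copy, whereas run_stack_save() hands back 7 again.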

-- 
Akira Tsukamoto <akira-t@suna-asobi.com, at541@columbia.edu>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-14 21:24 Ian Kumlien
  2005-08-15  7:21 ` Arjan van de Ven
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Kumlien @ 2005-08-14 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: hch, arian, lkml.hyoshiok

[-- Attachment #1: Type: text/plain, Size: 846 bytes --]

Hi, all

I might be misunderstanding things but...

First of all, machines with long pipelines will suffer from cache misses
(p4 in this case).

Depending on the size copied (I don't know how large the copies are),
can't one run out of cachelines and/or evict more useful cache data?

I.e., if it's cached from beginning to end, we generally only need 'some
of' the beginning; the CPU's prefetch should manage the rest.

I might, as i said, not know all about things like this and i also
suffer from a fever but i still find Hiro's data interesting.

Isn't there some way to run the same test for the same time and measure
the differences in all-round data, to see if we really are punished as
badly when accessing the data post copy? (Could it be size dependent?)

-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-14 21:24 Ian Kumlien
@ 2005-08-15  7:21 ` Arjan van de Ven
  2005-08-15 14:49   ` Ian Kumlien
  0 siblings, 1 reply; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-15  7:21 UTC (permalink / raw)
  To: pomac; +Cc: linux-kernel, hch, arian, lkml.hyoshiok

On Sun, 2005-08-14 at 23:24 +0200, Ian Kumlien wrote:
> Hi, all
> 
> I might be misunderstanding things but...
> 
> First of all, machines with long pipelines will suffer from cache misses
> (p4 in this case).
> 
> Depending on the size copied, (i don't know how large they are so..)
> can't one run out of cachelines and/or evict more useful cache data?

CPU caches are really big nowadays

> 
> I.e., if it's cached from beginning to end, we generally only need 'some
> of' the beginning; the CPU's prefetch should manage the rest.

cpu prefetch isn't going to be fast enough. It helps some, but in the
end the cpu prefetch also has to wait for the ram, it doesn't make the
ram faster or free, it just takes a jumpstart on getting to it.

> I might, as i said, not know all about things like this and i also
> suffer from a fever but i still find Hiro's data interesting.

It is. It's good proof that you can make a big gain already by
converting a few key places to his excellent code. And neither Christoph
nor I are suggesting to ditch his effort! Instead we suggest that
what he is doing is useful for some cases and harmful for others,
that it is quite easy to identify those cases and separate them from
each other, and that as a result it is better to have 2 APIs,
one for each of the cases.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-15  7:21 ` Arjan van de Ven
@ 2005-08-15 14:49   ` Ian Kumlien
  0 siblings, 0 replies; 63+ messages in thread
From: Ian Kumlien @ 2005-08-15 14:49 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, hch, lkml.hyoshiok

[-- Attachment #1: Type: text/plain, Size: 2046 bytes --]

On Mon, 2005-08-15 at 09:21 +0200, Arjan van de Ven wrote:
> On Sun, 2005-08-14 at 23:24 +0200, Ian Kumlien wrote:
> > Hi, all
> > 
> > I might be misunderstanding things but...
> > 
> > First of all, machines with long pipelines will suffer from cache misses
> > (p4 in this case).
> > 
> > Depending on the size copied, (i don't know how large they are so..)
> > can't one run out of cachelines and/or evict more useful cache data?
> 
> CPU caches are really big nowadays

Yes, but (is copy to/from user size limited?) what is the cache size
compared to the size of the copy operation performed, and compared to
the useful cachelines lost? =)

> > I.e., if it's cached from beginning to end, we generally only need 'some
> > of' the beginning; the CPU's prefetch should manage the rest.
> 
> cpu prefetch isn't going to be fast enough. It helps some, but in the
> end the cpu prefetch also has to wait for the ram, it doesn't make the
> ram faster or free, it just takes a jumpstart on getting to it.

Yeah, I know, but I was thinking more of a compromise; then it might be
better...

> > I might, as i said, not know all about things like this and i also
> > suffer from a fever but i still find Hiro's data interesting.
> 
> It is. It's good proof that you can make a big gain already by
> converting a few key places to his excellent code. And neither me nor
> Christoph are suggesting to ditch his effort! Instead we suggest that
> what he is doing is useful for some cases and harmful for others, and
> that it is quite easy to identify those cases and separate them from
> each other, and that thus as a result it is more optimal to have 2 apis,
> one for each of the cases.

That's good to know, since I have wondered for a while why block io seems
so oddly slow...

I just thought that there might be some good compromise between the two
that would make it automatic.

Oh well, guess I'm back to coughing and waiting for patches to be
implemented =)
-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-15 12:15 linux
  2005-08-15 12:25 ` Arjan van de Ven
  0 siblings, 1 reply; 63+ messages in thread
From: linux @ 2005-08-15 12:15 UTC (permalink / raw)
  To: linux-kernel

Actually, is there any place *other* than write() to the page cache that
warrants a non-temporal store?  Network sockets with scatter/gather and
hardware checksum, maybe?

This is pretty much synonymous with what is allowed to go into high
memory, no?

While we're on the subject, for the copy_from_user source, prefetchnta is
probably indicated.  If user space hasn't caused it to be cached already
(admittedly, the common case), we *know* the kernel isn't going to look
at that data again.
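
For readers who want to experiment with the idea outside the kernel, here
is a minimal user-space sketch, assuming a GCC or Clang toolchain and an
x86 CPU with SSE2. It is not the patch's MMX code: copy_nocache_sketch is
a made-up name, __builtin_prefetch with locality 0 stands in for
prefetchnta on the source, and the SSE2 intrinsic _mm_stream_si128 stands
in for the movntq stores to the destination. Without SSE2, or with an
unaligned destination, it simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Copy len bytes from src to dst. Returns how many bytes went through
 * the non-temporal path (0 if the fallback handled everything).
 */
size_t copy_nocache_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t off = 0;

	assert(dst != NULL && src != NULL);

#if defined(__SSE2__)
	/* _mm_stream_si128 requires a 16-byte aligned destination. */
	if (((uintptr_t)d & 15) == 0) {
		for (; len - off >= 64; off += 64) {
			size_t k;

			/* Hint: source data will not be reused (locality 0). */
			if (len - off > 320)
				__builtin_prefetch(s + off + 320, 0, 0);

			for (k = 0; k < 64; k += 16) {
				__m128i v = _mm_loadu_si128(
					(const __m128i *)(s + off + k));
				/* Non-temporal store to the destination. */
				_mm_stream_si128((__m128i *)(d + off + k), v);
			}
		}
		/* Order the streaming stores before later stores. */
		_mm_sfence();
	}
#endif
	/* Tail (or the whole buffer in the fallback case). */
	memcpy(d + off, s + off, len - off);
	return off;
}
```

Whether this is a win depends on the workload: it only helps when the
destination is not going to be read again soon.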

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-15 12:15 linux
@ 2005-08-15 12:25 ` Arjan van de Ven
  0 siblings, 0 replies; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-15 12:25 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel

On Mon, 2005-08-15 at 08:15 -0400, linux@horizon.com wrote:
> Actually, is there any place *other* than write() to the page cache that
> warrants a non-temporal store?  Network sockets with scatter/gather and
> hardware checksum, maybe?

afaik those use zero copy already, i.e. a straight pagecache copy.
That's the only case where s/g is used right now, and that case
already doesn't copy.



^ permalink raw reply	[flat|nested] 63+ messages in thread

[parent not found: <20050815121555.29159.qmail@science.horizon.com.suse.lists.linux.kernel>]

[parent not found: <1124108702.3228.33.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>]

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
       [not found] ` <1124108702.3228.33.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>
@ 2005-08-15 15:02   ` Andi Kleen
  2005-08-15 15:09     ` Arjan van de Ven
  0 siblings, 1 reply; 63+ messages in thread
From: Andi Kleen @ 2005-08-15 15:02 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, linux, lkml.hyoshiok

Arjan van de Ven <arjan@infradead.org> writes:

> On Mon, 2005-08-15 at 08:15 -0400, linux@horizon.com wrote:
> > Actually, is there any place *other* than write() to the page cache that
> > warrants a non-temporal store?  Network sockets with scatter/gather and
> > hardware checksum, maybe?
> 
> afaik those use zero copy already, eg straight pagecache copy.

Only if you use sendfile(). And the normal write path uses csum_copy_* 

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-15 15:02   ` Andi Kleen
@ 2005-08-15 15:09     ` Arjan van de Ven
  2005-08-15 15:13       ` Andi Kleen
  0 siblings, 1 reply; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-15 15:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux, lkml.hyoshiok

On Mon, 2005-08-15 at 17:02 +0200, Andi Kleen wrote:
> Arjan van de Ven <arjan@infradead.org> writes:
> 
> > On Mon, 2005-08-15 at 08:15 -0400, linux@horizon.com wrote:
> > > Actually, is there any place *other* than write() to the page cache that
> > > warrants a non-temporal store?  Network sockets with scatter/gather and
> > > hardware checksum, maybe?
> > 
> > afaik those use zero copy already, eg straight pagecache copy.
> 
> Only if you use sendfile(). And the normal write path uses csum_copy_* 

but do those use s/g ? and hw csum?



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-15 15:09     ` Arjan van de Ven
@ 2005-08-15 15:13       ` Andi Kleen
  0 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2005-08-15 15:13 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andi Kleen, linux-kernel, linux, lkml.hyoshiok

On Mon, Aug 15, 2005 at 05:09:12PM +0200, Arjan van de Ven wrote:
> On Mon, 2005-08-15 at 17:02 +0200, Andi Kleen wrote:
> > Arjan van de Ven <arjan@infradead.org> writes:
> > 
> > > On Mon, 2005-08-15 at 08:15 -0400, linux@horizon.com wrote:
> > > > Actually, is there any place *other* than write() to the page cache that
> > > > warrants a non-temporal store?  Network sockets with scatter/gather and
> > > > hardware checksum, maybe?
> > > 
> > > afaik those use zero copy already, eg straight pagecache copy.
> > 
> > Only if you use sendfile(). And the normal write path uses csum_copy_* 
> 
> but do those use s/g ? 

sendfile yes. sendmsg also when the MTU of the device is larger than a page.

> and hw csum?

sendmsg normally not.

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

[parent not found: <20050816.131729.15816429.taka@valinux.co.jp.suse.lists.linux.kernel>]

[parent not found: <20050816.135425.719901536.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>]

[parent not found: <1124171015.3215.0.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>]

[parent not found: <20050816.191617.1025215458.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>]

[parent not found: <1124187950.3215.31.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>]

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
       [not found]       ` <1124187950.3215.31.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>
@ 2005-08-16 13:15         ` Andi Kleen
  2005-08-18 11:06           ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Andi Kleen @ 2005-08-16 13:15 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: taka, linux-kernel

Arjan van de Ven <arjan@infradead.org> writes:
> 
> not on kernel entry afaik.
> However just save the register on the stack and put it back at the
> end...

You need to do more than that, like disabling lazy FPU mode. 
That is what kernel_fpu_begin/end takes care of. 

However it disables preemption, which especially for bigger
copies will probably make the low latency people unhappy.

Without disabling preemption there is no way to use SSE right now.

Note that there is also an integer NT store in SSE1, however at least
in Athlon K7 it is microcoded and very slow. 

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16 13:15         ` Andi Kleen
@ 2005-08-18 11:06           ` Hiro Yoshioka
  2005-08-18 11:11             ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-18 11:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Arjan van de Ven, taka, linux-kernel, Hiro Yoshioka

On 16 Aug 2005 15:15:35 +0200, Andi Kleen <ak@suse.de> wrote:
> However it disables preemption, which especially for bigger
> copies will probably make the low latency people unhappy.

In the copy loop,
+#ifdef CONFIG_PREEMPT
+               if ( (i%64)==0 ) {
+                   MMX_RESTORE;
+                   MMX_SAVE;
+               };
+#endif

It costs several hundred clocks (wow) for every 4KB copied.

It kills throughput but it makes the low latency people smile.

So I made two APIs:
__copy_user_zeroing_nocache()
__copy_user_zeroing_inatomic_nocache()

The former is the low latency version and the latter is the throughput version.
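
The 4KB figure follows from the loop structure: each iteration copies 64
bytes, and the check fires when the remaining block count is a multiple of
64, i.e. once per 64 * 64 = 4096 bytes. A user-space sketch of that
structure, with made-up names: memcpy() stands in for the movq/movntq
block, and a counter stands in for the MMX_RESTORE/MMX_SAVE pair that
briefly re-enables preemption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts how often the "preemption disabled" section was left. */
unsigned long preempt_windows;

/* Stand-ins for MMX_SAVE and MMX_RESTORE (hypothetical names). */
static void fpu_section_begin(void) { }
static void fpu_section_end(void) { preempt_windows++; }

void copy_low_latency_sketch(void *to, const void *from, size_t len)
{
	unsigned char *d = to;
	const unsigned char *s = from;
	size_t blocks = len >> 6;        /* number of 64-byte blocks */

	assert(to != NULL && from != NULL);

	fpu_section_begin();
	for (; blocks > 0; blocks--) {
		memcpy(d, s, 64);        /* stands in for one 64-byte MMX block */
		d += 64;
		s += 64;
		/* Every 64 blocks (4KB), open a short preemption window. */
		if ((blocks % 64) == 0) {
			fpu_section_end();
			fpu_section_begin();
		}
	}
	fpu_section_end();

	memcpy(d, s, len & 63);          /* tail, outside the section */
}
```

The throughput variant is the same loop without the window in the middle,
so it leaves the section exactly once per call.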

What do you think?

Regards,
  Hiro

-- 
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-18 11:06           ` Hiro Yoshioka
@ 2005-08-18 11:11             ` Hiro Yoshioka
  2005-08-18 23:29               ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-18 11:11 UTC (permalink / raw)
  To: lkml.hyoshiok; +Cc: ak, arjan, taka, linux-kernel, hyoshiok

> So I make two APIs. 
> __copy_user_zeroing_nocache()
> __copy_user_zeroing_inatomic_nocache()
> 
> The former is a low latency version and the other is a throughput version.

1) using the stack to save/restore MMX registers
2) low latency version of cache aware copy
3) __copy_user*_nocache APIs, so you can use them if you want to.

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.preempt/Makefile
--- linux-2.6.12.4.orig/Makefile	2005-08-12 14:37:59.000000000 +0900
+++ linux-2.6.12.4.preempt/Makefile	2005-08-18 18:47:07.000000000 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.preempt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c	2005-08-18 19:07:49.000000000 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>
 
@@ -511,6 +512,254 @@
 		: "memory");						\
 } while (0)
 
+#define MMX_SAVE do {                           \
+        preempt_disable();                      \
+        __asm__ __volatile__ (                  \
+                "movl %%cr0,%0          ;\n\t"  \
+                "clts                   ;\n\t"  \
+                "movq %%mm0,(%1)     ;\n\t"     \
+                "movq %%mm1,8(%1) ;\n\t"     \
+                "movq %%mm2,16(%1) ;\n\t"     \
+                "movq %%mm3,24(%1) ;\n\t"     \
+                : "=&r" (cr0)                   \
+                : "r" (mmx_save)                \
+                : "memory");                    \
+} while(0)
+
+#define MMX_RESTORE do {                       \
+        __asm__ __volatile__ (                  \
+                "sfence                 ;\n\t"  \
+                "movq (%1),%%mm0     ;\n\t"  \
+                "movq 8(%1),%%mm1 ;\n\t"  \
+                "movq 16(%1),%%mm2 ;\n\t"  \
+                "movq 24(%1),%%mm3 ;\n\t"  \
+                "movl   %0,%%cr0        ;\n\t"  \
+                :                               \
+                : "r" (cr0), "r" (mmx_save)     \
+                : "memory");                    \
+        preempt_enable();                       \
+} while(0)
+
+#define ALIGN8 __attribute__((aligned(8)))
+
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware                       */
+/* hyoshiok@miraclelinux.com               */
+static unsigned long 
+__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+        char mmx_save[8*4] ALIGN8;
+        int cr0;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+	/*        kernel_fpu_begin();*/
+	MMX_SAVE;
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+                "2:  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+#ifdef CONFIG_PREEMPT
+		if ( (i%64)==0 ) {
+		    MMX_RESTORE;
+		    MMX_SAVE;
+		};
+#endif
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+                "  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	/*	kernel_fpu_end();*/
+	MMX_RESTORE;
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
+
+static unsigned long 
+__copy_user_zeroing_inatomic_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+        char mmx_save[8*4] ALIGN8;
+        int cr0;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+	/*        kernel_fpu_begin();*/
+	MMX_SAVE;
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+                "2:  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+                "  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	/*	kernel_fpu_end();*/
+	MMX_RESTORE;
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
 
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
@@ -582,6 +831,36 @@
 	return n;
 }
 
+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_nocache(to, from, n);
+	return n;
+}
+
+unsigned long
+__copy_from_user_ll_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_inatomic_nocache(to, from, n);
+	return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.12.4.orig/include/asm-i386/uaccess.h linux-2.6.12.4.preempt/include/asm-i386/uaccess.h
--- linux-2.6.12.4.orig/include/asm-i386/uaccess.h	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/include/asm-i386/uaccess.h	2005-08-18 19:16:55.000000000 +0900
@@ -413,6 +413,10 @@
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
 				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_inatomic_nocache(void *to,
+				const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +506,55 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_inatomic_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
        might_sleep();
        return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+       might_sleep();
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/mm/filemap.c linux-2.6.12.4.preempt/mm/filemap.c
--- linux-2.6.12.4.orig/mm/filemap.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/mm/filemap.c	2005-08-16 10:16:06.000000000 +0900
@@ -1727,13 +1727,13 @@
 	int left;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
 		kunmap(page);
 	}
 	return bytes - left;
@@ -1750,7 +1750,7 @@
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;


Regards,
  Hiro
--
Hiro Yoshioka
CTO/Miracle Linux Corporation

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-18 11:11             ` Hiro Yoshioka
@ 2005-08-18 23:29               ` Hiro Yoshioka
  2005-08-22  1:24                 ` Hiro Yoshioka
                                   ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-18 23:29 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: ak, arjan, taka, linux-kernel

Hi,

On 8/18/05, Hiro Yoshioka <hyoshiok@miraclelinux.com> wrote:
> 1) using stack to save/restore MMX registers

It seems to me that this causes a regression.
I'd like to roll it back and use kernel_fpu_begin() and kernel_fpu_end().

Regards,
  Hiro
-- 
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-18 23:29               ` Hiro Yoshioka
@ 2005-08-22  1:24                 ` Hiro Yoshioka
  2005-08-22 13:07                   ` Andi Kleen
  2005-08-22  2:43                 ` Hiro Yoshioka
  2005-08-22 23:12                 ` Hiro Yoshioka
  2 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-22  1:24 UTC (permalink / raw)
  To: lkml.hyoshiok; +Cc: ak, arjan, taka, linux-kernel, hyoshiok

> On 8/18/05, Hiro Yoshioka <hyoshiok@miraclelinux.com> wrote:
> > 1) using stack to save/restore MMX registers
> 
> It seems to me that it has some regression.
> I'd like to rollback it and use kernel_fpu_begin() and kernel_fpu_end().

The following is the current version of the cache aware copy_from_user_ll.

1) using kernel_fpu_begin()/kernel_fpu_end()
2) low latency version of cache aware copy
3) __copy_user*_nocache APIs, so you can use them if you want to.
(There is no change in the current APIs.)

Some performance data:

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig    1921587
2.6.12.4.preempt 1634411
1634411/1921587=85.06% (15% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)
2.6.12.4.orig      57427
2.6.12.4.preempt   17398

samples  %
37408    65.1412  vmlinux                  __copy_from_user_ll
51        0.2931  vmlinux                  __copy_user_zeroing_inatomic_nocache
51/37408=0.136% (99.86% reduction)
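
The ratios above can be rechecked with a few lines of C. The sample counts
are the ones quoted in this message; percent_of is a made-up helper name
for this sketch.

```c
#include <assert.h>

/* Returns part as a percentage of whole. */
double percent_of(unsigned long part, unsigned long whole)
{
	assert(whole != 0);
	return 100.0 * (double)part / (double)whole;
}

/* Returns the reduction, in percent, when going from before to after. */
double percent_reduction(unsigned long before, unsigned long after)
{
	return 100.0 - percent_of(after, before);
}
```

For example, percent_of(1634411, 1921587) gives the CPU cycle ratio,
percent_reduction(57427, 17398) the drop in total L3 cache miss samples
(roughly 70%), and percent_of(51, 37408) the per-function ratio.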

Top 5 2.6.12.4.orig
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
287643   14.9692  vmlinux                  __copy_from_user_ll
72660     3.7813  vmlinux                  journal_add_journal_head
65011     3.3832  vmlinux                  do_get_write_access
50618     2.6342  vmlinux                  journal_put_journal_head
48068     2.5015  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08191743/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name                 symbol name
134756    7.9364  vmlinux                  __copy_from_user_ll
57735     3.4003  vmlinux                  journal_add_journal_head
50653     2.9832  vmlinux                  __find_get_block
44522     2.6221  vmlinux                  journal_put_journal_head
38928     2.2927  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08191741/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name                 symbol name
37408    65.1412  vmlinux                  __copy_from_user_ll
953       1.6595  vmlinux                  blk_rq_map_sg
886       1.5429  vmlinux                  sub_preempt_count
680       1.1841  vmlinux                  journal_add_journal_head
598       1.0413  vmlinux                  journal_commit_transaction
pattern9-0-cpu4-0-08191720/summary.out

Top 5 2.6.12.4.preempt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
123531    7.5582  vmlinux                  __copy_user_zeroing_inatomic_nocache
64820     3.9660  vmlinux                  journal_add_journal_head
60460     3.6992  vmlinux                  do_get_write_access
47172     2.8862  vmlinux                  journal_put_journal_head
46753     2.8606  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08190838/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name                 symbol name
126762    6.7993  vmlinux                  __copy_user_zeroing_inatomic_nocache
79803     4.2805  vmlinux                  journal_add_journal_head
70271     3.7692  vmlinux                  journal_dirty_metadata
66146     3.5480  vmlinux                  __find_get_block
58082     3.1154  vmlinux                  journal_put_journal_head
pattern9-0-cpu4-0-08190855/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name                 symbol name
901       5.1788  vmlinux                  blk_rq_map_sg
675       3.8798  vmlinux                  journal_commit_transaction
637       3.6613  vmlinux                  radix_tree_delete
605       3.4774  vmlinux                  journal_add_journal_head
580       3.3337  vmlinux                  release_pages
...
51        0.2931  vmlinux                  __copy_user_zeroing_inatomic_nocache
...
1         0.0057  vmlinux                  __copy_from_user_ll_inatomic_nocache
pattern9-0-cpu4-0-08190859/summary.out

2.6.12.4-usercopy.c.patch.050819
diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.preempt/Makefile
--- linux-2.6.12.4.orig/Makefile	2005-08-12 14:37:59.000000000 +0900
+++ linux-2.6.12.4.preempt/Makefile	2005-08-18 18:47:07.000000000 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.preempt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c	2005-08-19 08:25:08.000000000 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>
 
@@ -511,6 +512,216 @@
 		: "memory");						\
 } while (0)
 
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware                       */
+/* hyoshiok@miraclelinux.com               */
+static unsigned long 
+__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+	kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+                "2:  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+#ifdef CONFIG_PREEMPT
+		if ( (i%256)==0 ) {
+		  kernel_fpu_end();
+		  kernel_fpu_begin();
+		};
+#endif
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+                "  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	kernel_fpu_end();
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
+
+static unsigned long 
+__copy_user_zeroing_inatomic_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+        kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+                "2:  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+                "  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	kernel_fpu_end();
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
 
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
@@ -582,6 +793,36 @@
 	return n;
 }
 
+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_nocache(to, from, n);
+	return n;
+}
+
+unsigned long
+__copy_from_user_ll_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_inatomic_nocache(to, from, n);
+	return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.12.4.orig/include/asm-i386/uaccess.h linux-2.6.12.4.preempt/include/asm-i386/uaccess.h
--- linux-2.6.12.4.orig/include/asm-i386/uaccess.h	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/include/asm-i386/uaccess.h	2005-08-18 19:16:55.000000000 +0900
@@ -413,6 +413,10 @@
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
 				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_inatomic_nocache(void *to,
+				const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +506,55 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_inatomic_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
        might_sleep();
        return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+       might_sleep();
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/mm/filemap.c linux-2.6.12.4.preempt/mm/filemap.c
--- linux-2.6.12.4.orig/mm/filemap.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/mm/filemap.c	2005-08-16 10:16:06.000000000 +0900
@@ -1727,13 +1727,13 @@
 	int left;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
 		kunmap(page);
 	}
 	return bytes - left;
@@ -1750,7 +1750,7 @@
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-22  1:24                 ` Hiro Yoshioka
@ 2005-08-22 13:07                   ` Andi Kleen
  0 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2005-08-22 13:07 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: lkml.hyoshiok, ak, arjan, taka, linux-kernel

> 2) low latency version of cache aware copy

Having a low latency version that is only active with CONFIG_PREEMPT 
is bad - non preempt kernels need good latency too.
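
To make the point concrete: in the posted patch the copy loop reopens the FPU section every 256 blocks of 64 bytes (16 KiB), but only under #ifdef CONFIG_PREEMPT. A minimal user-space sketch of a copy that breaks at a fixed interval unconditionally might look like the following. The names copy_in_chunks, break_fn and count_break are hypothetical, and the callback only stands in for the place where kernel code would call kernel_fpu_end()/kernel_fpu_begin().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type, invoked between chunks. */
typedef void (*break_fn)(void *ctx);

/* Example callback: counts how often the copy paused. */
static void count_break(void *ctx)
{
	++*(size_t *)ctx;
}

/*
 * Copy len bytes in pieces of at most chunk bytes and call on_break after
 * every full chunk that is followed by more data, independent of any
 * preemption configuration. Returns the number of break points taken.
 */
static size_t copy_in_chunks(void *dst, const void *src, size_t len,
			     size_t chunk, break_fn on_break, void *ctx)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t breaks = 0;

	assert(chunk > 0);
	while (len > chunk) {
		memcpy(d, s, chunk);
		d += chunk;
		s += chunk;
		len -= chunk;
		if (on_break)
			on_break(ctx);
		breaks++;
	}
	memcpy(d, s, len);	/* last (possibly short) chunk */
	return breaks;
}
```

With a chunk of 16384 bytes this matches the interval used in the patch, while the latency bound no longer depends on the kernel configuration.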

-Andi

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-18 23:29               ` Hiro Yoshioka
  2005-08-22  1:24                 ` Hiro Yoshioka
@ 2005-08-22  2:43                 ` Hiro Yoshioka
  2005-08-22 23:12                 ` Hiro Yoshioka
  2 siblings, 0 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-22  2:43 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: linux-kernel

Hi,

It seems that this mail did not go out, so I am resending it.

> On 8/18/05, Hiro Yoshioka <hyoshiok@miraclelinux.com> wrote:
> > 1) using stack to save/restore MMX registers
> 
> It seems to me that it has some regression.
> I'd like to rollback it and use kernel_fpu_begin() and kernel_fpu_end().

The following is the current version of the cache-aware copy_from_user_ll.

1) using kernel_fpu_begin()/kernel_fpu_end()
2) low latency version of cache aware copy
3) __copy_user*_nocache APIs are added, so you can use them if you want to.
(There is no change in the current APIs.)
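
The new *_nocache entry points in the patch below choose a copy routine by size: requests under 512 bytes go through the existing routines, and larger ones are split into 64-byte blocks plus a tail of len & 63 bytes. The following stand-alone sketch shows only that bookkeeping; struct copy_plan and plan_copy are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define NOCACHE_THRESHOLD 512	/* same cutoff as the n < 512 test in the patch */
#define BLOCK_SHIFT 6		/* 64-byte blocks, as in len >> 6 */

/* Hypothetical description of how one request would be handled. */
struct copy_plan {
	int use_nocache;	/* 1 if the non-temporal path would be taken */
	size_t blocks;		/* number of 64-byte blocks copied with movntq */
	size_t tail;		/* bytes left for the ordinary copy routine */
};

static struct copy_plan plan_copy(size_t n)
{
	struct copy_plan p = { 0, 0, n };

	if (n >= NOCACHE_THRESHOLD) {
		p.use_nocache = 1;
		p.blocks = n >> BLOCK_SHIFT;
		p.tail = n & ((1u << BLOCK_SHIFT) - 1);
	}
	return p;
}
```

For example, a 4106-byte request would be planned as 64 blocks plus a 10-byte tail, while a 511-byte request stays entirely on the ordinary path.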

Some performance data:

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig    1921587
2.6.12.4.preempt 1634411
1634411/1921587=85.06% (15% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)
2.6.12.4.orig      57427
2.6.12.4.preempt   17398

samples  %
37408    65.1412  vmlinux                  __copy_from_user_ll
51        0.2931  vmlinux                  __copy_user_zeroing_inatomic_nocache
51/37408=0.136% (99.86% reduction)

Top 5 2.6.12.4.orig
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
287643   14.9692  vmlinux                  __copy_from_user_ll
72660     3.7813  vmlinux                  journal_add_journal_head
65011     3.3832  vmlinux                  do_get_write_access
50618     2.6342  vmlinux                  journal_put_journal_head
48068     2.5015  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08191743/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name                 symbol name
134756    7.9364  vmlinux                  __copy_from_user_ll
57735     3.4003  vmlinux                  journal_add_journal_head
50653     2.9832  vmlinux                  __find_get_block
44522     2.6221  vmlinux                  journal_put_journal_head
38928     2.2927  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08191741/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name                 symbol name
37408    65.1412  vmlinux                  __copy_from_user_ll
953       1.6595  vmlinux                  blk_rq_map_sg
886       1.5429  vmlinux                  sub_preempt_count
680       1.1841  vmlinux                  journal_add_journal_head
598       1.0413  vmlinux                  journal_commit_transaction
pattern9-0-cpu4-0-08191720/summary.out

Top 5 2.6.12.4.preempt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
123531    7.5582  vmlinux                  __copy_user_zeroing_inatomic_nocache
64820     3.9660  vmlinux                  journal_add_journal_head
60460     3.6992  vmlinux                  do_get_write_access
47172     2.8862  vmlinux                  journal_put_journal_head
46753     2.8606  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08190838/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name                 symbol name
126762    6.7993  vmlinux                  __copy_user_zeroing_inatomic_nocache
79803     4.2805  vmlinux                  journal_add_journal_head
70271     3.7692  vmlinux                  journal_dirty_metadata
66146     3.5480  vmlinux                  __find_get_block
58082     3.1154  vmlinux                  journal_put_journal_head
pattern9-0-cpu4-0-08190855/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name                 symbol name
901       5.1788  vmlinux                  blk_rq_map_sg
675       3.8798  vmlinux                  journal_commit_transaction
637       3.6613  vmlinux                  radix_tree_delete
605       3.4774  vmlinux                  journal_add_journal_head
580       3.3337  vmlinux                  release_pages
...
51        0.2931  vmlinux                  __copy_user_zeroing_inatomic_nocache
...
1         0.0057  vmlinux                  __copy_from_user_ll_inatomic_nocache
pattern9-0-cpu4-0-08190859/summary.out

2.6.12.4-usercopy.c.patch.050819
diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.preempt/Makefile
--- linux-2.6.12.4.orig/Makefile	2005-08-12 14:37:59.000000000 +0900
+++ linux-2.6.12.4.preempt/Makefile	2005-08-18 18:47:07.000000000 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.preempt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c	2005-08-19 08:25:08.000000000 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>
 
@@ -511,6 +512,216 @@
 		: "memory");						\
 } while (0)
 
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware                       */
+/* hyoshiok@miraclelinux.com               */
+static unsigned long 
+__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+	kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+                "2:  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+#ifdef CONFIG_PREEMPT
+		if ( (i%256)==0 ) {
+		  kernel_fpu_end();
+		  kernel_fpu_begin();
+		};
+#endif
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+                "  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	kernel_fpu_end();
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
+
+static unsigned long 
+__copy_user_zeroing_inatomic_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+        kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+                "2:  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+                "  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	kernel_fpu_end();
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
 
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
@@ -582,6 +793,36 @@
 	return n;
 }
 
+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_nocache(to, from, n);
+	return n;
+}
+
+unsigned long
+__copy_from_user_ll_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_inatomic_nocache(to, from, n);
+	return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.12.4.orig/include/asm-i386/uaccess.h linux-2.6.12.4.preempt/include/asm-i386/uaccess.h
--- linux-2.6.12.4.orig/include/asm-i386/uaccess.h	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/include/asm-i386/uaccess.h	2005-08-18 19:16:55.000000000 +0900
@@ -413,6 +413,10 @@
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
 				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_inatomic_nocache(void *to,
+				const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +506,55 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_inatomic_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
        might_sleep();
        return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+       might_sleep();
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/mm/filemap.c linux-2.6.12.4.preempt/mm/filemap.c
--- linux-2.6.12.4.orig/mm/filemap.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/mm/filemap.c	2005-08-16 10:16:06.000000000 +0900
@@ -1727,13 +1727,13 @@
 	int left;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
 		kunmap(page);
 	}
 	return bytes - left;
@@ -1750,7 +1750,7 @@
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;

-- 
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-18 23:29               ` Hiro Yoshioka
  2005-08-22  1:24                 ` Hiro Yoshioka
  2005-08-22  2:43                 ` Hiro Yoshioka
@ 2005-08-22 23:12                 ` Hiro Yoshioka
  2005-08-24 14:11                   ` Hiro Yoshioka
  2 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-22 23:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: hyoshiok

Hi,

It seems that this mail did not go out, so I am resending it.

> On 8/18/05, Hiro Yoshioka <hyoshiok@miraclelinux.com> wrote:
> > 1) using stack to save/restore MMX registers
> 
> It seems to me that it has some regression.
> I'd like to rollback it and use kernel_fpu_begin() and kernel_fpu_end().

The following is the current version of the cache-aware copy_from_user_ll.

1) using kernel_fpu_begin()/kernel_fpu_end()
2) low latency version of cache aware copy
3) __copy_user*_nocache APIs are added, so you can use them if you want to.
(There is no change in the current APIs.)

Some performance data:

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig    1921587
2.6.12.4.preempt 1634411
1634411/1921587=85.06% (15% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)
2.6.12.4.orig      57427
2.6.12.4.preempt   17398

samples  %
37408    65.1412  vmlinux                  __copy_from_user_ll
51        0.2931  vmlinux                  __copy_user_zeroing_inatomic_nocache
51/37408=0.136% (99.86% reduction)

Top 5 2.6.12.4.orig
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
287643   14.9692  vmlinux                  __copy_from_user_ll
72660     3.7813  vmlinux                  journal_add_journal_head
65011     3.3832  vmlinux                  do_get_write_access
50618     2.6342  vmlinux                  journal_put_journal_head
48068     2.5015  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08191743/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name                 symbol name
134756    7.9364  vmlinux                  __copy_from_user_ll
57735     3.4003  vmlinux                  journal_add_journal_head
50653     2.9832  vmlinux                  __find_get_block
44522     2.6221  vmlinux                  journal_put_journal_head
38928     2.2927  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08191741/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name                 symbol name
37408    65.1412  vmlinux                  __copy_from_user_ll
953       1.6595  vmlinux                  blk_rq_map_sg
886       1.5429  vmlinux                  sub_preempt_count
680       1.1841  vmlinux                  journal_add_journal_head
598       1.0413  vmlinux                  journal_commit_transaction
pattern9-0-cpu4-0-08191720/summary.out

Top 5 2.6.12.4.preempt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
123531    7.5582  vmlinux                  __copy_user_zeroing_inatomic_nocache
64820     3.9660  vmlinux                  journal_add_journal_head
60460     3.6992  vmlinux                  do_get_write_access
47172     2.8862  vmlinux                  journal_put_journal_head
46753     2.8606  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08190838/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name                 symbol name
126762    6.7993  vmlinux                  __copy_user_zeroing_inatomic_nocache
79803     4.2805  vmlinux                  journal_add_journal_head
70271     3.7692  vmlinux                  journal_dirty_metadata
66146     3.5480  vmlinux                  __find_get_block
58082     3.1154  vmlinux                  journal_put_journal_head
pattern9-0-cpu4-0-08190855/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name                 symbol name
901       5.1788  vmlinux                  blk_rq_map_sg
675       3.8798  vmlinux                  journal_commit_transaction
637       3.6613  vmlinux                  radix_tree_delete
605       3.4774  vmlinux                  journal_add_journal_head
580       3.3337  vmlinux                  release_pages
...
51        0.2931  vmlinux                  __copy_user_zeroing_inatomic_nocache
...
1         0.0057  vmlinux                  __copy_from_user_ll_inatomic_nocache
pattern9-0-cpu4-0-08190859/summary.out

2.6.12.4-usercopy.c.patch.050819
diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.preempt/Makefile
--- linux-2.6.12.4.orig/Makefile	2005-08-12 14:37:59.000000000 +0900
+++ linux-2.6.12.4.preempt/Makefile	2005-08-18 18:47:07.000000000 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.preempt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c	2005-08-19 08:25:08.000000000 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>
 
@@ -511,6 +512,216 @@
 		: "memory");						\
 } while (0)
 
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware                       */
+/* hyoshiok@miraclelinux.com               */
+static unsigned long 
+__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+	kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+                "2:  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+#ifdef CONFIG_PREEMPT
+		if ( (i%256)==0 ) {
+		  kernel_fpu_end();
+		  kernel_fpu_begin();
+		};
+#endif
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+                "  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	kernel_fpu_end();
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
+
+static unsigned long 
+__copy_user_zeroing_inatomic_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+        kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+                "2:  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+                "  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	kernel_fpu_end();
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
 
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
@@ -582,6 +793,36 @@
 	return n;
 }
 
+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_nocache(to, from, n);
+	return n;
+}
+
+unsigned long
+__copy_from_user_ll_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_inatomic_nocache(to, from, n);
+	return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.12.4.orig/include/asm-i386/uaccess.h linux-2.6.12.4.preempt/include/asm-i386/uaccess.h
--- linux-2.6.12.4.orig/include/asm-i386/uaccess.h	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/include/asm-i386/uaccess.h	2005-08-18 19:16:55.000000000 +0900
@@ -413,6 +413,10 @@
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
 				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_inatomic_nocache(void *to,
+				const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +506,55 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_inatomic_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
        might_sleep();
        return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+       might_sleep();
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/mm/filemap.c linux-2.6.12.4.preempt/mm/filemap.c
--- linux-2.6.12.4.orig/mm/filemap.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.preempt/mm/filemap.c	2005-08-16 10:16:06.000000000 +0900
@@ -1727,13 +1727,13 @@
 	int left;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
 		kunmap(page);
 	}
 	return bytes - left;
@@ -1750,7 +1750,7 @@
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;

--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-22 23:12                 ` Hiro Yoshioka
@ 2005-08-24 14:11                   ` Hiro Yoshioka
  2005-08-24 14:21                     ` Arjan van de Ven
  2005-08-24 16:22                     ` Hirokazu Takahashi
  0 siblings, 2 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-24 14:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: hyoshiok

Hi,

The following patch does not use MMX registers, so we don't have
to worry about saving and restoring the FPU/MMX state.

What do you think?

Some performance data:

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig    1921587
2.6.12.4.nt      1688900
1688900/1921587=87.89% (12.1% reduction)
 
BSQ_CACHE_REFERENCE (L3 cache miss)
2.6.12.4.orig      57427
2.6.12.4.nt        17122
17122/57427=29.81% (70.18% reduction)

L3 cache miss reduction of __copy_from_user_ll
samples  %
37408    65.1412  vmlinux                  __copy_from_user_ll
24        0.1402  vmlinux                  __copy_user_zeroing_intel_nocache
24/37408=0.064% (99.93% reduction)

> Top 5 2.6.12.4.orig
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> samples  %        app name                 symbol name
> 287643   14.9692  vmlinux                  __copy_from_user_ll
> 72660     3.7813  vmlinux                  journal_add_journal_head
> 65011     3.3832  vmlinux                  do_get_write_access
> 50618     2.6342  vmlinux                  journal_put_journal_head
> 48068     2.5015  vmlinux                  journal_dirty_metadata
> pattern9-0-cpu4-0-08191743/summary.out
> 
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
> samples  %        app name                 symbol name
> 134756    7.9364  vmlinux                  __copy_from_user_ll
> 57735     3.4003  vmlinux                  journal_add_journal_head
> 50653     2.9832  vmlinux                  __find_get_block
> 44522     2.6221  vmlinux                  journal_put_journal_head
> 38928     2.2927  vmlinux                  journal_dirty_metadata
> pattern9-0-cpu4-0-08191741/summary.out
> 
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
> samples  %        app name                 symbol name
> 37408    65.1412  vmlinux                  __copy_from_user_ll
> 953       1.6595  vmlinux                  blk_rq_map_sg
> 886       1.5429  vmlinux                  sub_preempt_count
> 680       1.1841  vmlinux                  journal_add_journal_head
> 598       1.0413  vmlinux                  journal_commit_transaction
> pattern9-0-cpu4-0-08191720/summary.out
> 

The following data is for the implementation that does not use the MMX registers.
Top 5 2.6.12.4.nt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
137744    8.1560  vmlinux                  __copy_user_zeroing_intel_nocache
68723     4.0692  vmlinux                  do_get_write_access
65808     3.8966  vmlinux                  journal_add_journal_head
50373     2.9826  vmlinux                  journal_dirty_metadata
49038     2.9036  vmlinux                  journal_put_journal_head
pattern9-0-cpu4-0-08242225/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name                 symbol name
62165     3.7913  vmlinux                  __copy_user_zeroing_intel_nocache
57862     3.5289  vmlinux                  journal_add_journal_head
54230     3.3073  vmlinux                  __find_get_block
48335     2.9478  vmlinux                  journal_put_journal_head
35737     2.1795  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-08242152/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name                 symbol name
867       5.0637  vmlinux                  blk_rq_map_sg
694       4.0533  vmlinux                  journal_add_journal_head
629       3.6736  vmlinux                  journal_commit_transaction
624       3.6444  vmlinux                  radix_tree_delete
525       3.0662  vmlinux                  release_pages
pattern9-0-cpu4-0-08242147/summary.out

The following is the data for the MMX version of the cache-aware implementation.

> Top 5 2.6.12.4.preempt
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> samples  %        app name                 symbol name
> 123531    7.5582  vmlinux                  __copy_user_zeroing_inatomic_nocache
> 64820     3.9660  vmlinux                  journal_add_journal_head
> 60460     3.6992  vmlinux                  do_get_write_access
> 47172     2.8862  vmlinux                  journal_put_journal_head
> 46753     2.8606  vmlinux                  journal_dirty_metadata
> pattern9-0-cpu4-0-08190838/summary.out
> 
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
> samples  %        app name                 symbol name
> 126762    6.7993  vmlinux                  __copy_user_zeroing_inatomic_nocache
> 79803     4.2805  vmlinux                  journal_add_journal_head
> 70271     3.7692  vmlinux                  journal_dirty_metadata
> 66146     3.5480  vmlinux                  __find_get_block
> 58082     3.1154  vmlinux                  journal_put_journal_head
> pattern9-0-cpu4-0-08190855/summary.out
> 
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
> samples  %        app name                 symbol name
> 901       5.1788  vmlinux                  blk_rq_map_sg
> 675       3.8798  vmlinux                  journal_commit_transaction
> 637       3.6613  vmlinux                  radix_tree_delete
> 605       3.4774  vmlinux                  journal_add_journal_head
> 580       3.3337  vmlinux                  release_pages
> ...
> 51        0.2931  vmlinux                  __copy_user_zeroing_inatomic_nocache
> ...
> 1         0.0057  vmlinux                  __copy_from_user_ll_inatomic_nocache
> pattern9-0-cpu4-0-08190859/summary.out

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nt/Makefile
--- linux-2.6.12.4.orig/Makefile	2005-08-12 14:37:59.000000000 +0900
+++ linux-2.6.12.4.nt/Makefile	2005-08-24 17:23:57.000000000 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.nt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.nt/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c	2005-08-24 21:38:47.000000000 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>
 
@@ -421,6 +422,106 @@
 		       : "eax", "edx", "memory");
 	return size;
 }
+
+/* Non Temporal Hint version of __copy_user_zeroing_intel */
+/* It is cache aware.                                     */
+/* hyoshiok@miraclelinux.com                              */
+static unsigned long 
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size)
+{
+        int d0, d1;
+
+	__asm__ __volatile__(
+		       "        .align 2,0x90\n"
+		       "0:      movl 32(%4), %%eax\n"
+		       "        cmpl $67, %0\n"      
+		       "        jbe 2f\n"            
+		       "1:      movl 64(%4), %%eax\n"
+		       "        .align 2,0x90\n"     
+		       "2:      movl 0(%4), %%eax\n" 
+		       "21:     movl 4(%4), %%edx\n" 
+		       "        movnti %%eax, 0(%3)\n" 
+		       "        movnti %%edx, 4(%3)\n" 
+		       "3:      movl 8(%4), %%eax\n" 
+		       "31:     movl 12(%4),%%edx\n" 
+		       "        movnti %%eax, 8(%3)\n" 
+		       "        movnti %%edx, 12(%3)\n"
+		       "4:      movl 16(%4), %%eax\n"
+		       "41:     movl 20(%4), %%edx\n"
+		       "        movnti %%eax, 16(%3)\n"
+		       "        movnti %%edx, 20(%3)\n"
+		       "10:     movl 24(%4), %%eax\n"
+		       "51:     movl 28(%4), %%edx\n"
+		       "        movnti %%eax, 24(%3)\n"
+		       "        movnti %%edx, 28(%3)\n"
+		       "11:     movl 32(%4), %%eax\n"
+		       "61:     movl 36(%4), %%edx\n"
+		       "        movnti %%eax, 32(%3)\n"
+		       "        movnti %%edx, 36(%3)\n"
+		       "12:     movl 40(%4), %%eax\n"
+		       "71:     movl 44(%4), %%edx\n"
+		       "        movnti %%eax, 40(%3)\n"
+		       "        movnti %%edx, 44(%3)\n"
+		       "13:     movl 48(%4), %%eax\n"
+		       "81:     movl 52(%4), %%edx\n"
+		       "        movnti %%eax, 48(%3)\n"
+		       "        movnti %%edx, 52(%3)\n"
+		       "14:     movl 56(%4), %%eax\n"
+		       "91:     movl 60(%4), %%edx\n"
+		       "        movnti %%eax, 56(%3)\n"
+		       "        movnti %%edx, 60(%3)\n"
+		       "        addl $-64, %0\n"     
+		       "        addl $64, %4\n"      
+		       "        addl $64, %3\n"      
+		       "        cmpl $63, %0\n"      
+		       "        ja  0b\n"            
+		       "5:      movl  %0, %%eax\n"   
+		       "        shrl  $2, %0\n"      
+		       "        andl $3, %%eax\n"    
+		       "        cld\n"               
+		       "6:      rep; movsl\n"   
+		       "        movl %%eax,%0\n"
+		       "7:      rep; movsb\n"	
+		       "8:\n"			
+		       ".section .fixup,\"ax\"\n"
+		       "9:      lea 0(%%eax,%0,4),%0\n"	
+		       "16:     pushl %0\n"	
+		       "        pushl %%eax\n"	
+		       "        xorl %%eax,%%eax\n"
+		       "        rep; stosb\n"	
+		       "        popl %%eax\n"	
+		       "        popl %0\n"	
+		       "        jmp 8b\n"	
+		       ".previous\n"		
+		       ".section __ex_table,\"a\"\n"
+		       "	.align 4\n"	   
+		       "	.long 0b,16b\n"	 
+		       "	.long 1b,16b\n"
+		       "	.long 2b,16b\n"
+		       "	.long 21b,16b\n"
+		       "	.long 3b,16b\n"	
+		       "	.long 31b,16b\n"
+		       "	.long 4b,16b\n"	
+		       "	.long 41b,16b\n"
+		       "	.long 10b,16b\n"
+		       "	.long 51b,16b\n"
+		       "	.long 11b,16b\n"
+		       "	.long 61b,16b\n"
+		       "	.long 12b,16b\n"
+		       "	.long 71b,16b\n"
+		       "	.long 13b,16b\n"
+		       "	.long 81b,16b\n"
+		       "	.long 14b,16b\n"
+		       "	.long 91b,16b\n"
+		       "	.long 6b,9b\n"	
+		       "        .long 7b,16b\n" 
+		       ".previous"		
+		       : "=&c"(size), "=&D" (d0), "=&S" (d1)
+		       :  "1"(to), "2"(from), "0"(size)
+		       : "eax", "edx", "memory");
+	return size;
+}
+
 #else
 /*
  * Leave these declared but undefined.  They should not be any references to
@@ -430,6 +531,8 @@
 __copy_user_zeroing_intel(void *to, const void __user *from, unsigned long size);
 unsigned long
 __copy_user_intel(void __user *to, const void *from, unsigned long size);
+unsigned long
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size);
 #endif /* CONFIG_X86_INTEL_USERCOPY */
 
 /* Generic arbitrary sized copy.  */
@@ -511,7 +614,6 @@
 		: "memory");						\
 } while (0)
 
-
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
 	BUG_ON((long) n < 0);
@@ -582,6 +684,21 @@
 	return n;
 }
 
+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+#ifdef CONFIG_X86_INTEL_USERCOPY
+	if ( n > 64)
+                n = __copy_user_zeroing_intel_nocache(to, from, n);
+	else
+		__copy_user_zeroing(to, from, n);
+#else
+        __copy_user_zeroing(to, from, n);
+#endif
+	return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.12.4.orig/include/asm-i386/uaccess.h linux-2.6.12.4.nt/include/asm-i386/uaccess.h
--- linux-2.6.12.4.orig/include/asm-i386/uaccess.h	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nt/include/asm-i386/uaccess.h	2005-08-24 18:18:57.000000000 +0900
@@ -413,6 +413,8 @@
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
 				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+				const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +504,40 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
        might_sleep();
        return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+       might_sleep();
+       return __copy_from_user_inatomic_nocache(to, from, n);
+}
+
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/mm/filemap.c linux-2.6.12.4.nt/mm/filemap.c
--- linux-2.6.12.4.orig/mm/filemap.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nt/mm/filemap.c	2005-08-16 10:16:06.000000000 +0900
@@ -1727,13 +1727,13 @@
 	int left;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
 		kunmap(page);
 	}
 	return bytes - left;
@@ -1750,7 +1750,7 @@
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;

Regards,
  Hiro
--
Hiro Yoshioka
CTO/Miracle Linux Corporation

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-24 14:11                   ` Hiro Yoshioka
@ 2005-08-24 14:21                     ` Arjan van de Ven
  2005-08-24 16:22                     ` Hirokazu Takahashi
  1 sibling, 0 replies; 63+ messages in thread
From: Arjan van de Ven @ 2005-08-24 14:21 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: linux-kernel

On Wed, 2005-08-24 at 23:11 +0900, Hiro Yoshioka wrote:
> Hi,
> 
> The following patch does not use MMX registers so that we don't have
> to worry about save/restore the FPU/MMX states.
> 
> What do you think?

excellent!



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-24 14:11                   ` Hiro Yoshioka
  2005-08-24 14:21                     ` Arjan van de Ven
@ 2005-08-24 16:22                     ` Hirokazu Takahashi
  2005-08-25  4:53                       ` Hiro Yoshioka
  1 sibling, 1 reply; 63+ messages in thread
From: Hirokazu Takahashi @ 2005-08-24 16:22 UTC (permalink / raw)
  To: hyoshiok; +Cc: linux-kernel

Hi,

> The following patch does not use MMX registers so that we don't have
> to worry about save/restore the FPU/MMX states.
> 
> What do you think?

I think __copy_user_zeroing_intel_nocache() should be followed by sfence
or mfence instruction to flush the data.
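
As a rough user-space illustration of that point (not the kernel code): the sketch below copies with non-temporal 32-bit stores through the SSE2 intrinsic _mm_stream_si32(), which compiles to movnti, and then issues _mm_sfence(). The function name is hypothetical, and it assumes an x86 target with SSE2 and a 4-byte-aligned destination.

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2: _mm_stream_si32(); also provides _mm_sfence() */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes using non-temporal stores for the 4-byte words, ordinary
 * stores for the 0-3 tail bytes, and a store fence at the end. */
static void copy_nocache_fenced(void *to, const void *from, size_t len)
{
	int *d = to;
	const unsigned char *s = from;
	size_t words = len / 4;
	size_t i;

	assert(((uintptr_t)to & 3) == 0);

	for (i = 0; i < words; i++) {
		int v;

		memcpy(&v, s + 4 * i, sizeof v);	/* ordinary load */
		_mm_stream_si32(&d[i], v);		/* movnti, bypasses the cache */
	}
	memcpy((unsigned char *)to + 4 * words, s + 4 * words, len & 3);

	/* Non-temporal stores are weakly ordered. The fence orders them
	 * before any store that follows. */
	_mm_sfence();
}
```

Without the fence, a later ordinary store (for example one that marks the data as ready) could become visible to another CPU before the streamed data does.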



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-24 16:22                     ` Hirokazu Takahashi
@ 2005-08-25  4:53                       ` Hiro Yoshioka
  0 siblings, 0 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-25  4:53 UTC (permalink / raw)
  To: taka; +Cc: linux-kernel, hyoshiok

From: Hirokazu Takahashi <taka@valinux.co.jp>
> > The following patch does not use MMX registers so that we don't have
> > to worry about save/restore the FPU/MMX states.
> > 
> > What do you think?
> 
> I think __copy_user_zeroing_intel_nocache() should be followed by sfence
> or mfence instruction to flush the data.

Thanks. I'll implement it.

Regards,
  Hiro

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-16 18:09 Chuck Ebbert
  2005-08-16 23:21 ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Chuck Ebbert @ 2005-08-16 18:09 UTC (permalink / raw)
  To: Hiro Yoshioka
  Cc: lkml.hyoshiok@gmail.com, taka@valinux.co.jp, Arjan van de Ven,
	linux-kernel

On Tue, 16 Aug 2005 at 19:16:17 +0900 (JST), Hiro Yoshioka wrote:

> oh, really? Does the linux kernel take care of
> SSE save/restore on a task switch?

 Check out XMMS_SAVE and XMMS_RESTORE in include/asm-i386/xor.h


__
Chuck

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16 18:09 Chuck Ebbert
@ 2005-08-16 23:21 ` Hiro Yoshioka
  2005-08-17  4:50   ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-16 23:21 UTC (permalink / raw)
  To: 76306.1226; +Cc: lkml.hyoshiok, taka, arjan, linux-kernel, hyoshiok

Chuck,

From: Chuck Ebbert <76306.1226@compuserve.com>
> On Tue, 16 Aug 2005 at 19:16:17 +0900 (JST), Hiro Yoshioka wrote:
> > oh, really? Does the linux kernel take care of
> > SSE save/restore on a task switch?
> 
>  Check out XMMS_SAVE and XMMS_RESTORE in include/asm-i386/xor.h

Thanks for your suggestion. But it seems to me it won't help
when we have a page fault or other exceptions.

Regards,
  Hiro

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-16 23:21 ` Hiro Yoshioka
@ 2005-08-17  4:50   ` Hiro Yoshioka
  0 siblings, 0 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-17  4:50 UTC (permalink / raw)
  To: 76306.1226; +Cc: lkml.hyoshiok, taka, arjan, linux-kernel, hyoshiok, hyoshiok

From: Hiro Yoshioka <hyoshiok@miraclelinux.com>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Wed, 17 Aug 2005 08:21:53 +0900 (JST)
Message-ID: <20050817.082153.719902707.hyoshiok@miraclelinux.com>

> Chuck,
> 
> From: Chuck Ebbert <76306.1226@compuserve.com>
> > On Tue, 16 Aug 2005 at 19:16:17 +0900 (JST), Hiro Yoshioka wrote:
> > > oh, really? Does the linux kernel take care of
> > > SSE save/restore on a task switch?
> > 
> >  Check out XMMS_SAVE and XMMS_RESTORE in include/asm-i386/xor.h
> 
> Thanks for your suggestion. But it seems to me it won't help
> when we have a page fault or other exceptions.

Hi,

Let me check my understanding of how the kernel saves/restores the
FPU/MMX/XMM registers. Please let me know if I'm wrong.

1) kernel_fpu_begin()
     preempt_disable()
     if TS_USEDFPU then
       __save_init_fpu()
        ... save to tsk->thread.i387.f*save
        clear TS_USEDFPU flag of tsk->thread_info->status
     else
        clts() --- clear TS flag of CR0

2) copy 
     MMX/XMM registers are used.

3) page faults/exceptions/...
3-1  TS flag is set by the CPU (Am I right?)
     if nobody uses MMX/XMM
3-2     it's fine. we don't need save/restore
     else
3-3     MMX/XMM is used

          When the TS flag is set, the CPU monitors the instruction
          stream for x87 FPU/MMX/SSE/SSE2 instructions. When the CPU
          detects one of these instructions, it raises a
          device-not-available exception (#NM) prior to executing the
          instruction. (IA-32 Software Developer's Manual, Vol. 3,
          12.5.1)

          math_state_restore() is the device-not-available exception
          handler:
             clts()
             if (!tsk_used_math(tsk))
                    init_fpu(tsk);
             restore_fpu(tsk);
             set TS_USEDFPU;

4) kernel_fpu_end()
     stts(); set TS flag of CR0
     preempt_enable();
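
The steps listed above can be sketched as a small user-space state
machine. This is only a toy model of steps 1, 3-3 and 4 as described in
this message, not kernel code: struct toy_cpu, struct toy_task and the
toy_* functions are hypothetical stand-ins, preemption is not modelled,
and the init_fpu() path for a task that never used the FPU is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model; every name below is a hypothetical stand-in. */
struct toy_task {
    bool used_fpu;            /* models TS_USEDFPU in thread_info->status */
    unsigned char saved[16];  /* models the tsk->thread.i387 save area */
};

struct toy_cpu {
    bool cr0_ts;              /* models the TS bit of CR0 */
    unsigned char regs[16];   /* models live FPU/MMX/XMM contents */
};

/* Step 1: save live user state if there is any, otherwise clear TS. */
static void toy_kernel_fpu_begin(struct toy_cpu *cpu, struct toy_task *tsk)
{
    assert(cpu != NULL && tsk != NULL);
    if (tsk->used_fpu) {
        memcpy(tsk->saved, cpu->regs, sizeof(tsk->saved));
        tsk->used_fpu = false;
    } else {
        cpu->cr0_ts = false;  /* clts() */
    }
}

/* Step 4: set TS so the next FPU instruction raises #NM. */
static void toy_kernel_fpu_end(struct toy_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->cr0_ts = true;       /* stts() */
}

/* Step 3-3: the #NM handler restores the saved state lazily. */
static void toy_math_state_restore(struct toy_cpu *cpu, struct toy_task *tsk)
{
    assert(cpu != NULL && tsk != NULL);
    cpu->cr0_ts = false;      /* clts() */
    memcpy(cpu->regs, tsk->saved, sizeof(cpu->regs));
    tsk->used_fpu = true;
}
```

In this model the user's register contents survive a kernel copy that
clobbers the registers, because they are restored on the first FPU use
after kernel_fpu_end().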

It seems to me that the kernel automatically saves/restores the
FPU/MMX/XMM registers.

What's wrong with it? Do I misunderstand it?

Regards,
  Hiro

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-17 15:19 Chuck Ebbert
  2005-08-18  9:45 ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Chuck Ebbert @ 2005-08-17 15:19 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: linux-kernel, arjan, taka, lkml.hyoshiok

On Wed, 17 Aug 2005 at 13:50:22 +0900 (JST), Hiro Yoshioka wrote:

> 3) page faults/exceptions/...
> 3-1  TS flag is set by the CPU (Am I right?)

  TS will _not_ be set if a trap/fault or interrupt occurs.  The only
way that could happen automatically would be to use a separate hardware
task with its own TSS to handle those.

  And since the kernel does not have any state information of its own
(no task_struct) any attempt to save the kernel-mode FPU state would
overwrite the current user-mode state anyway.

  Interrupt and fault handlers will not use FP instructions anyway.
The only thing you have to worry about is getting scheduled away
while your code is running, and I guess that's why you have to worry
about page faults.  And as Arjan pointed out, if you are doing
__copy_from_user_inatomic you cannot sleep (==switch to another task.)

  So I would try the code from include/asm-i386/xor.h, modify it to
save as many registers as you plan to use and see what happens.  It will
do all the right things. See the xor_sse_2() for how to save and restore
properly -- you will need to put your xmm_save area on the stack.
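
The save/restore bracket described here can be modelled in user space
roughly as follows. This is a sketch under loud assumptions: struct
toy_cpu stands in for the real CR0 and register file, all toy_* names are
hypothetical, and the preempt_disable()/preempt_enable() pair that the
real macros need is not modelled. The point is only the ordering: save
CR0, clear TS, save the registers to a stack area, use them, restore the
registers, restore CR0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NREGS  4      /* save only as many registers as are used */
#define TOY_CR0_TS 0x8u   /* TS is bit 3 of CR0 */

/* Hypothetical stand-in for the CPU state touched by the copy. */
struct toy_cpu {
    uint32_t cr0;
    uint64_t mm[TOY_NREGS];
};

/* Save area; the caller places it on its own stack. */
struct toy_save {
    uint32_t cr0;
    uint64_t mm[TOY_NREGS];
};

static void toy_regs_save(struct toy_cpu *cpu, struct toy_save *area)
{
    area->cr0 = cpu->cr0;                 /* movl %cr0, ... */
    cpu->cr0 &= ~TOY_CR0_TS;              /* clts */
    memcpy(area->mm, cpu->mm, sizeof(area->mm));
}

static void toy_regs_restore(struct toy_cpu *cpu, const struct toy_save *area)
{
    memcpy(cpu->mm, area->mm, sizeof(cpu->mm));
    cpu->cr0 = area->cr0;                 /* movl ..., %cr0 */
}

/* Copy n 64-bit words through the modelled registers. */
static void toy_copy_with_regs(struct toy_cpu *cpu, uint64_t *dst,
                               const uint64_t *src, size_t n)
{
    struct toy_save area;                 /* lives on the stack */

    assert(cpu != NULL && dst != NULL && src != NULL);
    toy_regs_save(cpu, &area);
    for (size_t i = 0; i < n; i++) {
        cpu->mm[i % TOY_NREGS] = src[i];  /* load into a register */
        dst[i] = cpu->mm[i % TOY_NREGS];  /* store from the register */
    }
    toy_regs_restore(cpu, &area);
}
```

Because the previous register contents and the previous CR0 value are put
back afterwards, the interrupted user state is untouched from the user's
point of view.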

__
Chuck

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-17 15:19 Chuck Ebbert
@ 2005-08-18  9:45 ` Hiro Yoshioka
  0 siblings, 0 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-18  9:45 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel, arjan, taka, Hiro Yoshioka, Akira Tsukamoto

Chuck,

On 8/18/05, Chuck Ebbert <76306.1226@compuserve.com> wrote:
> On Wed, 17 Aug 2005 at 13:50:22 +0900 (JST), Hiro Yoshioka wrote:
> 
> > 3) page faults/exceptions/...
> > 3-1  TS flag is set by the CPU (Am I right?)
> 
>   TS will _not_ be set if a trap/fault or interrupt occurs.  The only
> way that could happen automatically would be to use a separate hardware
> task with its own TSS to handle those.

OK.

>   And since the kernel does not have any state information of its own
> (no task_struct) any attempt to save the kernel-mode FPU state would
> overwrite the current user-mode state anyway.
> 
>   Interrupt and fault handlers will not use FP instructions anyway.
> The only thing you have to worry about is getting scheduled away
> while your code is running, and I guess that's why you have to worry
> about page faults.  And as Arjan pointed out, if you are doing
> __copy_from_user_inatomic you cannot sleep (==switch to another task.)
> 
>   So I would try the code from include/asm-i386/xor.h, modify it to
> save as many registers as you plan to use and see what happens.  It will
> do all the right things. See the xor_sse_2() for how to save and restore
> properly -- you will need to put your xmm_save area on the stack.

My hack is the following. I just changed from using kernel_fpu_begin()
and kernel_fpu_end() to saving the registers on the stack.

My test does not find any regressions.

--- usercopy.c.orig     2005-08-05 16:04:37.000000000 +0900
+++ usercopy.c  2005-08-18 16:53:37.000000000 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>

@@ -511,6 +512,144 @@
                : "memory");                                            \
 } while (0)

+#define MMX_SAVE do {                           \
+        preempt_disable();                      \
+        __asm__ __volatile__ (                  \
+                "movl %%cr0,%0          ;\n\t"  \
+                "clts                   ;\n\t"  \
+                "movq %%mm0,(%1)     ;\n\t"     \
+                "movq %%mm1,8(%1) ;\n\t"     \
+                "movq %%mm2,16(%1) ;\n\t"     \
+                "movq %%mm3,24(%1) ;\n\t"     \
+                : "=&r" (cr0)                   \
+                : "r" (mmx_save)                \
+                : "memory");                    \
+} while(0)
+
+#define MMX_RESTORE do {                       \
+        __asm__ __volatile__ (                  \
+                "sfence                 ;\n\t"  \
+                "movq (%1),%%mm0     ;\n\t"  \
+                "movq 8(%1),%%mm1 ;\n\t"  \
+                "movq 16(%1),%%mm2 ;\n\t"  \
+                "movq 24(%1),%%mm3 ;\n\t"  \
+                "movl   %0,%%cr0        ;\n\t"  \
+                :                               \
+                : "r" (cr0), "r" (mmx_save)     \
+                : "memory");                    \
+        preempt_enable();                       \
+} while(0)
+
+#define ALIGN8 __attribute__((aligned(8)))
+
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware                       */
+/* hyoshiok@miraclelinux.com               */
+static unsigned long
+__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+       void *p;
+       int i;
+        char mmx_save[8*4] ALIGN8;
+        int cr0;
+
+       if (unlikely(in_interrupt())){
+               __copy_user_zeroing(to, from, len);
+               return len;
+       }
+
+       p = to;
+       i = len >> 6; /* len/64 */
+
+       /*        kernel_fpu_begin();*/
+       MMX_SAVE;
+
+       __asm__ __volatile__ (
+               "1: prefetchnta (%0)\n"         /* This set is 28 bytes */
+               "   prefetchnta 64(%0)\n"
+               "   prefetchnta 128(%0)\n"
+               "   prefetchnta 192(%0)\n"
+               "   prefetchnta 256(%0)\n"
+               "2:  \n"
+               ".section .fixup, \"ax\"\n"
+               "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */
+               "   jmp 2b\n"
+               ".previous\n"
+               ".section __ex_table,\"a\"\n"
+               "       .align 4\n"
+               "       .long 1b, 3b\n"
+               ".previous"
+               : : "r" (from) );
+
+       for(; i>5; i--)
+       {
+               __asm__ __volatile__ (
+               "1:  prefetchnta 320(%0)\n"
+                "2:  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+               ".section .fixup, \"ax\"\n"
+               "3: movw $0x05EB, 1b\n" /* jmp on 5 bytes */
+               "   jmp 2b\n"
+               ".previous\n"
+               ".section __ex_table,\"a\"\n"
+               "       .align 4\n"
+               "       .long 1b, 3b\n"
+               ".previous"
+               : : "r" (from), "r" (to) : "memory");
+               from+=64;
+               to+=64;
+       }
+
+       for(; i>0; i--)
+       {
+               __asm__ __volatile__ (
+                "  movq (%0), %%mm0\n"
+                "  movq 8(%0), %%mm1\n"
+                "  movq 16(%0), %%mm2\n"
+                "  movq 24(%0), %%mm3\n"
+                "  movntq %%mm0, (%1)\n"
+                "  movntq %%mm1, 8(%1)\n"
+                "  movntq %%mm2, 16(%1)\n"
+                "  movntq %%mm3, 24(%1)\n"
+                "  movq 32(%0), %%mm0\n"
+                "  movq 40(%0), %%mm1\n"
+                "  movq 48(%0), %%mm2\n"
+                "  movq 56(%0), %%mm3\n"
+                "  movntq %%mm0, 32(%1)\n"
+                "  movntq %%mm1, 40(%1)\n"
+                "  movntq %%mm2, 48(%1)\n"
+                "  movntq %%mm3, 56(%1)\n"
+               : : "r" (from), "r" (to) : "memory");
+               from+=64;
+               to+=64;
+       }
+       /*
+        *      Now do the tail of the block
+        */
+       /*      kernel_fpu_end();*/
+       MMX_RESTORE;
+       if(i=(len&63))
+         __copy_user_zeroing(to, from, i);
+       return i;
+}
+

 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
@@ -582,6 +721,21 @@
        return n;
 }

+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+       BUG_ON((long)n < 0);
+        if (n < 512) {
+          if (movsl_is_ok(to, from, n))
+                __copy_user_zeroing(to, from, n);
+          else
+                n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+          n = __copy_user_zeroing_nocache(to, from, n);
+       return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.

-- 
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

[parent not found: <20050818.201138.607962419.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>]

[parent not found: <98df96d30508181629d85edb5@mail.gmail.com.suse.lists.linux.kernel>]

[parent not found: <20050823.081246.846946371.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>]

[parent not found: <20050824.231156.278740508.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>]

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
       [not found]     ` <20050824.231156.278740508.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
@ 2005-08-24 16:18       ` Andi Kleen
  2005-08-25  4:54         ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Andi Kleen @ 2005-08-24 16:18 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: linux-kernel

Hiro Yoshioka <hyoshiok@miraclelinux.com> writes:

> Hi,
> 
> The following patch does not use MMX registers so that we don't have
> to worry about saving/restoring the FPU/MMX state.
> 
> What do you think?

Performance will probably be bad on K7 Athlons - those have a microcoded
movnti which is quite slow.

Also BTW I don't see any code anywhere that tests the CPUID bits,
so your code will fail spectacularly on a PII that didn't do SSE
(intel user copy used to be enabled on those) 

One way to solve this might be to use different code using
alternative()
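
A user-space sketch of the feature check being asked for. Note that this
is not the kernel's alternative() mechanism, which patches instructions
at boot; it swaps in plain run-time dispatch through a function pointer.
cpu_has_sse2(), select_copy(), copy_dispatch() and the two copy_* stubs
are hypothetical names, and __builtin_cpu_supports() is the GCC/Clang
builtin, assumed available on x86 builds only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical stand-in for the ordinary cached copy. */
static void *copy_plain(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Hypothetical stand-in; a real version would use movnti + sfence. */
static void *copy_nocache(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Returns nonzero only when the CPU reports SSE2 (needed for movnti). */
static int cpu_has_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;   /* assumed: no SSE2 on other targets */
#endif
}

/* Pick the copy routine once, based on the CPUID feature bit. */
static copy_fn select_copy(void)
{
    return cpu_has_sse2() ? copy_nocache : copy_plain;
}

static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    copy_fn fn = select_copy();

    assert(fn != NULL);
    return fn(dst, src, n);
}
```

On a CPU without SSE2 the dispatcher never reaches the movnti path, so
the copy cannot fault on an unsupported instruction.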

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-24 16:18       ` Andi Kleen
@ 2005-08-25  4:54         ` Hiro Yoshioka
  2005-09-01  9:07           ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-25  4:54 UTC (permalink / raw)
  To: ak; +Cc: linux-kernel, hyoshiok

From: Andi Kleen <ak@suse.de>
> > Hi,
> > 
> > The following patch does not use MMX registers so that we don't have
> > to worry about saving/restoring the FPU/MMX state.
> > 
> > What do you think?
> 
> Performance will probably be bad on K7 Athlons - those have a microcoded
> movnti which is quite slow.
> 
> Also BTW I don't see any code anywhere that tests the CPUID bits,
> so your code will fail spectacularly on a PII that didn't do SSE
> (intel user copy used to be enabled on those) 
> 
> One way to solve this might be to use different code using
> alternative()
> 
> -Andi

Thanks for your comments. I'll consider it.

Regards,
  Hiro

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-08-25  4:54         ` Hiro Yoshioka
@ 2005-09-01  9:07           ` Hiro Yoshioka
  2005-09-01  9:36             ` Andi Kleen
  2005-09-02  4:29             ` Andrew Morton
  0 siblings, 2 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-09-01  9:07 UTC (permalink / raw)
  To: ak, akpm, torvalds; +Cc: linux-kernel, hyoshiok, hyoshiok

Hi,

> From: Andi Kleen <ak@suse.de>
> > > Hi,
> > > 
> > > The following patch does not use MMX registers so that we don't have
> > > to worry about saving/restoring the FPU/MMX state.
> > > 
> > > What do you think?
> > 
> > Performance will probably be bad on K7 Athlons - those have a microcoded
> > movnti which is quite slow.
> > 
> > Also BTW I don't see any code anywhere that tests the CPUID bits,
> > so your code will fail spectacularly on a PII that didn't do SSE
> > (intel user copy used to be enabled on those) 
> > 
> > One way to solve this might be to use different code using
> > alternative()
> > 
> > -Andi

The following is the almost final version of the
cache pollution aware __copy_from_user_ll() patch.

1) use the sfence instruction to serialize all
store-to-memory instructions.
2) check whether the CPU has the xmm2 extensions (movnti).

I think it is good enough to be considered for
the mainline.

What do you think?

Some performance data:

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig    1921587
2.6.12.4.nt      1599424
1599424/1921587=83.23% (16.77% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)
2.6.12.4.orig      57427
2.6.12.4.nt        20858
20858/57427=36.32% (63.68% reduction)

L3 cache miss reduction of __copy_from_user_ll
samples  %
37408    65.1412  vmlinux                  __copy_from_user_ll
23        0.1103  vmlinux                  __copy_user_zeroing_intel_nocache
23/37408=0.061% (99.94% reduction)
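
The reduction figures above follow from the sample counts as
(1 - after/before) * 100. A small sketch of that arithmetic, where
reduction_percent() is a hypothetical helper name:

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before`. */
static double reduction_percent(double before, double after)
{
    assert(before > 0.0);
    return (1.0 - after / before) * 100.0;
}
```

Feeding in the sample counts listed above (1921587 and 1599424, 57427 and
20858, 37408 and 23) gives roughly 16.8%, 63.7% and 99.9% respectively.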

Top 5 of 2.6.12.4.nt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
128392    8.0274  vmlinux                  __copy_user_zeroing_intel_nocache
64206     4.0143  vmlinux                  journal_add_journal_head
59746     3.7355  vmlinux                  do_get_write_access
47674     2.9807  vmlinux                  journal_put_journal_head
46021     2.8774  vmlinux                  journal_dirty_metadata
pattern9-0-cpu4-0-09011728/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name                 symbol name
69755     4.2861  vmlinux                  __copy_user_zeroing_intel_nocache
55685     3.4215  vmlinux                  journal_add_journal_head
52371     3.2179  vmlinux                  __find_get_block
45504     2.7960  vmlinux                  journal_put_journal_head
36005     2.2123  vmlinux                  journal_stop
pattern9-0-cpu4-0-09011744/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name                 symbol name
1147      5.4994  vmlinux                  journal_add_journal_head
881       4.2240  vmlinux                  journal_dirty_data
872       4.1809  vmlinux                  blk_rq_map_sg
734       3.5192  vmlinux                  journal_commit_transaction
617       2.9582  vmlinux                  radix_tree_delete
pattern9-0-cpu4-0-09011731/summary.out

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nt/Makefile
--- linux-2.6.12.4.orig/Makefile	2005-08-12 14:37:59.000000000 +0900
+++ linux-2.6.12.4.nt/Makefile	2005-08-24 17:23:57.000000000 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.nt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.nt/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c	2005-09-01 17:09:41.000000000 +0900
@@ -421,6 +421,107 @@
 		       : "eax", "edx", "memory");
 	return size;
 }
+
+/* Non Temporal Hint version of __copy_user_zeroing_intel */
+/* It is cache aware.                                     */
+/* hyoshiok@miraclelinux.com                              */
+static unsigned long 
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size)
+{
+        int d0, d1;
+
+	__asm__ __volatile__(
+		       "        .align 2,0x90\n"
+		       "0:      movl 32(%4), %%eax\n"
+		       "        cmpl $67, %0\n"      
+		       "        jbe 2f\n"            
+		       "1:      movl 64(%4), %%eax\n"
+		       "        .align 2,0x90\n"     
+		       "2:      movl 0(%4), %%eax\n" 
+		       "21:     movl 4(%4), %%edx\n" 
+		       "        movnti %%eax, 0(%3)\n" 
+		       "        movnti %%edx, 4(%3)\n" 
+		       "3:      movl 8(%4), %%eax\n" 
+		       "31:     movl 12(%4),%%edx\n" 
+		       "        movnti %%eax, 8(%3)\n" 
+		       "        movnti %%edx, 12(%3)\n"
+		       "4:      movl 16(%4), %%eax\n"
+		       "41:     movl 20(%4), %%edx\n"
+		       "        movnti %%eax, 16(%3)\n"
+		       "        movnti %%edx, 20(%3)\n"
+		       "10:     movl 24(%4), %%eax\n"
+		       "51:     movl 28(%4), %%edx\n"
+		       "        movnti %%eax, 24(%3)\n"
+		       "        movnti %%edx, 28(%3)\n"
+		       "11:     movl 32(%4), %%eax\n"
+		       "61:     movl 36(%4), %%edx\n"
+		       "        movnti %%eax, 32(%3)\n"
+		       "        movnti %%edx, 36(%3)\n"
+		       "12:     movl 40(%4), %%eax\n"
+		       "71:     movl 44(%4), %%edx\n"
+		       "        movnti %%eax, 40(%3)\n"
+		       "        movnti %%edx, 44(%3)\n"
+		       "13:     movl 48(%4), %%eax\n"
+		       "81:     movl 52(%4), %%edx\n"
+		       "        movnti %%eax, 48(%3)\n"
+		       "        movnti %%edx, 52(%3)\n"
+		       "14:     movl 56(%4), %%eax\n"
+		       "91:     movl 60(%4), %%edx\n"
+		       "        movnti %%eax, 56(%3)\n"
+		       "        movnti %%edx, 60(%3)\n"
+		       "        addl $-64, %0\n"     
+		       "        addl $64, %4\n"      
+		       "        addl $64, %3\n"      
+		       "        cmpl $63, %0\n"      
+		       "        ja  0b\n"            
+		       "        sfence \n"
+		       "5:      movl  %0, %%eax\n"   
+		       "        shrl  $2, %0\n"      
+		       "        andl $3, %%eax\n"    
+		       "        cld\n"               
+		       "6:      rep; movsl\n"   
+		       "        movl %%eax,%0\n"
+		       "7:      rep; movsb\n"	
+		       "8:\n"			
+		       ".section .fixup,\"ax\"\n"
+		       "9:      lea 0(%%eax,%0,4),%0\n"	
+		       "16:     pushl %0\n"	
+		       "        pushl %%eax\n"	
+		       "        xorl %%eax,%%eax\n"
+		       "        rep; stosb\n"	
+		       "        popl %%eax\n"	
+		       "        popl %0\n"	
+		       "        jmp 8b\n"	
+		       ".previous\n"		
+		       ".section __ex_table,\"a\"\n"
+		       "	.align 4\n"	   
+		       "	.long 0b,16b\n"	 
+		       "	.long 1b,16b\n"
+		       "	.long 2b,16b\n"
+		       "	.long 21b,16b\n"
+		       "	.long 3b,16b\n"	
+		       "	.long 31b,16b\n"
+		       "	.long 4b,16b\n"	
+		       "	.long 41b,16b\n"
+		       "	.long 10b,16b\n"
+		       "	.long 51b,16b\n"
+		       "	.long 11b,16b\n"
+		       "	.long 61b,16b\n"
+		       "	.long 12b,16b\n"
+		       "	.long 71b,16b\n"
+		       "	.long 13b,16b\n"
+		       "	.long 81b,16b\n"
+		       "	.long 14b,16b\n"
+		       "	.long 91b,16b\n"
+		       "	.long 6b,9b\n"	
+		       "        .long 7b,16b\n" 
+		       ".previous"		
+		       : "=&c"(size), "=&D" (d0), "=&S" (d1)
+		       :  "1"(to), "2"(from), "0"(size)
+		       : "eax", "edx", "memory");
+	return size;
+}
+
 #else
 /*
  * Leave these declared but undefined.  They should not be any references to
@@ -430,6 +531,8 @@
 __copy_user_zeroing_intel(void *to, const void __user *from, unsigned long size);
 unsigned long
 __copy_user_intel(void __user *to, const void *from, unsigned long size);
+unsigned long
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size);
 #endif /* CONFIG_X86_INTEL_USERCOPY */
 
 /* Generic arbitrary sized copy.  */
@@ -511,7 +614,6 @@
 		: "memory");						\
 } while (0)
 
-
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
 	BUG_ON((long) n < 0);
@@ -582,6 +684,21 @@
 	return n;
 }
 
+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+#ifdef CONFIG_X86_INTEL_USERCOPY
+	if ( n > 64 && cpu_has_xmm2)
+                n = __copy_user_zeroing_intel_nocache(to, from, n);
+	else
+		__copy_user_zeroing(to, from, n);
+#else
+        __copy_user_zeroing(to, from, n);
+#endif
+	return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.12.4.orig/include/asm-i386/uaccess.h linux-2.6.12.4.nt/include/asm-i386/uaccess.h
--- linux-2.6.12.4.orig/include/asm-i386/uaccess.h	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nt/include/asm-i386/uaccess.h	2005-08-24 18:18:57.000000000 +0900
@@ -413,6 +413,8 @@
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
 				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+				const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +504,40 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
        might_sleep();
        return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+       might_sleep();
+       return __copy_from_user_inatomic_nocache(to, from, n);
+}
+
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.12.4.orig/mm/filemap.c linux-2.6.12.4.nt/mm/filemap.c
--- linux-2.6.12.4.orig/mm/filemap.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4.nt/mm/filemap.c	2005-08-16 10:16:06.000000000 +0900
@@ -1727,13 +1727,13 @@
 	int left;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
 		kunmap(page);
 	}
 	return bytes - left;
@@ -1750,7 +1750,7 @@
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;


Regards,
  Hiro

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-01  9:07           ` Hiro Yoshioka
@ 2005-09-01  9:36             ` Andi Kleen
  2005-09-02  1:43               ` Hiro Yoshioka
  2005-09-02  4:29             ` Andrew Morton
  1 sibling, 1 reply; 63+ messages in thread
From: Andi Kleen @ 2005-09-01  9:36 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: akpm, torvalds, linux-kernel

On Thursday 01 September 2005 11:07, Hiro Yoshioka wrote:

> The following is the almost final version of the
> cache pollution aware __copy_from_user_ll() patch.

Looks good to me.

Once the filemap.c hunk is in I'll probably do something
similar for x86-64.

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-01  9:36             ` Andi Kleen
@ 2005-09-02  1:43               ` Hiro Yoshioka
  2005-09-02  2:06                 ` Andi Kleen
  2005-09-02  2:08                 ` Andrew Morton
  0 siblings, 2 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-09-02  1:43 UTC (permalink / raw)
  To: ak; +Cc: akpm, torvalds, linux-kernel, hyoshiok

From: Andi Kleen <ak@suse.de>
> On Thursday 01 September 2005 11:07, Hiro Yoshioka wrote:
> 
> > The following is the almost final version of the
> > cache pollution aware __copy_from_user_ll() patch.
> 
> Looks good to me.
> 
> Once the filemap.c hunk is in I'll probably do something
> similar for x86-64.

Thank you very much. What else should I do? Shall I just
wait for the patch to be checked in?

Regards,
  Hiro

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-02  1:43               ` Hiro Yoshioka
@ 2005-09-02  2:06                 ` Andi Kleen
  2005-09-02  2:08                 ` Andrew Morton
  1 sibling, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2005-09-02  2:06 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: akpm, torvalds, linux-kernel

On Friday 02 September 2005 03:43, Hiro Yoshioka wrote:
> From: Andi Kleen <ak@suse.de>
>
> > On Thursday 01 September 2005 11:07, Hiro Yoshioka wrote:
> > > The following is the almost final version of the
> > > cache pollution aware __copy_from_user_ll() patch.
> >
> > Looks good to me.
> >
> > Once the filemap.c hunk is in I'll probably do something
> > similar for x86-64.
>
> Thank you very much. What else should I do? Shall I just
> wait for the patch to be checked in?

I suppose Andrew will take care of it, unless someone else
objects.

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-02  1:43               ` Hiro Yoshioka
  2005-09-02  2:06                 ` Andi Kleen
@ 2005-09-02  2:08                 ` Andrew Morton
  2005-09-02  2:17                   ` Andi Kleen
  1 sibling, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2005-09-02  2:08 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: ak, torvalds, linux-kernel, hyoshiok

Hiro Yoshioka <hyoshiok@miraclelinux.com> wrote:
>
> From: Andi Kleen <ak@suse.de>
> > On Thursday 01 September 2005 11:07, Hiro Yoshioka wrote:
> > 
> > > The following is the almost final version of the
> > > cache pollution aware __copy_from_user_ll() patch.
> > 
> > Looks good to me.
> > 
> > Once the filemap.c hunk is in I'll probably do something
> > similar for x86-64.
> 
> Thank you very much. What else should I do? Shall I just
> wait for the patch to be checked in?
> 

I suppose I'll queue it up in -mm for a while, although I'm a bit dubious
about the whole idea...  We'll gain some and we'll lose some - how do we
know it's a net gain?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-02  2:08                 ` Andrew Morton
@ 2005-09-02  2:17                   ` Andi Kleen
  2005-09-02  2:28                     ` Andrew Morton
  0 siblings, 1 reply; 63+ messages in thread
From: Andi Kleen @ 2005-09-02  2:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hiro Yoshioka, torvalds, linux-kernel

On Friday 02 September 2005 04:08, Andrew Morton wrote:

> I suppose I'll queue it up in -mm for a while, although I'm a bit dubious
> about the whole idea...  We'll gain some and we'll lose some - how do we
> know it's a net gain?

I suspect it'll gain more than it loses. The only case where it might 
not gain is someone reading the data from the page cache again immediately
after the write. But I suppose that's far less frequent than writing the data.

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-02  2:17                   ` Andi Kleen
@ 2005-09-02  2:28                     ` Andrew Morton
  2005-09-02  3:41                       ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2005-09-02  2:28 UTC (permalink / raw)
  To: Andi Kleen; +Cc: hyoshiok, torvalds, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> On Friday 02 September 2005 04:08, Andrew Morton wrote:
> 
> > I suppose I'll queue it up in -mm for a while, although I'm a bit dubious
> > about the whole idea...  We'll gain some and we'll lose some - how do we
> > know it's a net gain?
> 
> I suspect it'll gain more than it loses. The only case where it might
> not gain is someone reading the data from the page cache again immediately
> after the write.

That's a pretty common case - temporary files.

> But I suppose that's far less frequent than writing the data.

yup.

Hiro, could you please send through a summary of the performance testing
results sometime?  Runtimes rather than oprofile output?


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-02  2:28                     ` Andrew Morton
@ 2005-09-02  3:41                       ` Hiro Yoshioka
  0 siblings, 0 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-09-02  3:41 UTC (permalink / raw)
  To: akpm; +Cc: ak, torvalds, linux-kernel, hyoshiok

Andrew,

From: Andrew Morton <akpm@osdl.org>
> Andi Kleen <ak@suse.de> wrote:
> >
> > On Friday 02 September 2005 04:08, Andrew Morton wrote:
> > 
> > > I suppose I'll queue it up in -mm for a while, although I'm a bit dubious
> > > about the whole idea...  We'll gain some and we'll lose some - how do we
> > > know it's a net gain?
> > 
> > I suspect it'll gain more than it loses. The only case where it might
> > not gain is someone reading the data from the page cache again immediately
> > after the write.
> 
> That's a pretty common case - temporary files.
> 
> > But I suppose that's far less frequent than writing the data.
> 
> yup.
> 
> Hiro, could you please send through a summary of the performance testing
> results sometime?  Runtimes rather than oprofile output?

iozone results are

original 2.6.12.4 CPU time = 207.768 sec
cache aware       CPU time = 184.783 sec
(each figure is the total over three runs)
184.783/207.768 = 88.94% (11.06% reduction)

original:
pattern9-0-cpu4-0-08191720/iozone.out:  CPU Utilization: Wall time   45.997    CPU time   64.527    CPU utilization 140.28 %
pattern9-0-cpu4-0-08191741/iozone.out:  CPU Utilization: Wall time   46.878    CPU time   71.933    CPU utilization 153.45 %
pattern9-0-cpu4-0-08191743/iozone.out:  CPU Utilization: Wall time   45.152    CPU time   71.308    CPU utilization 157.93 %

cache aware:
pattern9-0-cpu4-0-09011728/iozone.out:  CPU Utilization: Wall time   44.842    CPU time   62.465    CPU utilization 139.30 %
pattern9-0-cpu4-0-09011731/iozone.out:  CPU Utilization: Wall time   44.718    CPU time   59.273    CPU utilization 132.55 %
pattern9-0-cpu4-0-09011744/iozone.out:  CPU Utilization: Wall time   44.367    CPU time   63.045    CPU utilization 142.10 %

Regards,
  Hiro


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-01  9:07           ` Hiro Yoshioka
  2005-09-01  9:36             ` Andi Kleen
@ 2005-09-02  4:29             ` Andrew Morton
  2005-09-02  4:37               ` Hiro Yoshioka
  1 sibling, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2005-09-02  4:29 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: ak, torvalds, linux-kernel, hyoshiok

Hiro Yoshioka <hyoshiok@miraclelinux.com> wrote:
>
> --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
>  +++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c	2005-09-01 17:09:41.000000000 +0900

Really.  Please redo and retest the patch against a current kernel.


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-02  4:29             ` Andrew Morton
@ 2005-09-02  4:37               ` Hiro Yoshioka
  2005-09-03 11:59                 ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-09-02  4:37 UTC (permalink / raw)
  To: akpm; +Cc: ak, torvalds, linux-kernel, hyoshiok

From: Andrew Morton <akpm@osdl.org>
> Hiro Yoshioka <hyoshiok@miraclelinux.com> wrote:
> >
> > --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
> >  +++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c	2005-09-01 17:09:41.000000000 +0900
> 
> Really.  Please redo and retest the patch against a current kernel.

Does it mean 2.6.13? I'll do it. 

Regards,
  Hiro


* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
  2005-09-02  4:37               ` Hiro Yoshioka
@ 2005-09-03 11:59                 ` Hiro Yoshioka
  0 siblings, 0 replies; 63+ messages in thread
From: Hiro Yoshioka @ 2005-09-03 11:59 UTC (permalink / raw)
  To: akpm; +Cc: ak, torvalds, linux-kernel, hyoshiok, hyoshiok

From: Hiro Yoshioka <hyoshiok@miraclelinux.com>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Fri, 02 Sep 2005 13:37:16 +0900 (JST)
Message-ID: <20050902.133716.610538020.hyoshiok@miraclelinux.com>

> From: Andrew Morton <akpm@osdl.org>
> > Hiro Yoshioka <hyoshiok@miraclelinux.com> wrote:
> > >
> > > --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
> > >  +++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c	2005-09-01 17:09:41.000000000 +0900
> > 
> > Really.  Please redo and retest the patch against a current kernel.
> 
> Does it mean 2.6.13? I'll do it. 
> 
> Regards,
>   Hiro

Hi,

The following is the patch against 2.6.13

Hiro

diff -ur linux-2.6.13/Makefile linux-2.6.13.nt/Makefile
--- linux-2.6.13/Makefile	2005-08-29 08:41:01.000000000 +0900
+++ linux-2.6.13.nt/Makefile	2005-09-03 14:11:27.000000000 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 13
-EXTRAVERSION =
+EXTRAVERSION = .nt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.13/arch/i386/lib/usercopy.c linux-2.6.13.nt/arch/i386/lib/usercopy.c
--- linux-2.6.13/arch/i386/lib/usercopy.c	2005-08-29 08:41:01.000000000 +0900
+++ linux-2.6.13.nt/arch/i386/lib/usercopy.c	2005-09-03 14:09:18.000000000 +0900
@@ -425,6 +425,107 @@
 		       : "eax", "edx", "memory");
 	return size;
 }
+
+/* Non Temporal Hint version of __copy_user_zeroing_intel */
+/* It is cache aware.                                     */
+/* hyoshiok@miraclelinux.com                              */
+static unsigned long 
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size)
+{
+        int d0, d1;
+
+	__asm__ __volatile__(
+		       "        .align 2,0x90\n"
+		       "0:      movl 32(%4), %%eax\n"
+		       "        cmpl $67, %0\n"      
+		       "        jbe 2f\n"            
+		       "1:      movl 64(%4), %%eax\n"
+		       "        .align 2,0x90\n"     
+		       "2:      movl 0(%4), %%eax\n" 
+		       "21:     movl 4(%4), %%edx\n" 
+		       "        movnti %%eax, 0(%3)\n" 
+		       "        movnti %%edx, 4(%3)\n" 
+		       "3:      movl 8(%4), %%eax\n" 
+		       "31:     movl 12(%4),%%edx\n" 
+		       "        movnti %%eax, 8(%3)\n" 
+		       "        movnti %%edx, 12(%3)\n"
+		       "4:      movl 16(%4), %%eax\n"
+		       "41:     movl 20(%4), %%edx\n"
+		       "        movnti %%eax, 16(%3)\n"
+		       "        movnti %%edx, 20(%3)\n"
+		       "10:     movl 24(%4), %%eax\n"
+		       "51:     movl 28(%4), %%edx\n"
+		       "        movnti %%eax, 24(%3)\n"
+		       "        movnti %%edx, 28(%3)\n"
+		       "11:     movl 32(%4), %%eax\n"
+		       "61:     movl 36(%4), %%edx\n"
+		       "        movnti %%eax, 32(%3)\n"
+		       "        movnti %%edx, 36(%3)\n"
+		       "12:     movl 40(%4), %%eax\n"
+		       "71:     movl 44(%4), %%edx\n"
+		       "        movnti %%eax, 40(%3)\n"
+		       "        movnti %%edx, 44(%3)\n"
+		       "13:     movl 48(%4), %%eax\n"
+		       "81:     movl 52(%4), %%edx\n"
+		       "        movnti %%eax, 48(%3)\n"
+		       "        movnti %%edx, 52(%3)\n"
+		       "14:     movl 56(%4), %%eax\n"
+		       "91:     movl 60(%4), %%edx\n"
+		       "        movnti %%eax, 56(%3)\n"
+		       "        movnti %%edx, 60(%3)\n"
+		       "        addl $-64, %0\n"     
+		       "        addl $64, %4\n"      
+		       "        addl $64, %3\n"      
+		       "        cmpl $63, %0\n"      
+		       "        ja  0b\n"            
+		       "        sfence \n"
+		       "5:      movl  %0, %%eax\n"   
+		       "        shrl  $2, %0\n"      
+		       "        andl $3, %%eax\n"    
+		       "        cld\n"               
+		       "6:      rep; movsl\n"   
+		       "        movl %%eax,%0\n"
+		       "7:      rep; movsb\n"	
+		       "8:\n"			
+		       ".section .fixup,\"ax\"\n"
+		       "9:      lea 0(%%eax,%0,4),%0\n"	
+		       "16:     pushl %0\n"	
+		       "        pushl %%eax\n"	
+		       "        xorl %%eax,%%eax\n"
+		       "        rep; stosb\n"	
+		       "        popl %%eax\n"	
+		       "        popl %0\n"	
+		       "        jmp 8b\n"	
+		       ".previous\n"		
+		       ".section __ex_table,\"a\"\n"
+		       "	.align 4\n"	   
+		       "	.long 0b,16b\n"	 
+		       "	.long 1b,16b\n"
+		       "	.long 2b,16b\n"
+		       "	.long 21b,16b\n"
+		       "	.long 3b,16b\n"	
+		       "	.long 31b,16b\n"
+		       "	.long 4b,16b\n"	
+		       "	.long 41b,16b\n"
+		       "	.long 10b,16b\n"
+		       "	.long 51b,16b\n"
+		       "	.long 11b,16b\n"
+		       "	.long 61b,16b\n"
+		       "	.long 12b,16b\n"
+		       "	.long 71b,16b\n"
+		       "	.long 13b,16b\n"
+		       "	.long 81b,16b\n"
+		       "	.long 14b,16b\n"
+		       "	.long 91b,16b\n"
+		       "	.long 6b,9b\n"	
+		       "        .long 7b,16b\n" 
+		       ".previous"		
+		       : "=&c"(size), "=&D" (d0), "=&S" (d1)
+		       :  "1"(to), "2"(from), "0"(size)
+		       : "eax", "edx", "memory");
+	return size;
+}
+
 #else
 /*
  * Leave these declared but undefined.  They should not be any references to
@@ -434,6 +535,8 @@
 __copy_user_zeroing_intel(void *to, const void __user *from, unsigned long size);
 unsigned long
 __copy_user_intel(void __user *to, const void *from, unsigned long size);
+unsigned long
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size);
 #endif /* CONFIG_X86_INTEL_USERCOPY */
 
 /* Generic arbitrary sized copy.  */
@@ -515,7 +618,6 @@
 		: "memory");						\
 } while (0)
 
-
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
 	BUG_ON((long) n < 0);
@@ -588,6 +690,21 @@
 }
 EXPORT_SYMBOL(__copy_from_user_ll);
 
+unsigned long
+__copy_from_user_ll_nocache(void *to, const void __user *from, unsigned long n)
+{
+	BUG_ON((long)n < 0);
+#ifdef CONFIG_X86_INTEL_USERCOPY
+	if (n > 64 && cpu_has_xmm2)
+		n = __copy_user_zeroing_intel_nocache(to, from, n);
+	else
+		__copy_user_zeroing(to, from, n);
+#else
+	__copy_user_zeroing(to, from, n);
+#endif
+	return n;
+}
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
diff -ur linux-2.6.13/include/asm-i386/uaccess.h linux-2.6.13.nt/include/asm-i386/uaccess.h
--- linux-2.6.13/include/asm-i386/uaccess.h	2005-08-29 08:41:01.000000000 +0900
+++ linux-2.6.13.nt/include/asm-i386/uaccess.h	2005-09-03 14:09:18.000000000 +0900
@@ -413,6 +413,8 @@
 				const void *from, unsigned long n);
 unsigned long __must_check __copy_from_user_ll(void *to,
 				const void __user *from, unsigned long n);
+unsigned long __must_check __copy_from_user_ll_nocache(void *to,
+				const void __user *from, unsigned long n);
 
 /*
  * Here we special-case 1, 2 and 4-byte copy_*_user invocations.  On a fault
@@ -502,11 +504,40 @@
 }
 
 static inline unsigned long
+__copy_from_user_inatomic_nocache(void *to, const void __user *from, unsigned long n)
+{
+	if (__builtin_constant_p(n)) {
+		unsigned long ret;
+
+		switch (n) {
+		case 1:
+			__get_user_size(*(u8 *)to, from, 1, ret, 1);
+			return ret;
+		case 2:
+			__get_user_size(*(u16 *)to, from, 2, ret, 2);
+			return ret;
+		case 4:
+			__get_user_size(*(u32 *)to, from, 4, ret, 4);
+			return ret;
+		}
+	}
+	return __copy_from_user_ll_nocache(to, from, n);
+}
+
+static inline unsigned long
 __copy_from_user(void *to, const void __user *from, unsigned long n)
 {
        might_sleep();
        return __copy_from_user_inatomic(to, from, n);
 }
+
+static inline unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+       might_sleep();
+       return __copy_from_user_inatomic_nocache(to, from, n);
+}
+
 unsigned long __must_check copy_to_user(void __user *to,
 				const void *from, unsigned long n);
 unsigned long __must_check copy_from_user(void *to,
diff -ur linux-2.6.13/mm/filemap.c linux-2.6.13.nt/mm/filemap.c
--- linux-2.6.13/mm/filemap.c	2005-08-29 08:41:01.000000000 +0900
+++ linux-2.6.13.nt/mm/filemap.c	2005-09-03 14:09:18.000000000 +0900
@@ -1726,7 +1726,7 @@
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;
diff -ur linux-2.6.13/mm/filemap.h linux-2.6.13.nt/mm/filemap.h
--- linux-2.6.13/mm/filemap.h	2005-08-29 08:41:01.000000000 +0900
+++ linux-2.6.13.nt/mm/filemap.h	2005-09-03 16:47:39.000000000 +0900
@@ -34,13 +34,13 @@
 	int left;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
 		kunmap(page);
 	}
 	return bytes - left;


Thread overview: 63+ messages
2005-08-14  9:16 [RFC] [PATCH] cache pollution aware __copy_from_user_ll() Hiro Yoshioka
2005-08-14  9:41 ` Arjan van de Ven
2005-08-14 10:22   ` Hiro Yoshioka
2005-08-14 10:35     ` Arjan van de Ven
2005-08-14 10:45       ` Christoph Hellwig
2005-08-15  6:43       ` Hiro Yoshioka
2005-08-15  7:16         ` Arjan van de Ven
2005-08-15  8:44           ` Hiro Yoshioka
2005-08-15  8:53             ` Arjan van de Ven
2005-08-15 23:33               ` Hiro Yoshioka
2005-08-16  3:30                 ` Hiro Yoshioka
2005-08-16  4:17                   ` Hirokazu Takahashi
2005-08-16  4:54                     ` Hiro Yoshioka
2005-08-16  5:43                       ` Arjan van de Ven
2005-08-16 10:16                         ` Hiro Yoshioka
2005-08-16 10:19                           ` Hirokazu Takahashi
2005-08-16 10:25                           ` Arjan van de Ven
2005-08-16 10:24                             ` Hirokazu Takahashi
2005-08-16  5:44                     ` Arjan van de Ven
2005-08-16  5:49                   ` Arjan van de Ven
     [not found]                     ` <20050817.110503.97359275.taka@valinux.co.jp>
2005-08-17  5:10                       ` Hiro Yoshioka
2005-08-17 14:30                         ` Akira Tsukamoto
2005-08-17 15:27                           ` Akira Tsukamoto
2005-08-18 17:53                             ` Lee Revell
2005-08-18  2:37                           ` Akira Tsukamoto
  -- strict thread matches above, loose matches on Subject: below --
2005-08-14 21:24 Ian Kumlien
2005-08-15  7:21 ` Arjan van de Ven
2005-08-15 14:49   ` Ian Kumlien
2005-08-15 12:15 linux
2005-08-15 12:25 ` Arjan van de Ven
     [not found] <20050815121555.29159.qmail@science.horizon.com.suse.lists.linux.kernel>
     [not found] ` <1124108702.3228.33.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>
2005-08-15 15:02   ` Andi Kleen
2005-08-15 15:09     ` Arjan van de Ven
2005-08-15 15:13       ` Andi Kleen
     [not found] <20050816.131729.15816429.taka@valinux.co.jp.suse.lists.linux.kernel>
     [not found] ` <20050816.135425.719901536.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
     [not found]   ` <1124171015.3215.0.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>
     [not found]     ` <20050816.191617.1025215458.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
     [not found]       ` <1124187950.3215.31.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>
2005-08-16 13:15         ` Andi Kleen
2005-08-18 11:06           ` Hiro Yoshioka
2005-08-18 11:11             ` Hiro Yoshioka
2005-08-18 23:29               ` Hiro Yoshioka
2005-08-22  1:24                 ` Hiro Yoshioka
2005-08-22 13:07                   ` Andi Kleen
2005-08-22  2:43                 ` Hiro Yoshioka
2005-08-22 23:12                 ` Hiro Yoshioka
2005-08-24 14:11                   ` Hiro Yoshioka
2005-08-24 14:21                     ` Arjan van de Ven
2005-08-24 16:22                     ` Hirokazu Takahashi
2005-08-25  4:53                       ` Hiro Yoshioka
2005-08-16 18:09 Chuck Ebbert
2005-08-16 23:21 ` Hiro Yoshioka
2005-08-17  4:50   ` Hiro Yoshioka
2005-08-17 15:19 Chuck Ebbert
2005-08-18  9:45 ` Hiro Yoshioka
     [not found] <20050818.201138.607962419.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
     [not found] ` <98df96d30508181629d85edb5@mail.gmail.com.suse.lists.linux.kernel>
     [not found]   ` <20050823.081246.846946371.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
     [not found]     ` <20050824.231156.278740508.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
2005-08-24 16:18       ` Andi Kleen
2005-08-25  4:54         ` Hiro Yoshioka
2005-09-01  9:07           ` Hiro Yoshioka
2005-09-01  9:36             ` Andi Kleen
2005-09-02  1:43               ` Hiro Yoshioka
2005-09-02  2:06                 ` Andi Kleen
2005-09-02  2:08                 ` Andrew Morton
2005-09-02  2:17                   ` Andi Kleen
2005-09-02  2:28                     ` Andrew Morton
2005-09-02  3:41                       ` Hiro Yoshioka
2005-09-02  4:29             ` Andrew Morton
2005-09-02  4:37               ` Hiro Yoshioka
2005-09-03 11:59                 ` Hiro Yoshioka
