public inbox for linux-kernel@vger.kernel.org
* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-14 21:24 Ian Kumlien
  2005-08-15  7:21 ` Arjan van de Ven
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Kumlien @ 2005-08-14 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: hch, arian, lkml.hyoshiok

[-- Attachment #1: Type: text/plain, Size: 846 bytes --]

Hi, all

I might be misunderstanding things, but...

First of all, machines with long pipelines suffer badly from cache misses
(the P4 in this case).

Depending on the size copied (I don't know how large the copies are, so...),
can't one run out of cache lines and/or evict more useful cache data?

I.e., if the buffer is cached from beginning to end, we generally only need
'some of' the beginning; the CPU's prefetch should manage the rest.

I might, as I said, not know all about things like this, and I'm also
suffering from a fever, but I still find Hiro's data interesting.

Isn't there some way to run the same test for the same length of time and
measure the differences in overall data, to see if we really are punished
as badly when accessing the data post-copy? (Could it be size-dependent?)

-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread
[parent not found: <20050818.201138.607962419.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>]
* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-17 15:19 Chuck Ebbert
  2005-08-18  9:45 ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Chuck Ebbert @ 2005-08-17 15:19 UTC (permalink / raw)
  To: Hiro Yoshioka; +Cc: linux-kernel, arjan, taka, lkml.hyoshiok

On Wed, 17 Aug 2005 at 13:50:22 +0900 (JST), Hiro Yoshioka wrote:

> 3) page faults/exceptions/...
> 3-1  TS flag is set by the CPU (Am I right?)

  TS will _not_ be set if a trap/fault or interrupt occurs.  The only
way that could happen automatically would be to use a separate hardware
task with its own TSS to handle those.

  And since the kernel does not have any state information of its own
(no task_struct) any attempt to save the kernel-mode FPU state would
overwrite the current user-mode state anyway.

  Interrupt and fault handlers will not use FP instructions anyway.
The only thing you have to worry about is getting scheduled away
while your code is running, and I guess that's why you have to worry
about page faults.  And as Arjan pointed out, if you are doing
__copy_from_user_inatomic you cannot sleep (==switch to another task.)

  So I would try the code from include/asm-i386/xor.h, modify it to
save as many registers as you plan to use, and see what happens.  It will
do all the right things.  See xor_sse_2() for how to save and restore
properly -- you will need to put your xmm_save area on the stack.
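To make the save/restore pattern concrete, here is a hedged userspace sketch in the spirit of XMMS_SAVE/XMMS_RESTORE from include/asm-i386/xor.h. The function name xmm_save_restore_demo is illustrative, not the kernel's; the real kernel macros additionally disable preemption and handle CR0.TS, which userspace cannot and need not do. Assumes an x86 target with SSE:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace analogue of the xor.h save/restore idea:
 * spill xmm0 to a stack buffer, clobber it, restore it, and verify
 * that the original contents survived the round trip. */
static int xmm_save_restore_demo(void)
{
	char xmm_save[16] __attribute__((aligned(16)));
	int32_t in[4]  __attribute__((aligned(16))) = {1, 2, 3, 4};
	int32_t out[4] __attribute__((aligned(16))) = {0, 0, 0, 0};

	__asm__ __volatile__(
		"movaps (%1), %%xmm0\n\t"   /* put known data in xmm0     */
		"movups %%xmm0, (%0)\n\t"   /* save it, as XMMS_SAVE does */
		"xorps  %%xmm0, %%xmm0\n\t" /* ... clobber the register   */
		"movups (%0), %%xmm0\n\t"   /* restore, as XMMS_RESTORE   */
		"movaps %%xmm0, (%2)\n\t"   /* read the result back       */
		: : "r" (xmm_save), "r" (in), "r" (out)
		: "xmm0", "memory");

	return out[0] == 1 && out[1] == 2 && out[2] == 3 && out[3] == 4;
}
```

The stack buffer is the point Chuck makes above: with the save area on the stack, the code is reentrant, so being scheduled away mid-copy cannot corrupt another context's save area.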
 
__
Chuck

* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-16 18:09 Chuck Ebbert
  2005-08-16 23:21 ` Hiro Yoshioka
  0 siblings, 1 reply; 63+ messages in thread
From: Chuck Ebbert @ 2005-08-16 18:09 UTC (permalink / raw)
  To: Hiro Yoshioka
  Cc: lkml.hyoshiok@gmail.com, taka@valinux.co.jp, Arjan van de Ven,
	linux-kernel

On Tue, 16 Aug 2005 at 19:16:17 +0900 (JST), Hiro Yoshioka wrote:

> oh, really? Does the linux kernel take care of
> SSE save/restore on a task switch?

 Check out XMMS_SAVE and XMMS_RESTORE in include/asm-i386/xor.h


__
Chuck

[parent not found: <20050816.131729.15816429.taka@valinux.co.jp.suse.lists.linux.kernel>]
[parent not found: <20050815121555.29159.qmail@science.horizon.com.suse.lists.linux.kernel>]
* Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-15 12:15 linux
  2005-08-15 12:25 ` Arjan van de Ven
  0 siblings, 1 reply; 63+ messages in thread
From: linux @ 2005-08-15 12:15 UTC (permalink / raw)
  To: linux-kernel

Actually, is there any place *other* than write() to the page cache that
warrants a non-temporal store?  Network sockets with scatter/gather and
hardware checksum, maybe?

This is pretty much synonymous with what is allowed to go into high
memory, no?


While we're on the subject, for the copy_from_user source, prefetchnta is
probably indicated.  If user space hasn't caused it to be cached already
(admittedly, the common case), we *know* the kernel isn't going to look
at that data again.

* [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
@ 2005-08-14  9:16 Hiro Yoshioka
  2005-08-14  9:41 ` Arjan van de Ven
  0 siblings, 1 reply; 63+ messages in thread
From: Hiro Yoshioka @ 2005-08-14  9:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: Hiro Yoshioka

Hi,

The following is a patch to reduce the cache pollution
of __copy_from_user_ll().

When I ran a simple iozone benchmark to find a performance bottleneck in
the Linux kernel, I found that __copy_from_user_ll() consumed the most
CPU cycles and caused many cache misses.

The following is profiled by oprofile.

Top 5 CPU cycle
CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
281538   15.2083  vmlinux                  __copy_from_user_ll
81069     4.3792  vmlinux                  _spin_lock
75523     4.0796  vmlinux                  journal_add_journal_head
63674     3.4396  vmlinux                  do_get_write_access
52634     2.8432  vmlinux                  journal_put_journal_head
(pattern9-0-cpu4-0-08141700/summary.out)

Top 5 Memory Access and Cache miss
CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name                 symbol name
120801    7.4379  37017    63.4603  vmlinux                  __copy_from_user_ll
84139     5.1806  885       1.5172  vmlinux                  _spin_lock
66027     4.0654  656       1.1246  vmlinux                  journal_add_journal_head
60400     3.7189  250       0.4286  vmlinux                  __find_get_block
60032     3.6963  120       0.2057  vmlinux                  journal_dirty_metadata

__copy_from_user_ll accounted for 63.4603% of the L3 cache misses though
only 7.4379% of the memory accesses.

In order to reduce the cache miss in the __copy_from_user_ll, I made
the following patch and confirmed the reduction of the miss.

Top 5 CPU cycle
CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
120717    8.3454  vmlinux                  _mmx_memcpy_nt
65955     4.5596  vmlinux                  do_get_write_access
56088     3.8775  vmlinux                  journal_put_journal_head
52550     3.6329  vmlinux                  journal_dirty_metadata
38886     2.6883  vmlinux                  journal_add_journal_head
pattern9-0-cpu4-0-08141627/summary.out

_mmx_memcpy_nt is the new function called from __copy_from_user_ll;
it spent only 42.88% of the cycles of the original implementation
(120717/281538 = 42.88%).

Top 5 Memory Access
CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name                 symbol name
90918     6.3079  89        0.5673  vmlinux                  _mmx_memcpy_nt
83654     5.8039  177       1.1283  vmlinux                  journal_dirty_metadata
57836     4.0127  348       2.2183  vmlinux                  journal_put_journal_head
48236     3.3466  165       1.0518  vmlinux                  do_get_write_access
44546     3.0906  21        0.1339  vmlinux                  __getblk

The cache misses were reduced from 37017 (63.4603%) to 89 (0.5673%),
i.e. 0.24% of the original implementation.

The actual elapsed times over five runs were 229.76 (sec) and
222.94 (sec). (229.76/222.94 = 1.0306, a 3.06% gain.)

iozone -CMR -i 0 -+n -+u -s 8000MB -t 4 

What do you think?

--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c	2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4/arch/i386/lib/usercopy.c	2005-08-12 13:18:14.106916200 +0900
@@ -10,6 +10,7 @@
  #include <linux/highmem.h>
  #include <linux/blkdev.h>
  #include <linux/module.h>
+#include <asm/i387.h>
  #include <asm/uaccess.h>
  #include <asm/mmx.h>
 
@@ -511,6 +512,108 @@
 		: "memory");						\
 } while (0)
 
+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware                       */
+/* hyoshiok@miraclelinux.com               */
+static unsigned long _mmx_memcpy_nt(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+	        __copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+        kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"1: prefetchnta (%0)\n"		/* This set is 28 bytes */
+		"   prefetchnta 64(%0)\n"
+		"   prefetchnta 128(%0)\n"
+		"   prefetchnta 192(%0)\n"
+		"   prefetchnta 256(%0)\n"
+		"2:  \n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from) );
+		
+	for(; i>5; i--)
+	{
+		__asm__ __volatile__ (
+		"1:  prefetchnta 320(%0)\n"
+		"2:  movq (%0), %%mm0\n"
+		"  movq 8(%0), %%mm1\n"
+		"  movq 16(%0), %%mm2\n"
+		"  movq 24(%0), %%mm3\n"
+		"  movntq %%mm0, (%1)\n"
+		"  movntq %%mm1, 8(%1)\n"
+		"  movntq %%mm2, 16(%1)\n"
+		"  movntq %%mm3, 24(%1)\n"
+		"  movq 32(%0), %%mm0\n"
+		"  movq 40(%0), %%mm1\n"
+		"  movq 48(%0), %%mm2\n"
+		"  movq 56(%0), %%mm3\n"
+		"  movntq %%mm0, 32(%1)\n"
+		"  movntq %%mm1, 40(%1)\n"
+		"  movntq %%mm2, 48(%1)\n"
+		"  movntq %%mm3, 56(%1)\n"
+		".section .fixup, \"ax\"\n"
+		"3: movw $0x05EB, 1b\n"	/* jmp on 5 bytes */
+		"   jmp 2b\n"
+		".previous\n"
+		".section __ex_table,\"a\"\n"
+		"	.align 4\n"
+		"	.long 1b, 3b\n"
+		".previous"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+
+	for(; i>0; i--)
+	{
+		__asm__ __volatile__ (
+		"  movq (%0), %%mm0\n"
+		"  movq 8(%0), %%mm1\n"
+		"  movq 16(%0), %%mm2\n"
+		"  movq 24(%0), %%mm3\n"
+		"  movntq %%mm0, (%1)\n"
+		"  movntq %%mm1, 8(%1)\n"
+		"  movntq %%mm2, 16(%1)\n"
+		"  movntq %%mm3, 24(%1)\n"
+		"  movq 32(%0), %%mm0\n"
+		"  movq 40(%0), %%mm1\n"
+		"  movq 48(%0), %%mm2\n"
+		"  movq 56(%0), %%mm3\n"
+		"  movntq %%mm0, 32(%1)\n"
+		"  movntq %%mm1, 40(%1)\n"
+		"  movntq %%mm2, 48(%1)\n"
+		"  movntq %%mm3, 56(%1)\n"
+		: : "r" (from), "r" (to) : "memory");
+		from+=64;
+		to+=64;
+	}
+	/*
+	 *	Now do the tail of the block
+	 */
+	kernel_fpu_end();
+	if(i=(len&63))
+	  __copy_user_zeroing(to, from, i);
+	return i;
+}
 
  unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned long n)
 {
@@ -575,10 +678,14 @@
  __copy_from_user_ll(void *to, const void __user *from, unsigned long n)
 {
 	BUG_ON((long)n < 0);
-	if (movsl_is_ok(to, from, n))
+	if (n < 512) {
+	  if (movsl_is_ok(to, from, n))
 		__copy_user_zeroing(to, from, n);
-	else
+	  else
 		n = __copy_user_zeroing_intel(to, from, n);
+	}
+	else
+	  n = _mmx_memcpy_nt(to, from, n);
 	return n;
 }
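For readers without a kernel tree at hand, the core idea of _mmx_memcpy_nt can be sketched in userspace. This is a hedged stand-in, not the patch's code: it uses SSE2 intrinsics (movntdq via _mm_stream_si128) instead of the kernel's MMX movntq, the name memcpy_nt is invented, and it assumes an x86 target with 16-byte alignment for the non-temporal path:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>	/* SSE2: _mm_stream_si128, _mm_sfence */

/* Illustrative userspace analogue of the patch: copy 64-byte blocks
 * with non-temporal stores so the destination does not pollute the
 * cache, and fall back to a plain copy for unaligned buffers and for
 * the tail (the patch uses __copy_user_zeroing for its tail). */
static void *memcpy_nt(void *to, const void *from, size_t len)
{
	char *d = to;
	const char *s = from;
	size_t i, blocks = len >> 6;	/* len/64, as in the patch */

	/* movntdq/movdqa need 16-byte alignment; bail out otherwise. */
	if (((uintptr_t)d | (uintptr_t)s) & 15)
		return memcpy(to, from, len);

	for (i = 0; i < blocks; i++) {
		__m128i x0 = _mm_load_si128((const __m128i *)(s +  0));
		__m128i x1 = _mm_load_si128((const __m128i *)(s + 16));
		__m128i x2 = _mm_load_si128((const __m128i *)(s + 32));
		__m128i x3 = _mm_load_si128((const __m128i *)(s + 48));
		_mm_stream_si128((__m128i *)(d +  0), x0);
		_mm_stream_si128((__m128i *)(d + 16), x1);
		_mm_stream_si128((__m128i *)(d + 32), x2);
		_mm_stream_si128((__m128i *)(d + 48), x3);
		s += 64;
		d += 64;
	}
	_mm_sfence();			/* order NT stores before return */
	memcpy(d, s, len & 63);		/* now do the tail of the block */
	return to;
}
```

Note that a userspace sketch sidesteps the two hard parts the patch must handle: saving FPU/MMX state (kernel_fpu_begin/end) and surviving faults on user addresses (the __ex_table fixups).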

Thanks in advance,
  Hiro
--
hyoshiok at miraclelinux.com


end of thread, other threads:[~2005-09-03 12:03 UTC | newest]

Thread overview: 63+ messages
-- links below jump to the message on this page --
2005-08-14 21:24 [RFC] [PATCH] cache pollution aware __copy_from_user_ll() Ian Kumlien
2005-08-15  7:21 ` Arjan van de Ven
2005-08-15 14:49   ` Ian Kumlien
     [not found] <20050818.201138.607962419.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
     [not found] ` <98df96d30508181629d85edb5@mail.gmail.com.suse.lists.linux.kernel>
     [not found]   ` <20050823.081246.846946371.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
     [not found]     ` <20050824.231156.278740508.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
2005-08-24 16:18       ` Andi Kleen
2005-08-25  4:54         ` Hiro Yoshioka
2005-09-01  9:07           ` Hiro Yoshioka
2005-09-01  9:36             ` Andi Kleen
2005-09-02  1:43               ` Hiro Yoshioka
2005-09-02  2:06                 ` Andi Kleen
2005-09-02  2:08                 ` Andrew Morton
2005-09-02  2:17                   ` Andi Kleen
2005-09-02  2:28                     ` Andrew Morton
2005-09-02  3:41                       ` Hiro Yoshioka
2005-09-02  4:29             ` Andrew Morton
2005-09-02  4:37               ` Hiro Yoshioka
2005-09-03 11:59                 ` Hiro Yoshioka
  -- strict thread matches above, loose matches on Subject: below --
2005-08-17 15:19 Chuck Ebbert
2005-08-18  9:45 ` Hiro Yoshioka
2005-08-16 18:09 Chuck Ebbert
2005-08-16 23:21 ` Hiro Yoshioka
2005-08-17  4:50   ` Hiro Yoshioka
     [not found] <20050816.131729.15816429.taka@valinux.co.jp.suse.lists.linux.kernel>
     [not found] ` <20050816.135425.719901536.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
     [not found]   ` <1124171015.3215.0.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>
     [not found]     ` <20050816.191617.1025215458.hyoshiok@miraclelinux.com.suse.lists.linux.kernel>
     [not found]       ` <1124187950.3215.31.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>
2005-08-16 13:15         ` Andi Kleen
2005-08-18 11:06           ` Hiro Yoshioka
2005-08-18 11:11             ` Hiro Yoshioka
2005-08-18 23:29               ` Hiro Yoshioka
2005-08-22  1:24                 ` Hiro Yoshioka
2005-08-22 13:07                   ` Andi Kleen
2005-08-22  2:43                 ` Hiro Yoshioka
2005-08-22 23:12                 ` Hiro Yoshioka
2005-08-24 14:11                   ` Hiro Yoshioka
2005-08-24 14:21                     ` Arjan van de Ven
2005-08-24 16:22                     ` Hirokazu Takahashi
2005-08-25  4:53                       ` Hiro Yoshioka
     [not found] <20050815121555.29159.qmail@science.horizon.com.suse.lists.linux.kernel>
     [not found] ` <1124108702.3228.33.camel@laptopd505.fenrus.org.suse.lists.linux.kernel>
2005-08-15 15:02   ` Andi Kleen
2005-08-15 15:09     ` Arjan van de Ven
2005-08-15 15:13       ` Andi Kleen
2005-08-15 12:15 linux
2005-08-15 12:25 ` Arjan van de Ven
2005-08-14  9:16 Hiro Yoshioka
2005-08-14  9:41 ` Arjan van de Ven
2005-08-14 10:22   ` Hiro Yoshioka
2005-08-14 10:35     ` Arjan van de Ven
2005-08-14 10:45       ` Christoph Hellwig
2005-08-15  6:43       ` Hiro Yoshioka
2005-08-15  7:16         ` Arjan van de Ven
2005-08-15  8:44           ` Hiro Yoshioka
2005-08-15  8:53             ` Arjan van de Ven
2005-08-15 23:33               ` Hiro Yoshioka
2005-08-16  3:30                 ` Hiro Yoshioka
2005-08-16  4:17                   ` Hirokazu Takahashi
2005-08-16  4:54                     ` Hiro Yoshioka
2005-08-16  5:43                       ` Arjan van de Ven
2005-08-16 10:16                         ` Hiro Yoshioka
2005-08-16 10:19                           ` Hirokazu Takahashi
2005-08-16 10:25                           ` Arjan van de Ven
2005-08-16 10:24                             ` Hirokazu Takahashi
2005-08-16  5:44                     ` Arjan van de Ven
2005-08-16  5:49                   ` Arjan van de Ven
     [not found]                     ` <20050817.110503.97359275.taka@valinux.co.jp>
2005-08-17  5:10                       ` Hiro Yoshioka
2005-08-17 14:30                         ` Akira Tsukamoto
2005-08-17 15:27                           ` Akira Tsukamoto
2005-08-18 17:53                             ` Lee Revell
2005-08-18  2:37                           ` Akira Tsukamoto
