public inbox for linux-kernel@vger.kernel.org
* [BK+PATCH] remove __constant_memcpy
@ 2003-04-17  0:57 Jeff Garzik
  2003-04-17  1:04 ` Jeff Garzik
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17  0:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: LKML

[-- Attachment #1: Type: text/plain, Size: 1763 bytes --]

Linus,

Please review the patch below, then do a

	bk pull http://gkernel.bkbits.net/misc-2.5

Summary:

gcc's __builtin_memcpy performs the same function (and more) as the 
kernel's __constant_memcpy.  So, let's remove __constant_memcpy, and let 
the compiler do it.

Instead of shouldering the burden of the kernel needing to have a 
decently-fast memcpy routine, I would prefer to hand off that 
maintenance burden to the compiler.  For the less common (read: 
non-Intel) processors, I bet this patch shows immediate asm-level 
benefits in instruction scheduling.

The patch below is the conservative, obvious patch.  It only kicks in 
when __builtin_constant_p() is true, and it only applies to the i386 
arch.  I'm currently running w/ 2.5.67+BK+patch and it's stable.  With 
some recently-acquired (but still nascent) x86 asm skills, I diff'd the 
before-and-after x86 asm cases for when a constant memcpy() call was 
made in the kernel code; nothing interesting.  The instruction sequence 
was usually longer once you exceeded ~32 byte memcpy, but it looked like 
it scheduled better on i686.  The small-copy cases looked reasonably 
equivalent.

The more radical direction, where I would eventually like to go, is to 
hand off all memcpy duties to the compiler, and -march=xxx selects the 
best memcpy strategies.  This "radical" direction requires a lot more 
work, benching both the kernel and gcc before and after the memcpy changes.

Finally, on a compiler note, __builtin_memcpy can fall back to emitting 
a memcpy function call.  Given the conservatism of my patch, this is 
unlikely, but it should be mentioned.  This also gives less-capable 
compilers the ability to simplify, by choosing the slow path of 
unconditionally emitting a memcpy call.

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 3492 bytes --]

# --------------------------------------------
# 03/04/16	jgarzik@redhat.com	1.1067.1.1
# [ia32] remove __constant_memcpy, use __builtin_memcpy
# 
# gcc's memcpy handling already takes into account the cases that
# include/asm-i386/string.h's __constant_memcpy takes into
# account.  __constant_memcpy is removed, and it is replaced
# with references to __builtin_memcpy.
# 
# Compilers that do not/cannot optimize __builtin_memcpy can choose
# the slow path and simply emit a memcpy function call.
# --------------------------------------------
#
diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
--- a/include/asm-i386/string.h	Wed Apr 16 20:19:42 2003
+++ b/include/asm-i386/string.h	Wed Apr 16 20:19:42 2003
@@ -208,75 +208,6 @@
 return (to);
 }
 
-/*
- * This looks horribly ugly, but the compiler can optimize it totally,
- * as the count is constant.
- */
-static inline void * __constant_memcpy(void * to, const void * from, size_t n)
-{
-	switch (n) {
-		case 0:
-			return to;
-		case 1:
-			*(unsigned char *)to = *(const unsigned char *)from;
-			return to;
-		case 2:
-			*(unsigned short *)to = *(const unsigned short *)from;
-			return to;
-		case 3:
-			*(unsigned short *)to = *(const unsigned short *)from;
-			*(2+(unsigned char *)to) = *(2+(const unsigned char *)from);
-			return to;
-		case 4:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			return to;
-		case 6:	/* for Ethernet addresses */
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(2+(unsigned short *)to) = *(2+(const unsigned short *)from);
-			return to;
-		case 8:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			return to;
-		case 12:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			return to;
-		case 16:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
-			return to;
-		case 20:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
-			*(4+(unsigned long *)to) = *(4+(const unsigned long *)from);
-			return to;
-	}
-#define COMMON(x) \
-__asm__ __volatile__( \
-	"rep ; movsl" \
-	x \
-	: "=&c" (d0), "=&D" (d1), "=&S" (d2) \
-	: "0" (n/4),"1" ((long) to),"2" ((long) from) \
-	: "memory");
-{
-	int d0, d1, d2;
-	switch (n % 4) {
-		case 0: COMMON(""); return to;
-		case 1: COMMON("\n\tmovsb"); return to;
-		case 2: COMMON("\n\tmovsw"); return to;
-		default: COMMON("\n\tmovsw\n\tmovsb"); return to;
-	}
-}
-  
-#undef COMMON
-}
-
 #define __HAVE_ARCH_MEMCPY
 
 #ifdef CONFIG_X86_USE_3DNOW
@@ -290,7 +221,7 @@
 static inline void * __constant_memcpy3d(void * to, const void * from, size_t len)
 {
 	if (len < 512)
-		return __constant_memcpy(to, from, len);
+		return __builtin_memcpy(to, from, len);
 	return _mmx_memcpy(to, from, len);
 }
 
@@ -314,7 +245,7 @@
  
 #define memcpy(t, f, n) \
 (__builtin_constant_p(n) ? \
- __constant_memcpy((t),(f),(n)) : \
+ __builtin_memcpy((t),(f),(n)) : \
  __memcpy((t),(f),(n)))
 
 #endif

* RE: [BK+PATCH] remove __constant_memcpy
@ 2003-04-17  2:22 Nakajima, Jun
  0 siblings, 0 replies; 27+ messages in thread
From: Nakajima, Jun @ 2003-04-17  2:22 UTC (permalink / raw)
  To: Linus Torvalds, Jeff Garzik; +Cc: LKML

I think the fear is valid. The Intel compiler, for example, will use FP when that is what it takes to optimize the code under a particular option, and it is not necessarily obvious from the option itself whether FP gets used (nor does that matter for most users). The same could happen with gcc.

I think we can do it better than the compiler does, because we know the usage better, e.g. alignment, typical size, etc.

Thanks,
Jun

> -----Original Message-----
> From: Linus Torvalds [mailto:torvalds@transmeta.com]
> Sent: Wednesday, April 16, 2003 7:06 PM
> To: Jeff Garzik
> Cc: LKML
> Subject: Re: [BK+PATCH] remove __constant_memcpy
> 
> 
> On Wed, 16 Apr 2003, Jeff Garzik wrote:
> >
> > gcc's __builtin_memcpy performs the same function (and more) as the
> > kernel's __constant_memcpy.  So, let's remove __constant_memcpy, and let
> > the compiler do it.
> 
> Please don't.
> 
> There's no way gcc will EVER get the SSE2 cases right. It just cannot do
> it. In fact, I live in fear that we will have to turn off the compiler
> intrisics entirely some day just because there is always the worry that
> gcc will start using FP.
> 
> So the advantage of doing our own memcpy() is not that it's necessarily
> faster than the gcc built-in, but simply because I do not believe that the
> gcc people care enough about the kernel to let them make the choice.
> 
> 		Linus
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

* Re: [BK+PATCH] remove __constant_memcpy
@ 2003-04-17 23:50 Chuck Ebbert
  0 siblings, 0 replies; 27+ messages in thread
From: Chuck Ebbert @ 2003-04-17 23:50 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel

Jeff Garzik wrote: 


>> On x86, gcc doesn't have such an option, although "-mno-sse" and
>> "-mno-sse2" probably come closest (and we should probably use them, but
>> since older gcc's don't know about it and it hasn't been an issue yet we
>> haven't).
>
> gcc on x86 definitely wants a -fdontyoudareusefloatingpoint... The
> following snippet from the -msoft-float docs isn't encouraging:
>
>     On machines where a function returns floating point results in the
>     80387 register stack, some floating point opcodes may be emitted
>     even if `-msoft-float' is used.


  -mno-fp-ret-in-387 should fix that.



--
 Chuck


end of thread, other threads:[~2003-04-18 14:53 UTC | newest]

Thread overview: 27+ messages
2003-04-17  0:57 [BK+PATCH] remove __constant_memcpy Jeff Garzik
2003-04-17  1:04 ` Jeff Garzik
2003-04-17  2:06 ` Linus Torvalds
2003-04-17  8:46   ` Arjan van de Ven
2003-04-17  9:02     ` Roman Zippel
2003-04-17  9:04       ` Arjan van de Ven
2003-04-17  9:11         ` Jakub Jelinek
2003-04-17 16:07     ` Linus Torvalds
2003-04-17 19:07       ` Jeff Garzik
2003-04-17 19:19       ` Jeff Garzik
2003-04-17 19:54         ` Linus Torvalds
2003-04-17 23:49           ` Jeff Garzik
2003-04-17 23:52             ` Jeff Garzik
2003-04-17 23:59             ` Linus Torvalds
2003-04-18  0:29               ` H. Peter Anvin
2003-04-18  9:06             ` Arjan van de Ven
2003-04-18 14:31             ` Timothy Miller
2003-04-18 15:07               ` Richard B. Johnson
2003-04-17 22:58         ` J.A. Magallon
2003-04-17 23:10           ` Jeff Garzik
2003-04-17 13:17 ` Alan Cox
2003-04-17 13:17 ` Alan Cox
2003-04-17 14:32   ` Jeff Garzik
2003-04-17 14:40     ` Jeff Garzik
2003-04-17 20:01   ` H. Peter Anvin
  -- strict thread matches above, loose matches on Subject: below --
2003-04-17  2:22 Nakajima, Jun
2003-04-17 23:50 Chuck Ebbert
