public inbox for linux-kernel@vger.kernel.org
From: Andrea Arcangeli <andrea@suse.de>
To: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@transmeta.com>,
	Brian Gerst <bgerst@didntduck.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	linux-kernel@vger.kernel.org, jh@suse.cz
Subject: Re: [PATCH] Re: SSE related security hole
Date: Sat, 20 Apr 2002 23:28:18 +0200
Message-ID: <20020420232818.N1291@dualathlon.random>
In-Reply-To: <20020420201205.M1291@dualathlon.random> <Pine.LNX.4.33.0204201221120.11732-100000@penguin.transmeta.com> <20020420214114.A11894@wotan.suse.de>

On Sat, Apr 20, 2002 at 09:41:14PM +0200, Andi Kleen wrote:
> > Besides, I seriously doubt it is any faster than what is there already.
> > 
> > Time it, and notice how:
> > 
> >  - fninit takes about 200 cycles
> >  - fxrstor takes about 215 cycles
> 
> On what CPU? 
> 
> I checked the Athlon4 optimization manual and fxrstor is listed as 68/108
> cycles (i guess depending on whether there is XMM state or not so 68 cycles
> probably apply here) and fninit as 91 cycles. It doesn't list the SSE1 
> timings, but i guess the instructions don't take more than 3 cycles
> (MMX instructions take that long). So Andrea's way should be 
> 91+16*3=139+some cycles for emms (or 107 if sse ops take only a single cycle) 
> vs 68 or 108.  So the fxrstor wins well. 
> 
> On x86-64 the difference is even bigger because it has 16 XMM registers instead
> of 8.
> 
> 
> > In short, your "fast" code isn't actually any faster than doing it right.
> 
> At least on Athlon it should be slower. 

pxor+xorps is definitely faster than fxrstor on an Athlon MP.

fxrstor on an Athlon MP 1600, cold cache (the "default FPU state" will
be cold in cache most of the time; it's only ever used at the first
math fault of a task):

dualathlon:/home/andrea # for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ;
done
cycles 3507
cycles 3293
cycles 3179
cycles 3267
cycles 3374
cycles 3213
cycles 2575
cycles 3306
cycles 2915
cycles 3078

pxor+xorps on the same hardware, cold cache:

dualathlon:/home/andrea # for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ;
done
cycles 2444
cycles 1619
cycles 2396
cycles 2644
cycles 2322
cycles 2404
cycles 2266
cycles 1657
cycles 2182
cycles 2077

And here it is again on a PIII 800MHz, for confirmation on a true Intel CPU:

fxrstor:

for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ; done
cycles 3461
cycles 3294
cycles 3339
cycles 3276
cycles 3620
cycles 3443
cycles 3401
cycles 3455
cycles 3515
cycles 3327

pxor+xorps:

for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ; done
cycles 2288
cycles 2181
cycles 2430
cycles 2486
cycles 2360
cycles 2534
cycles 2243
cycles 2199
cycles 2153
cycles 2396

As said in an earlier email, it is a matter of memory bandwidth: 59 bytes
of icache and zero data, against 7 bytes of icache and 512 bytes of data.
The 512 bytes of data are visibly slower, period. Saving memory bandwidth
is much more important than reducing the number of instructions, even
more so on SMP!

On x86-64 it will be exactly the same: the memory bandwidth is much
higher, but it's still a major bottleneck compared to executing a few
more instructions (also consider that the current ia32 code probably
isn't reading the xmm8-15 slots, so on x86-64 fxrstor will potentially
be even slower compared to the in-core xorps, which never does any
I/O to memory).

Go ahead and try it yourself, setting the #if to 0 or 1. The benchmark
is very accurate.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/io.h>
#include <asm/msr.h>

struct i387_fxsave_struct {
	unsigned short	cwd;
	unsigned short	swd;
	unsigned short	twd;
	unsigned short	fop;
	long	fip;
	long	fcs;
	long	foo;
	long	fos;
	long	mxcsr;
	long	reserved;
	long	st_space[32];	/* 8*16 bytes for each FP-reg = 128 bytes */
	long	xmm_space[32];	/* 8*16 bytes for each XMM-reg = 128 bytes */
	long	padding[56];
} __attribute__ ((aligned (16)));

#define LOOPS 1

#define load_mxcsr( val ) do { \
	unsigned long __mxcsr = ((unsigned long)(val) & 0xffbf); \
	asm volatile( "ldmxcsr %0" : : "m" (__mxcsr) ); \
} while (0)

struct i387_fxsave_struct i387;
char buf[1024*1024*40];

static void cold_dcache(void)
{
	memset(buf, 0, 1024*1024*40);
}

int main(void)
{
	unsigned long before, after;
	int i;

	iopl(3);
	mlockall(MCL_FUTURE);

#if 1
	cold_dcache();
	asm("cli");

	rdtscl(before);

	__asm__("fninit");

	asm volatile(
		     "pxor %mm0, %mm0	\n"
		     "xorps %xmm0, %xmm0	\n"
		     "pxor %mm1, %mm1	\n"
		     "xorps %xmm1, %xmm1	\n"
		     "pxor %mm2, %mm2	\n"
		     "xorps %xmm2, %xmm2	\n"
		     "pxor %mm3, %mm3	\n"
		     "xorps %xmm3, %xmm3	\n"
		     "pxor %mm4, %mm4	\n"
		     "xorps %xmm4, %xmm4	\n"
		     "pxor %mm5, %mm5	\n"
		     "xorps %xmm5, %xmm5	\n"
		     "pxor %mm6, %mm6	\n"
		     "xorps %xmm6, %xmm6	\n"
		     "pxor %mm7, %mm7	\n"
		     "xorps %xmm7, %xmm7	\n"
		     "emms			\n");
	load_mxcsr(0x1f80);
	rdtscl(after);
#else
	/* fxsave writes the 512-byte image: output constraint */
	asm volatile("fxsave %0" : "=m" (i387));

	cold_dcache();
	asm("cli");
	rdtscl(before);

	/* fxrstor reads the 512-byte image: input constraint */
	asm volatile("fxrstor %0" : : "m" (i387));

	rdtscl(after);
#endif

	printf("cycles %lu\n", after-before);
	return 0;
}

As said, for SSE3 a kernel modification will be required anyway if xorps
isn't enough to initialize the new FPU state, so the current 2.4 kernel
cannot cope with it regardless of whether it uses fxrstor or not.

So I don't buy that fxrstor is better, nor more generic.

Andrea

