public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* ABI change for device drivers using future AVX instruction set
@ 2008-06-25 15:32 Agner Fog
  2008-06-25 16:22 ` Arjan van de Ven
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Agner Fog @ 2008-06-25 15:32 UTC (permalink / raw)
  To: linux-kernel

The announced future Intel AVX instruction set extends the 128-bit XMM 
registers to 256-bit YMM registers.

Intel has proposed relevant ABI extensions, which I assume will be 
adopted in the System V ABI. See references below.

Some details are not covered in the Intel documents. I have discussed 
this with an Intel engineer who has supplied all the details I asked 
for. I have listed the necessary ABI changes in detail in my manual on 
calling conventions (see below).

One problem that has not been resolved yet, AFAIK, is how to handle the 
possible use of YMM registers in device drivers. Should these registers 
be saved before calling a device driver from an interrupt or should it 
be the responsibility of the device driver?

This is particularly problematic for the following reasons:

1. The YMM registers must be saved with the new instruction XSAVE and 
restored with XRESTOR if done in the device driver. Saving registers 
individually will be incompatible with future extensions of the register 
size to more than 256 bits.

2. There is a performance cost to using XSAVE / XRESTOR.

3. When compiling a device driver, the compiler may insert implicit 
calls to library functions such as memcpy and memset. These functions 
typically have a CPU dispatcher that chooses the largest register size 
available. The device driver may therefore use YMM registers without the 
knowledge of the programmer and without compiling with the AVX switch on.

4. The consequences of failing to save the YMM registers properly would 
be intermittent and irreproducible errors that are difficult to trace.

The possible solutions, as I see it, are:

A. The operating system saves the state with XSAVE before calling a 
device driver from an interrupt and restores with XRESTOR. The device 
driver can use any register. This method is safe but has a performance 
penalty.
 
B. The operating system disables the use of YMM registers with the 
instruction XSETBV before calling a device driver from an interrupt. If 
the device driver needs to use YMM registers it must save the state with 
XSAVE before enabling YMM with XSETBV, and reverse these actions before 
returning.
 
C. Make it the responsibility of the device driver to avoid the use of 
YMM registers unless it saves the state with XSAVE. This solution 
requires that available compilers have a switch to disable calls to 
library functions with internal CPU dispatchers. Appears unsafe to me.

A decision on this question should be made and published in the ABI so 
that people can make compatible device drivers.

Note:
Please Cc: me on this thread. I am not on this mailing list and I am not 
involved with Linux development.

References:
Intel AVX programming reference:
http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel-AVX-Programming-Reference-31943302.pdf

Intel proposed ABI extensions (very brief):
http://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf

My interpretation of the ABI extensions in detail:
http://www.agner.org/optimize/calling_conventions.pdf

My discussion with Mark Buxton, Intel
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30257153.aspx


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-25 15:32 ABI change for device drivers using future AVX instruction set Agner Fog
@ 2008-06-25 16:22 ` Arjan van de Ven
  2008-06-25 19:54   ` Agner Fog
  2008-06-26  1:11   ` H. Peter Anvin
  2008-06-25 20:14 ` Alan Cox
  2008-06-26 14:01 ` Andi Kleen
  2 siblings, 2 replies; 25+ messages in thread
From: Arjan van de Ven @ 2008-06-25 16:22 UTC (permalink / raw)
  To: Agner Fog; +Cc: linux-kernel

On Wed, 25 Jun 2008 17:32:36 +0200
Agner Fog <agner@agner.org> wrote:

> 
> One problem that has not been resolved yet, AFAIK, is how to handle
> the possible use of YMM registers in device drivers. Should these
> registers be saved before calling a device driver from an interrupt
> or should it be the responsibility of the device driver?

the answer is: you don't use AVX in drivers just as you don't use SSE
in drivers.

Let me repeat this loud and clear:

It is not allowed to use floating point, SSE of AVX in device drivers.

(there are few places that very carefully do this anyway, the raid5
code being the most obvious one, but we should not add more)


> 2. There is a performance cost to using XSAVE / XRESTOR.

a patch introducing xsave/xrestor is being discussed already here...
it's not really more expensive.

> 
> 3. When compiling a device driver, the compiler may insert implicit 
> calls to library functions such as memcpy and memset. These functions 
> typically have a CPU dispatcher that chooses the largest register
> size available. The device driver may therefore use YMM registers
> without the knowledge of the programmer and without compiling with
> the AVX switch on.

this is not correct; the library functions are provided by the kernel
and do not use AVX etc.


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-25 16:22 ` Arjan van de Ven
@ 2008-06-25 19:54   ` Agner Fog
  2008-06-25 20:09     ` Arjan van de Ven
  2008-06-26  1:11   ` H. Peter Anvin
  1 sibling, 1 reply; 25+ messages in thread
From: Agner Fog @ 2008-06-25 19:54 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

Arjan van de Ven wrote:
 >Let me repeat this loud and clear:
 >It is not allowed to use floating point, SSE of AVX in device drivers.

Thank you for a clear answer. Re-reading the ABI:
 >A.2.4 Miscellaneous Remarks
 >Linux Kernel code is not allowed to change the x87 and SSE units. If 
those are
 >changed by kernel code, they have to be restored properly before 
sleeping or leaving
 >the kernel. On preemptive kernels also more precautions may be needed.

This states your point, but certainly not "loud and clear" :-)

This ABI statement has to be amended, however, because it is allowed to 
use any register as long it is "restored properly". A programmer can 
save a single XMM register, use it, and restore it. A programmer might 
think that it is possible to do the same with a YMM register, but this 
will not work on future systems that extend the 256-bit YMM registers to 
512-bit ZMM or whatever. This has to be said "loud and clear" because 
nobody would discover the error until a distant future when we have ZMM.

Let me explain the background for this as I understand it:
Intel have decided to make two versions of every existing XMM 
instruction in the new AVX: A legacy version that leaves the upper half 
of the YMM register unchanged, and a new AVX version that clears the 
upper part of the YMM register in order to avoid false dependencies. 
This very expensive decision was necessary for only one reason: To 
prevent existing device drivers from destroying the upper part of the 
YMM registers. Learning from past mistakes, they are saying already now 
that they will not use the same fix next time the registers are made 
bigger. When ZMM and still bigger registers are introduced, the YMM 
instructions will clear bit number 256 and up of the ZMM or whatever 
registers.

Note: Please Cc: me of any replies. I am not on the mailing list.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-25 19:54   ` Agner Fog
@ 2008-06-25 20:09     ` Arjan van de Ven
  0 siblings, 0 replies; 25+ messages in thread
From: Arjan van de Ven @ 2008-06-25 20:09 UTC (permalink / raw)
  To: Agner Fog; +Cc: linux-kernel

On Wed, 25 Jun 2008 21:54:04 +0200
Agner Fog <agner@agner.org> wrote:

> Arjan van de Ven wrote:
>  >Let me repeat this loud and clear:
>  >It is not allowed to use floating point, SSE of AVX in device
>  >drivers.
> 
> Thank you for a clear answer. Re-reading the ABI:
>  >A.2.4 Miscellaneous Remarks
>  >Linux Kernel code is not allowed to change the x87 and SSE units.
>  >If 
> those are
>  >changed by kernel code, they have to be restored properly before 
> sleeping or leaving
>  >the kernel. On preemptive kernels also more precautions may be
>  >needed.
> 
> This states your point, but certainly not "loud and clear" :-)
> 
> This ABI statement has to be amended, however, because it is allowed

the linux kernel policy is loud and clear; this is more an OS policy as
it is a platform ABI issue. We as linux do not allow this (with the
small footnote around the raid stuff, but that has specific very
careful use). In addition, you really aren't allowed to use floating
point in the kernel period as concept, if you need it you very likely
are doing something wrong.


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-25 15:32 ABI change for device drivers using future AVX instruction set Agner Fog
  2008-06-25 16:22 ` Arjan van de Ven
@ 2008-06-25 20:14 ` Alan Cox
  2008-06-26 14:01 ` Andi Kleen
  2 siblings, 0 replies; 25+ messages in thread
From: Alan Cox @ 2008-06-25 20:14 UTC (permalink / raw)
  To: Agner Fog; +Cc: linux-kernel

> 3. When compiling a device driver, the compiler may insert implicit 
> calls to library functions such as memcpy and memset. These functions 
> typically have a CPU dispatcher that chooses the largest register size 
> available. The device driver may therefore use YMM registers without the 
> knowledge of the programmer and without compiling with the AVX switch on.

Linux uses its own routines in these cases - and in fact for MMX we
generally avoid kernel usage in order to avoid the overheads (ditto FP).

> A decision on this question should be made and published in the ABI so 
> that people can make compatible device drivers.

The beauty of the Linux kernel being source based - we don't have to worry
about ABI problems like this.

Alan


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-25 16:22 ` Arjan van de Ven
  2008-06-25 19:54   ` Agner Fog
@ 2008-06-26  1:11   ` H. Peter Anvin
  2008-06-27 11:31     ` Agner Fog
  1 sibling, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2008-06-26  1:11 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Agner Fog, linux-kernel

Arjan van de Ven wrote:
> 
> Let me repeat this loud and clear:
> 
> It is not allowed to use floating point, SSE of AVX in device drivers.
> 
> (there are few places that very carefully do this anyway, the raid5
> code being the most obvious one, but we should not add more)
> 

Sadly, AVX repeats the mistakes of SSE1, and doesn't implement proper 
wide support for integer operations.  It has the basic bitwide stuff, 
which makes it usable for RAID-5, but it doesn't extend MMX (which SSE2 
eventually got around to), so it's not usable for RAID-6.  I'd hoped to 
find a VPERM-style instruction, like SSE5 has :(

	-hpa

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-25 15:32 ABI change for device drivers using future AVX instruction set Agner Fog
  2008-06-25 16:22 ` Arjan van de Ven
  2008-06-25 20:14 ` Alan Cox
@ 2008-06-26 14:01 ` Andi Kleen
  2 siblings, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2008-06-26 14:01 UTC (permalink / raw)
  To: Agner Fog; +Cc: linux-kernel

Agner Fog <agner@agner.org> writes:

> Some details are not covered in the Intel documents. I have discussed
> this with an Intel engineer who has supplied all the details I asked
> for. I have listed the necessary ABI changes in detail in my manual on
> calling conventions (see below).

The discussion there pretty much only applies to other operating
systems, not Linux.

> One problem that has not been resolved yet, AFAIK, is how to handle
> the possible use of YMM registers in device drivers. Should these
> registers be saved before calling a device driver from an interrupt or
> should it be the responsibility of the device driver?

The Linux kernel always saves FP and related state only in the function
that uses it (in the "device driver" in your terminology), never on
kernel entry/exit (at least on x86, some other architectures like
IA64 use slightly different schemes because there even standard 
integer code sometimes uses the FPU for common operations) 

Also actually using it is very rare. In the mainline kernel it's only
the software RAID support functions.

So you could say it's solved.

> This is particularly problematic for the following reasons:
>
> 1. The YMM registers must be saved with the new instruction XSAVE and
> restored with XRESTOR if done in the device driver. Saving registers
> individually will be incompatible with future extensions of the
> register size to more than 256 bits.

The device driver should know what registers it uses so it can 
call the right functions. But Linux has standard functions to do this
anyways, and I believe in the current XSAVE patchkit they already
use XSAVE.

> 3. When compiling a device driver, the compiler may insert implicit
> calls to library functions such as memcpy and memset. These functions
> typically have a CPU dispatcher that chooses the largest register size

Not in the Linux kernel.  Or rather there is a CPU dispatcher,
but it only selects between two simple integer copying strategies
(string instructions and an unrolled loop). 

That works because most operations in the Linux kernel are only
on 4K pieces where SSE* operations tend to not make too much sense.

> A. The operating system saves the state with XSAVE before calling a
> device driver from an interrupt and restores with XRESTOR. The device
> driver can use any register. This method is safe but has a performance
> penalty.

Linux never did that and there's no need to do it.

> C. Make it the responsibility of the device driver to avoid the use of
> YMM registers unless it saves the state with XSAVE. This solution
> requires that available compilers have a switch to disable calls to
> library functions with internal CPU dispatchers. Appears unsafe to me.

Kernel modules are always compiled with FP/MMX/SSE/etc. disabled
on the compiler level. And we only support one compiler (gcc)

BTW there's also a D. you didn't mention: many OS (including Linux) use lazy 
FPU saving schemes and the exception handler doing that could be fixed for 
supporting kernel code too. Right now it doesn't on Linux because there's 
no need.

-Andi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-26  1:11   ` H. Peter Anvin
@ 2008-06-27 11:31     ` Agner Fog
  2008-06-27 14:22       ` Arjan van de Ven
  0 siblings, 1 reply; 25+ messages in thread
From: Agner Fog @ 2008-06-27 11:31 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Arjan van de Ven, linux-kernel

Arjan van de Ven wrote:

>the linux kernel policy is loud and clear; this is more an OS policy as
>it is a platform ABI issue.

It doesn't help to say it "loud and clear" in a closed mailing list. It has to go into an official document that people can find, and the ABI is the most natural place to look for such rules. The ABI is hard enough to find. Is there an official OS policy document somewhere that I haven't found? Please point me to an authoritative document.

You can't blame driver makers for using XMM or YMM registers in inline assembly or intrinsic functions or calling their own libraries or using a different compiler unless there is an official rule against it written in some official document that is easy to find. If you want to move data in a device driver, it is tempting indeed to use the largest register size available.

I will write the rules in my manual, but it is not authoritative.


H. Peter Anvin wrote:
 >Sadly, AVX repeats the mistakes of SSE1, and doesn't implement proper 
wide support for integer operations.
 >It has the basic bitwide stuff, which makes it usable for RAID-5, but 
it doesn't extend MMX (which SSE2
 >eventually got around to), so it's not usable for RAID-6.  I'd hoped 
to find a VPERM-style instruction, like
 >SSE5 has :(

There is no problem in using the floating point permutation instructions 
on integer data. They will not generate exceptions or anything on 
denormal numbers. Only the smallest data size is 32 bits, of course. 
Integer YMM instructions will probably come in a later version.

Note: Please Cc: me on answers, I am not on the mailing list.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-27 11:31     ` Agner Fog
@ 2008-06-27 14:22       ` Arjan van de Ven
  2008-06-28  8:05         ` Agner Fog
  0 siblings, 1 reply; 25+ messages in thread
From: Arjan van de Ven @ 2008-06-27 14:22 UTC (permalink / raw)
  To: Agner Fog; +Cc: H. Peter Anvin, linux-kernel

On Fri, 27 Jun 2008 13:31:49 +0200
Agner Fog <agner@agner.org> wrote:

> Arjan van de Ven wrote:
> 
> >the linux kernel policy is loud and clear; this is more an OS policy
> >as it is a platform ABI issue.
> 
> It doesn't help to say it "loud and clear" in a closed mailing list.

lkml is rather public

> It has to go into an official document that people can find, and the
> ABI is the most natural place to look for such rules. The ABI is hard
> enough to find. Is there an official OS policy document somewhere
> that I haven't found? Please point me to an authoritative document.

DocBook/kernel-hacking.tmpl has a section about it.


> You can't blame driver makers for using XMM or YMM registers in
> inline assembly or intrinsic functions or calling their own libraries
> or using a different compiler unless there is an official rule
> against it written in some official document that is easy to find. If
> you want to move data in a device driver, it is tempting indeed to
> use the largest register size available.

the good news is that we review drivers before they get included and we
do catch such things. In addition, we give the compiler the command
line options that prevent it from using these instructions....
I don't think this is a problem in practice for Linux drivers.

(I'm deliberately ignoring non-opensource drivers here; you get what
you get with those; "just say no").



-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-27 14:22       ` Arjan van de Ven
@ 2008-06-28  8:05         ` Agner Fog
  2008-06-28  8:10           ` David Miller
                             ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Agner Fog @ 2008-06-28  8:05 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: H. Peter Anvin, linux-kernel

Arjan van de Ven wrote:
 >the good news is that we review drivers before they get included and 
we do catch such things.

Are you saying that some group should have the monopoly of approving 
device drivers? That is exectly the policy that Microsoft has been so 
much criticized for. You can only control the drivers that are included 
in your own Linux distribution.

Nothing should prevent me from publishing my own driver for some special 
purpose (in fact, I am going to do so). The open source principle would 
allow anybody to make a different compiler for device drivers, possibly 
using a different programming language. Or making a function library for 
use in device drivers.

Thank you for the reference to DocBook/kernel-hacking. The title reads 
"Unreliable Guide To Hacking The Linux Kernel". This doesn't really look 
like the place to look for authoritative information. It warns against 
using floating point/MMX, but not XMM. You can't blame programmers for 
making faulty device drivers when there is no authoritative guideline to 
follow.

You don't seem to realize the importance of proper documentation. Do you 
want me to write in my manual: "There is no authoritative information on 
this, but the Linux Junta says so-and-so.."?

How can you expect people to regard Linux as a serious and reliable 
alternative to ..BEEP.. when there is no proper documentation?

It is a disgrace that the ABI consists of fractions stored in weird 
places and some of them still drafts, but at least there is consensus on 
which one to look in for authoritative information. There should be an 
authoritative document on what you can do and not do in a device driver, 
and the ABI would be the most natural place to put this information. It 
is important to tell people that they can save and restore an XMM 
register under certain conditions, but not a YMM register. You can't 
expect people to search through a huge mailing list archive for such 
information.


Note: Please Cc: me of answers

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-28  8:05         ` Agner Fog
@ 2008-06-28  8:10           ` David Miller
  2008-06-28 11:47           ` Andi Kleen
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2008-06-28  8:10 UTC (permalink / raw)
  To: agner; +Cc: arjan, hpa, linux-kernel

From: Agner Fog <agner@agner.org>
Date: Sat, 28 Jun 2008 10:05:46 +0200

> Thank you for the reference to DocBook/kernel-hacking. The title reads 
> "Unreliable Guide To Hacking The Linux Kernel".

The title of that document is an application of "sarcasm".

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-28  8:05         ` Agner Fog
  2008-06-28  8:10           ` David Miller
@ 2008-06-28 11:47           ` Andi Kleen
  2008-06-28 15:09             ` Agner Fog
  2008-06-29 12:29           ` Alan Cox
  2008-07-01 15:30           ` Arjan van de Ven
  3 siblings, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2008-06-28 11:47 UTC (permalink / raw)
  To: Agner Fog; +Cc: Arjan van de Ven, H. Peter Anvin, linux-kernel

Agner Fog <agner@agner.org> writes:

I explained it earlier, but you didn't seem to have understood it.
Here's it again in a simpler form:

If a x86 device driver doesn't use the standard kernel interfaces
for saving/restoring extended state (kernel_fpu_begin/end) it will die. 
That is because Linux uses lazy FPU saving by default and when the lazy FPU exception
hits kernel code it will crash because that's not allowed. 

And the standard interfaces are going to handle all extended state supported
by the kernel. Full XSAVE support should be there for 2.6.27.

Pretty much the whole problem you describe in your original email doesn't
apply to Linux because of that.

-Andi


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-28 11:47           ` Andi Kleen
@ 2008-06-28 15:09             ` Agner Fog
  2008-06-28 15:44               ` Helge Hafting
  2008-06-28 20:02               ` Andi Kleen
  0 siblings, 2 replies; 25+ messages in thread
From: Agner Fog @ 2008-06-28 15:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Arjan van de Ven, H. Peter Anvin, linux-kernel

Andi Kleen wrote:
 >If a x86 device driver doesn't use the standard kernel interfaces for 
saving/restoring extended state
 > (kernel_fpu_begin/end) it will die.That is because Linux uses lazy 
FPU saving by default and when the
 >lazy FPU exception hits kernel code it will crash because that's not 
allowed. And the standard interfaces
 >are going to handle all extended state supported by the kernel. Full 
XSAVE support should be there for
 > 2.6.27.

Thank you Andi. Finally an exact and exhaustive answer. :-)
This info is not in the "Unreliable Guide To Hacking The Linux Kernel" 
or anywhere else except deeply hidden in the archives of this mailing 
list. I had to actually look into the source code of kernel_fpu_begin to 
verify that it saves not only the FPU but also the XMM registers and 
that it disables pre-emption.

You see why I want proper documentation? If this had been documented in 
some reference that was easy to find, I wouldn't have needed to take 
your time with all these questions...

Note: Please Cc: me of answers.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-28 15:09             ` Agner Fog
@ 2008-06-28 15:44               ` Helge Hafting
  2008-06-28 20:02               ` Andi Kleen
  1 sibling, 0 replies; 25+ messages in thread
From: Helge Hafting @ 2008-06-28 15:44 UTC (permalink / raw)
  To: Agner Fog; +Cc: Andi Kleen, Arjan van de Ven, H. Peter Anvin, linux-kernel

Agner Fog wrote:

> You see why I want proper documentation? If this had been documented in 
> some reference that was easy to find, I wouldn't have needed to take 
> your time with all these questions...

This is correct - but then someone would have to use time to write that 
documentation _and keep it up-to-date_. So far, nobody volunteered.
You should feel free to add your newly gained information to the
existing incomplete documentation. :)

This is how open-source works - there is no "boss/management"
that can put someone on the less popular jobs.
No matter how useful this might be.
You want something extra - driver or docs - you make it.

Helge Hafting

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-28 15:09             ` Agner Fog
  2008-06-28 15:44               ` Helge Hafting
@ 2008-06-28 20:02               ` Andi Kleen
  2008-06-29 11:33                 ` Avi Kivity
  1 sibling, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2008-06-28 20:02 UTC (permalink / raw)
  To: Agner Fog; +Cc: Andi Kleen, Arjan van de Ven, H. Peter Anvin, linux-kernel

> This info is not in the "Unreliable Guide To Hacking The Linux Kernel" 

The unreliable guide is a little outdated.

> or anywhere else except deeply hidden in the archives of this mailing 

Normally all introductions of kernel programming (like
the classic "Linux device drivers") should document that FPU code
is not allowed. The kernel also enforces that by passing special
compiler optionsthat cause compiler errors for FPU and SSE code.
You have to explicitely override those.

> list. I had to actually look into the source code of kernel_fpu_begin to 
> verify that it saves not only the FPU but also the XMM registers and 
> that it disables pre-emption.

The requirement to disable preemption is one reason why XMM in 
a driver is not a good idea BTW.  XMM should be normally only
used when you plan to spend a lot of CPU cycles (otherwise
the cost of saving the state is not amortized by the improvements).

But keeping preemption disabled for a long time is considered
unfriendly because it increases kernel latencies and might 
in the worst case cause visible scheduling problems like skipping audio
etc. 

> You see why I want proper documentation? If this had been documented in 
> some reference that was easy to find, I wouldn't have needed to take 
> your time with all these questions...

Yes agreed the documentation could be better. The standard Linux
reference documentation is "kerneldoc" which you can generate from the 
kernel source with make pdfdocs/mandocs/htmldocs
(some distributions ship pregenerated docs in a separate kernel-docs package)

Unfortunately kerneldoc documents a lot of obscure unimportant
functions, but doesn't document important functionality like
kernel_fpu_begin/end (or even important functions like kmalloc!)

There's a new editor for kernel documentation now so things might
improve.

-Andi


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-28 20:02               ` Andi Kleen
@ 2008-06-29 11:33                 ` Avi Kivity
  2008-06-29 12:21                   ` Andi Kleen
  0 siblings, 1 reply; 25+ messages in thread
From: Avi Kivity @ 2008-06-29 11:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Agner Fog, Arjan van de Ven, H. Peter Anvin, linux-kernel

Andi Kleen wrote:
>   
>> list. I had to actually look into the source code of kernel_fpu_begin to 
>> verify that it saves not only the FPU but also the XMM registers and 
>> that it disables pre-emption.
>>     
>
> The requirement to disable preemption is one reason why XMM in 
> a driver is not a good idea BTW.  XMM should be normally only
> used when you plan to spend a lot of CPU cycles (otherwise
> the cost of saving the state is not amortized by the improvements).
>
> But keeping preemption disabled for a long time is considered
> unfriendly because it increases kernel latencies and might 
> in the worst case cause visible scheduling problems like skipping audio
> etc. 
>   

This is fixable.  We could change kernel_fpu_begin() not to disable 
preemption, but instead set a task flag.  When we get the "no device" 
fault, if the flag is set, save the fpu state into the kernel fpu save 
area instead of the user fpu save area.  Similar logic for restoring the 
fpu state.

kvm runs with the guest fpu context while in the host kernel, using 
preempt notifiers to achieve the same thing.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-29 11:33                 ` Avi Kivity
@ 2008-06-29 12:21                   ` Andi Kleen
  2008-06-29 12:31                     ` Avi Kivity
  0 siblings, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2008-06-29 12:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Agner Fog, Arjan van de Ven, H. Peter Anvin, linux-kernel

Avi Kivity wrote:

> This is fixable. 

Sure nearly everything is fixable, but why would you want to do that?

For me the best fix currently seems to be to just not do that
when it hurts.

> We could change kernel_fpu_begin() not to disable
> preemption, but instead set a task flag.  When we get the "no device"
> fault, if the flag is set, save the fpu state into the kernel fpu save
> area 

What kernel fpu save area do you mean?

-Andi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-28  8:05         ` Agner Fog
  2008-06-28  8:10           ` David Miller
  2008-06-28 11:47           ` Andi Kleen
@ 2008-06-29 12:29           ` Alan Cox
  2008-07-01 15:30           ` Arjan van de Ven
  3 siblings, 0 replies; 25+ messages in thread
From: Alan Cox @ 2008-06-29 12:29 UTC (permalink / raw)
  To: Agner Fog; +Cc: Arjan van de Ven, H. Peter Anvin, linux-kernel

> Nothing should prevent me from publishing my own driver for some special 
> purpose (in fact, I am going to do so). The open source principle would 

Go ahead but the API is *source* level and changes each release so the
ABI question doesn't happen.

> It is a disgrace that the ABI 

We don't have an ABI.

Alan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-29 12:21                   ` Andi Kleen
@ 2008-06-29 12:31                     ` Avi Kivity
  2008-06-29 13:07                       ` Andi Kleen
  0 siblings, 1 reply; 25+ messages in thread
From: Avi Kivity @ 2008-06-29 12:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Agner Fog, Arjan van de Ven, H. Peter Anvin, linux-kernel

Andi Kleen wrote:
> Avi Kivity wrote:
>
>   
>> This is fixable. 
>>     
>
> Sure nearly everything is fixable, but why would you want to do that?
>
> For me the best fix currently seems to be to just not do that
> when it hurts.
>
>   

We already use sse in the kernel (raid), presumably disabling preemption 
there hurts some workloads.

With sse/avx gaining more features, we may see more requirements for 
kernel fpu.

>> We could change kernel_fpu_begin() not to disable
>> preemption, but instead set a task flag.  When we get the "no device"
>> fault, if the flag is set, save the fpu state into the kernel fpu save
>> area 
>>     
>
> What kernel fpu save area do you mean?
>   

A new one, of course.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-29 12:31                     ` Avi Kivity
@ 2008-06-29 13:07                       ` Andi Kleen
  2008-06-29 13:18                         ` Avi Kivity
  0 siblings, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2008-06-29 13:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Agner Fog, Arjan van de Ven, H. Peter Anvin, linux-kernel


>>> We could change kernel_fpu_begin() not to disable
>>> preemption, but instead set a task flag.  When we get the "no device"
>>> fault, if the flag is set, save the fpu state into the kernel fpu save
>>> area     
>>
>> What kernel fpu save area do you mean?
>>   
> 
> A new one, of course.
> 

With that we would be eventually in the mess Agner talked about.

-Andi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-29 13:07                       ` Andi Kleen
@ 2008-06-29 13:18                         ` Avi Kivity
  2008-06-29 13:23                           ` Andi Kleen
  0 siblings, 1 reply; 25+ messages in thread
From: Avi Kivity @ 2008-06-29 13:18 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Agner Fog, Arjan van de Ven, H. Peter Anvin, linux-kernel

Andi Kleen wrote:
>>>> We could change kernel_fpu_begin() not to disable
>>>> preemption, but instead set a task flag.  When we get the "no device"
>>>> fault, if the flag is set, save the fpu state into the kernel fpu save
>>>> area     
>>>>         
>>> What kernel fpu save area do you mean?
>>>   
>>>       
>> A new one, of course.
>>
>>     
>
> With that we would be eventually in the mess Agner talked about.
>
>   

If you use xsave, I don't see how this is different to the user fpu save 
area.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-29 13:18                         ` Avi Kivity
@ 2008-06-29 13:23                           ` Andi Kleen
  2008-06-29 13:29                             ` Avi Kivity
  2008-06-29 22:08                             ` H. Peter Anvin
  0 siblings, 2 replies; 25+ messages in thread
From: Andi Kleen @ 2008-06-29 13:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Agner Fog, Arjan van de Ven, H. Peter Anvin, linux-kernel

Avi Kivity wrote:
> Andi Kleen wrote:
>>>>> We could change kernel_fpu_begin() not to disable
>>>>> preemption, but instead set a task flag.  When we get the "no device"
>>>>> fault, if the flag is set, save the fpu state into the kernel fpu save
>>>>> area             
>>>> What kernel fpu save area do you mean?
>>>>         
>>> A new one, of course.
>>>
>>>     
>>
>> With that we would be eventually in the mess Agner talked about.
>>
>>   
> 
> If you use xsave, I don't see how this is different to the user fpu save
> area.

For once there's no clear error handling path for allocation failures
on the (arbitarily sized) xsave state. On user code that can be barely
tolerated, but for the kernel it would be deadly.

-Andi


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-29 13:23                           ` Andi Kleen
@ 2008-06-29 13:29                             ` Avi Kivity
  2008-06-29 22:08                             ` H. Peter Anvin
  1 sibling, 0 replies; 25+ messages in thread
From: Avi Kivity @ 2008-06-29 13:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Agner Fog, Arjan van de Ven, H. Peter Anvin, linux-kernel

Andi Kleen wrote:

  

>> If you use xsave, I don't see how this is different to the user fpu save
>> area.
>>     
>
> For once there's no clear error handling path for allocation failures
> on the (arbitarily sized) xsave state. On user code that can be barely
> tolerated, but for the kernel it would be deadly.
>   

if (kernel_fpu_begin() < 0)
     goto no_sse;

The question is not if it can be done or not, it's whether making raid 
preemptible is worthwhile, and whether we see/want more sse/avx 
accelerated drivers or not.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-29 13:23                           ` Andi Kleen
  2008-06-29 13:29                             ` Avi Kivity
@ 2008-06-29 22:08                             ` H. Peter Anvin
  1 sibling, 0 replies; 25+ messages in thread
From: H. Peter Anvin @ 2008-06-29 22:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Avi Kivity, Agner Fog, Arjan van de Ven, linux-kernel

Andi Kleen wrote:
>>>   
>> If you use xsave, I don't see how this is different to the user fpu save
>> area.
> 
> For once there's no clear error handling path for allocation failures
> on the (arbitarily sized) xsave state. On user code that can be barely
> tolerated, but for the kernel it would be deadly.
> 

Well, that's relatively easily dealt with... you'd have to allocate that 
state save area explicitly in kernel_fpu_begin(), and it would be up to 
the callers of that function to handle the resulting sleep and/or 
allocation failure -- we could even make kernel_fpu_begin() take a GFP_* 
flag.

Now, there are a few possibilities what to do with said state area.  One 
is to make it associated with the task; if used for something as RAID, 
this would mean pretty soon *all* (or nearly all) tasks have such a 
state area, and we might as well go back to allocating them preemptively 
on thread creation.  The other, of course, is to destroy it in 
kernel_fpu_end().  This may cause quite a bit of allocation/deallocation 
overhead, but perhaps this would be a decent use of a quicklist.

	-hpa



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ABI change for device drivers using future AVX instruction set
  2008-06-28  8:05         ` Agner Fog
                             ` (2 preceding siblings ...)
  2008-06-29 12:29           ` Alan Cox
@ 2008-07-01 15:30           ` Arjan van de Ven
  3 siblings, 0 replies; 25+ messages in thread
From: Arjan van de Ven @ 2008-07-01 15:30 UTC (permalink / raw)
  To: Agner Fog; +Cc: H. Peter Anvin, linux-kernel

On Sat, 28 Jun 2008 10:05:46 +0200
Agner Fog <agner@agner.org> wrote:

> Arjan van de Ven wrote:
>  >the good news is that we review drivers before they get included
>  >and 
> we do catch such things.
> 
> Are you saying that some group should have the monopoly of approving 
> device drivers? That is exectly the policy that Microsoft has been so 
> much criticized for. You can only control the drivers that are
> included in your own Linux distribution.

the linux model is that drivers (eventually) get included into the
linux kernel source code. Everything we do process wise is centered
around that concept.

Nothing prevents you from going outside the process. But you need to
realize that that is exactly that: outside the process. Meaning: you're
partially on your own.


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2008-07-01 15:31 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-25 15:32 ABI change for device drivers using future AVX instruction set Agner Fog
2008-06-25 16:22 ` Arjan van de Ven
2008-06-25 19:54   ` Agner Fog
2008-06-25 20:09     ` Arjan van de Ven
2008-06-26  1:11   ` H. Peter Anvin
2008-06-27 11:31     ` Agner Fog
2008-06-27 14:22       ` Arjan van de Ven
2008-06-28  8:05         ` Agner Fog
2008-06-28  8:10           ` David Miller
2008-06-28 11:47           ` Andi Kleen
2008-06-28 15:09             ` Agner Fog
2008-06-28 15:44               ` Helge Hafting
2008-06-28 20:02               ` Andi Kleen
2008-06-29 11:33                 ` Avi Kivity
2008-06-29 12:21                   ` Andi Kleen
2008-06-29 12:31                     ` Avi Kivity
2008-06-29 13:07                       ` Andi Kleen
2008-06-29 13:18                         ` Avi Kivity
2008-06-29 13:23                           ` Andi Kleen
2008-06-29 13:29                             ` Avi Kivity
2008-06-29 22:08                             ` H. Peter Anvin
2008-06-29 12:29           ` Alan Cox
2008-07-01 15:30           ` Arjan van de Ven
2008-06-25 20:14 ` Alan Cox
2008-06-26 14:01 ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox