On Friday 24 April 2009 17:30:35 David Daney wrote:
Kevin D. Kissell wrote:
Brian Foster wrote:
On Wednesday 22 April 2009 20:01:44 David Daney wrote:
Kevin D. Kissell wrote:
David Daney wrote:
This is a preliminary patch to add a vdso to all user processes.
[ ... ]
Note that for FPU-less CPUs, the kernel FP emulator also uses a user
stack trampoline to execute instructions in the delay slots of emulated
FP branches. [ ... ]
As David says, this is a Very Ugly Problem. Each FP trampoline
is effectively per-(runtime-)instance per-thread [ ... ]
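To make the "per-instance" point concrete, here is a minimal C sketch of what one delay-slot trampoline might contain. It assumes 32-bit MIPS instruction words and 32-bit user addresses. The struct `emu_frame`, the helper `build_frame()` and the cookie value are hypothetical illustrations, not the kernel's actual definitions. The trap word is an unaligned `lw $zero, 1($zero)` (encoding 0x8c000001), which raises an address error exception when executed:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of one delay-slot trampoline (assumes 32-bit words). */
struct emu_frame {
    uint32_t emul;     /* the delay-slot instruction being emulated */
    uint32_t badinst;  /* trap word: forces an address error right after it */
    uint32_t cookie;   /* marker the fault handler could check (never executed) */
    uint32_t epc;      /* where execution should resume after cleanup */
};

/* lw $zero, 1($zero): opcode 0x23 in bits 31..26, rs = rt = 0, offset 1.
 * The effective address 1 is not word-aligned, so the load traps. */
#define TRAP_UNALIGNED_LW 0x8c000001u
#define FRAME_COOKIE      0xfeedf00du  /* arbitrary, hypothetical marker */

/* Fill in a frame for one emulation instance. The contents depend on the
 * instruction being emulated and the resume PC, so they differ per instance. */
static void build_frame(struct emu_frame *f, uint32_t insn, uint32_t resume_pc)
{
    assert(f != 0);
    f->emul = insn;
    f->badinst = TRAP_UNALIGNED_LW;
    f->cookie = FRAME_COOKIE;
    f->epc = resume_pc;
}
```

For example, emulating `lw $v0, 0($a0)` (0x8c820000) and `addiu $v0, $v0, 1` (0x24420001) yields two frames that differ in their first word, which is why a single shared, read-only page of fixed code cannot serve every emulation. A real implementation would also have to copy the words into user memory and synchronize the instruction cache before jumping there; that step is omitted here.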
I haven't reviewed David's code in detail, but from his description, I
thought that there was a vdso page per task/thread. If there's only one
per processor, then, yes, that poses a challenge to porting the FPU
emulation code to use it, since, as you observe, the instruction
sequence to be executed may differ for each delay slot emulation. It
should still be possible, though. [ ... ]
Kevin is right, this is ugly.
My current plan is to map an anonymous page with execute permission for
each vma (process) and place all FP trampolines there. Each thread that
needs a trampoline will allocate a piece of this page and write the
trampoline. We can arrange it so that the only way a thread can exit
the trampoline is by taking some sort of fault (currently this is true
for the normal case), or exiting.
David,
The above is the bit which has always stumped me.
Having a per-process (or similar) page for the FP
trampoline(s) is the “obvious” approach, but what
has had me going around in circles is how to know
when an allocated slot/trampoline can be freed.
As you imply, in the normal case, it seems trivial.
It's the not-normal cases which aren't clear (or at
least aren't clear to me!).
You say (EMPHASIS added) “We can arrange it so
that the ONLY way a thread can exit the trampoline
is by taking some sort of fault ... or exiting”,
which if true, could solve the issue. Could you
elucidate on this point, please?
Well, he's *almost* right about that. The delay slot emulation function
executes a single instruction off the user stack/vdso slot, which is
followed in memory by an instruction that provokes an address
exception. The address exception handler detects the special case (and
it should be noted that detecting the special case could be made
simpler and more reliable if a vdso-type region were used), cleans up,
and restores normal stack behavior. That "clean up" could, of course,
include any necessary vdso slot management. But what about cases that
won't get to the magic alignment trap?
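Kevin's parenthetical about detection can be illustrated with a small C sketch: if all trampolines live in one known region, the address error handler only needs to check whether the faulting PC falls inside that region and, if so, at which word of which slot. This assumes the same hypothetical layout as before (16-byte slots, emulated instruction in word 0, trap in word 1, 256 slots); `classify_fault()` and the enum names are invented for illustration. The `FAULT_EMULATED_INSN` case is one of the paths that never reaches the trap: if the emulated load or store itself faults, execution stops there and the slot is still held. An asynchronous signal delivered while the PC is inside the trampoline, whose handler then longjmps away, is another.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_BYTES 16u   /* four 32-bit words per trampoline (assumed) */
#define NSLOTS     256u  /* one 4 KiB region (assumed) */

enum tramp_fault {
    FAULT_NOT_OURS,      /* PC outside the region: an ordinary fault */
    FAULT_EMULATED_INSN, /* word 0: the emulated delay-slot instruction faulted */
    FAULT_TRAP_REACHED,  /* word 1: the trap fired, normal completion */
    FAULT_BOGUS          /* inside the region but at an unexpected offset */
};

/* Classify a faulting PC against the region [base, base + NSLOTS * SLOT_BYTES).
 * When the PC is inside the region, *slot receives the slot index. */
static enum tramp_fault classify_fault(uintptr_t pc, uintptr_t base, unsigned *slot)
{
    uintptr_t size = (uintptr_t)NSLOTS * SLOT_BYTES;

    /* Written as pc - base >= size so the bounds check cannot overflow. */
    if (pc < base || pc - base >= size)
        return FAULT_NOT_OURS;

    uintptr_t off = pc - base;
    *slot = (unsigned)(off / SLOT_BYTES);
    switch (off % SLOT_BYTES) {
    case 0:  return FAULT_EMULATED_INSN;
    case 4:  return FAULT_TRAP_REACHED;
    default: return FAULT_BOGUS;
    }
}
```

On `FAULT_TRAP_REACHED` the handler would restore the saved EPC and free the slot; on `FAULT_EMULATED_INSN` the slot would have to be released by whatever path delivers the resulting signal, or at thread exit. That open question is exactly the one raised above.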