Hi.

I'll reply to that mail again because my previous
reply was done in a big hurry.
The new patch is attached. I applied most of the
optimizations you suggested and reverted the things
you said were unnecessary/wrong - thanks.
Does this one look better?

Petr Vandrovec wrote:
>> 1. I am not allocating the ring1 stack separately,
>> I am allocating it on a ring0 stack. Much simpler
>> code, although non-reentrant/preempt-unsafe.
> I think that it is problem.  And do not forget that NMI can occur at any
> place, so I'm not quite sure how your code works.
I assume this was a misunderstanding. Now I use
"preempt_stop" instead of "cli" to make it more
obvious what am I doing. So I assume that part is
preempt-unsafe, but not NMI-unsafe.

>> +#define NMI_STACK_RESERVE 0x400
>> +#define MAKE_ESPFIX \
>> +	cli; \
>> +	subl $NMI_STACK_RESERVE, %esp; \
> This is ugly.  What if NMI handler uses more than 1KB?
In that new patch I set the const to 0xe00, which
is 3,5K. Is it still the limitation? I can probably
raise it ever higher, but I am not sure if there is
some sensible data below the stack that I may overwrite
ocasionally?

> And ESP is not
> multiple of 4 here - it may slow down processor a bit.
Fixed with "pushl/movw".

> I'm not sure about lea speed.  And fast path should NOT jump,
> forward jumps are initialy predicted as not taken... Maybe:
Done. Looks much better now!

> And then you can move MAKE_ESPFIX out of macro.
He-he, but RESTORE_ALL is a macro itself, so... :)

> Maybe you want to introduce some per-thread flag,
What kind of flag do you have in mind? Flag the
modify_ldt() calls that set some segment to 16bit?
But is it really worth the 8 insns I use? Checking
the per-thread flag will cost you ~5 insns anyway,
and a headache...

> or leave syscall path
> to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?).
I wanted to do that, but then recalled the comment
of Pavel Machek, who thinks exporting the %esp
value to user-space is an "informational leak".
I personally don't see the security threat here,
but it looks safe to assume that Pavel is right.
In which case I think we have to root out the
bug from every path, including the syscall. So I
left that place as is for now.

>> -	if (regs->xcs & 3)
>> +	if (regs->xcs & 2)
>>  		printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp);
> I think you should leave this as is, regs->xss/regs->esp should be valid
> for CPL1 exceptions.
OK, everything you mentioned, is reverted.

> Rest looks fine, but I'm not completely sure that SMP and NMIs are
> handled in the best possible way.
Well, even though my patch can be improved, I don't
feel quite comfortable with its complexity.
Now I really want to re-assure myself that this all
is not possible to do on ring0, in which case this
will simply be the trivial thing.
Anonymous LKML reader (whom I have to thank for the
hints) proposed 2 things:
1. Try lss followed by iret. In that case the interrupt
will not kill us because after lss the interrupts are
not accepted until the next insn is executed, AFAIK.
But he was not sure, and I don't know either, if it is
true even for NMI. So if it is - we could just do
lss/iret, and the problem is solved. So is this true
for NMI? Is this documented anywhere?
2. Set task gate for NMI. AFAICS, the task-gate is
now used only for the doublefault exception, but not
for NMI. If I understand the idea correctly, this
will guarantee that the NMI will be executing on a
separate stack, which may be a good idea in any case,
and will allow us to use the small stack at ring-0,
if only we disable the interrupts. Are there any
chances to use the task-gate for NMI? I may even try to
implement that myself, but I guess there are *reasons*
why it is not yet? (of course this applies only if 1.
fails)

Or maybe somehow to modify the NMI handler itself,
so that it will check if it is on a "small" stack,
and switch to the "big" stack before anything else?
Well, doing the whole trick at ring-0 sounds really
plausible to me...