public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
From: Breno Leitao <leitao@debian.org>
To: Michael Ellerman <mpe@ellerman.id.au>, linuxppc-dev@lists.ozlabs.org
Cc: mikey@neuling.org, gromero@linux.vnet.ibm.com,
	"v3.9+" <stable@vger.kernel.org>
Subject: Re: [PATCH] powerpc/tm: Set MSR[TS] just prior to recheckpoint
Date: Wed, 21 Nov 2018 16:32:31 -0200	[thread overview]
Message-ID: <bbd2ef02-e1f3-6b86-0d13-3c78ee2cc78d@debian.org> (raw)
In-Reply-To: <8736rw2g18.fsf@concordia.ellerman.id.au>

hi Michael,

On 11/20/18 8:34 AM, Michael Ellerman wrote:
> Hi Breno,
> 
> Thanks for chasing this one down.
> 
> Breno Leitao <leitao@debian.org> writes:
> 
>> On a signal handler return, the user could set a context with MSR[TS] bits
>> set, and these bits would be copied to task regs->msr.
>>
>> At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
>> several __get_user() are called and then a recheckpoint is executed.
>>
>> This is a problem since a page fault (in kernel space) could happen when
>> calling __get_user(). If it happens, the process MSR[TS] bits were
>> already set, but recheckpoint was not executed, and SPRs are still invalid.
>>
>> The page fault can cause the current process to be de-scheduled, with
>> MSR[TS] active and without tm_recheckpoint() being called.  More
>> importantly, without TEXAR[FS] bit set also.
>>
>> Since TEXASR might not have the FS bit set, and when the process is
>> scheduled back, it will try to reclaim, which will be aborted because of
>> the CPU is not in the suspended state, and, then, recheckpoint. This
>> recheckpoint will restore thread->texasr into TEXASR SPR, which might be
>> zero, hitting a BUG_ON().
>>
>> 	[ 2181.457997] kernel BUG at arch/powerpc/kernel/tm.S:446!
> 
> As Mikey said, would be good to have at least the stack trace & NIP
> here, if not the full oops.

Ack!

>> This patch simply delays the MSR[TS] set, so, if there is any page fault in
>> the __get_user() section, it does not have regs->msr[TS] set, since the TM
>> structures are still invalid, thus avoiding doing TM operations for
>> in-kernel exceptions and possible process reschedule.
>>
>> With this patch, the MSR[TS] will only be set just before recheckpointing
>> and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
>> invalid state.
> 
> To make this safe when PREEMPT is enabled we need to preempt_disable() /
> enable() around the setting of regs->msr and the recheckpoint.
> 
> That could also serve as nice documentation.
> 
> I guess the other question is whether it should be the job of
> tm_recheckpoint() to set regs->msr, given that it already hard disables
> interrupts.
> 
> eg. we'd set the TM flags in a local msr variable and pass the to
> tm_recheckpoint(), it would then assign that to regs->msr in the IRQ
> disabled section. Though there's many callers of tm_recheckpoint() that
> don't need that behaviour, so it would probably need to be factored out.

Right, that might be doable, but I am wondering if it isn't better to create
a new function that does it as below:

void tm_set_msr_and_recheckpoint(struct pt_regs *regs, u64 msr)
{
       preempt_disable();
       regs->msr = msr;
       tm_recheckpoint();
       preempt_enable();
}

I understand that preempt_disable() does a better work disabling preemption
compared to disabling IRQ. Also, it does not change the API just for this
very specific signal case.

>> It is not possible to move tm_recheckpoint to happen earlier, because it is
>> required to get the checkpointed registers from userspace, with
>> __get_user(), thus, the only way to avoid this undesired behavior is
>> delaying the MSR[TS] set, as done in this patch.
> 
> I think the root cause here is that we're copying into the live regs of
> current. That has obviously worked in the past, because the register
> state wasn't used until we returned back to userspace. But that's no
> longer true with TM. And even so it's quite subtle. I also suspect some
> of our FP/VEC handling may not work correctly if we're scheduled part
> way through restoring the regs.

I got your point and I think we have a risk of corrupting the regs if MSR[TS]
is set and we do page_fault->recheckpoint/recheclaim in the middle of
__get_user() chunks.

> What might work better is if we copy all the regs into temporary space
> and then with interrupts disabled we copy them into the task. That way
> we should never be scheduled with a half-populated set of regs.

Yes, that seems to be the best strategy, and I might be interested in working
on it.

> That's obviously a much bigger patch though and something we'll have to
> do later.
> 
>> Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
>> Cc: stable@vger.kernel.org (v3.9+)
>> Signed-off-by: Breno Leitao <leitao@debian.org>
>> ---
>>  arch/powerpc/kernel/signal_64.c | 29 +++++++++++++++--------------
>>  1 file changed, 15 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
>> index 83d51bf586c7..15b153bdd826 100644
>> --- a/arch/powerpc/kernel/signal_64.c
>> +++ b/arch/powerpc/kernel/signal_64.c
>> @@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
>>  	if (MSR_TM_RESV(msr))
>>  		return -EINVAL;
>>  
>> -	/* pull in MSR TS bits from user context */
>> -	regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
>> -
>> -	/*
>> -	 * Ensure that TM is enabled in regs->msr before we leave the signal
>> -	 * handler. It could be the case that (a) user disabled the TM bit
>> -	 * through the manipulation of the MSR bits in uc_mcontext or (b) the
>> -	 * TM bit was disabled because a sufficient number of context switches
>> -	 * happened whilst in the signal handler and load_tm overflowed,
>> -	 * disabling the TM bit. In either case we can end up with an illegal
>> -	 * TM state leading to a TM Bad Thing when we return to userspace.
>> -	 */
>> -	regs->msr |= MSR_TM;
>> -
>>  	/* pull in MSR LE from user context */
>>  	regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
>>  
>> @@ -572,6 +558,21 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
>>  	tm_enable();
>>  	/* Make sure the transaction is marked as failed */
>>  	tsk->thread.tm_texasr |= TEXASR_FS;
>> +
> 
> 	preempt_disable();
> 
>> +	/* pull in MSR TS bits from user context */
>> +	regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
>> +
>> +	/*
>> +	 * Ensure that TM is enabled in regs->msr before we leave the signal
>> +	 * handler. It could be the case that (a) user disabled the TM bit
>> +	 * through the manipulation of the MSR bits in uc_mcontext or (b) the
>> +	 * TM bit was disabled because a sufficient number of context switches
>> +	 * happened whilst in the signal handler and load_tm overflowed,
>> +	 * disabling the TM bit. In either case we can end up with an illegal
>> +	 * TM state leading to a TM Bad Thing when we return to userspace.
>> +	 */
>> +	regs->msr |= MSR_TM;
>> +
>>  	/* This loads the checkpointed FP/VEC state, if used */
>>  	tm_recheckpoint(&tsk->thread);
>>
> 	preempt_enable();
> 
> Although looking at the code that follows, it probably won't cope with
> being preempted either. So the preempt_enable() should probably go at
> the end of the function.

Why? I confess I do not understand the preempt mechanism a lot. Does a
preemption save/restore FP and vector states when it is preempted?

What is the code/IRQ that is executed when there is a preemption? I am
wondering if a preempt happens because the  Decrementer (0x900) IRQ  happens
and a syscall is being executed, thus, saving the whole context and then
restoring it. If my guess is correct, the IRQ might not save the FP/VEC
states, thus, your concern about ahving load_fp_state() after a preemption.

Let me paste the "patched" code here to help with the discussion:

        regs->msr |= MSR_TM;

        /* This loads the checkpointed FP/VEC state, if used */
        tm_recheckpoint(&tsk->thread);

        msr_check_and_set(msr & (MSR_FP | MSR_VEC));
        if (msr & MSR_FP) {
                load_fp_state(&tsk->thread.fp_state);
                regs->msr |= (MSR_FP | tsk->thread.fpexc_mode);
        }
        if (msr & MSR_VEC) {
                load_vr_state(&tsk->thread.vr_state);
                regs->msr |= MSR_VEC;
        }

Thanks
Breno

      reply	other threads:[~2018-11-22  5:07 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-19 12:44 [PATCH] powerpc/tm: Set MSR[TS] just prior to recheckpoint Breno Leitao
2018-11-20 10:34 ` Michael Ellerman
2018-11-21 18:32   ` Breno Leitao [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bbd2ef02-e1f3-6b86-0d13-3c78ee2cc78d@debian.org \
    --to=leitao@debian.org \
    --cc=gromero@linux.vnet.ibm.com \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mikey@neuling.org \
    --cc=mpe@ellerman.id.au \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox