Re: KRETPROBES are broken since kernel 5.8

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Adam Zabrocki <pi3@pi3.com.pl>
To: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	linux-kernel@vger.kernel.org,
	"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>,
	Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>,
	"David S. Miller" <davem@davemloft.net>,
	Solar Designer <solar@openwall.com>
Subject: Re: KRETPROBES are broken since kernel 5.8
Date: Thu, 10 Dec 2020 08:17:42 +0100	[thread overview]
Message-ID: <20201210071742.GA14484@pi3.com.pl> (raw)
In-Reply-To: <20201210102507.6bd47a08191852b9f8b5e3f0@kernel.org>

Hi,

On Thu, Dec 10, 2020 at 10:25:07AM +0900, Masami Hiramatsu wrote:
> Hi Adam,
> 
> Thank you for reporting and debugging, actually we had investigated this
> issue in Aug. Please read this thread.
> 
> https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad2825d3@trendmicro.com
> 

Thanks for the link. I've read the entire history of proposed fix - very 
informative :)

> We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI
> context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about
> kprobe nesting"). Yeah, it will be in the v5.10.
> 
> Could you try to test whether these commits can solve the issue?

I've applied these commits on the top of kernel 5.9.7 and verified that it 
works well. Non-optimized KRETPROBES are back to life.

However, there might be another issue which I wanted to brought / discuss - 
problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. 
'ftrace_enable_sysctl' function was correctly optimized without any problems. 
Since 5.9 it can't be optimized anynmore. I didn't see any changes in the 
sources regarding the optimizer, neither function itself.
When I looked at the generated vmlinux binary, I can see that GCC generated 
padding at the end of this function using INT3 opcode:

...
ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194 <ftrace_enable_sysctl+0x114>
ffffffff81305296:       cc                      int3
ffffffff81305297:       cc                      int3
ffffffff81305298:       cc                      int3
ffffffff81305299:       cc                      int3
ffffffff8130529a:       cc                      int3
ffffffff8130529b:       cc                      int3
ffffffff8130529c:       cc                      int3
ffffffff8130529d:       cc                      int3
ffffffff8130529e:       cc                      int3
ffffffff8130529f:       cc                      int3

Such padding didn't exist in this function in generated images for older 
kernels. Nevertheless, such padding is not unusual and it's pretty common.

Optimizer logic fails here:

try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
-> arch_prepare_optimized_kprobe() -> can_optimize():

    /* Decode instructions */
    addr = paddr - offset;
    while (addr < paddr - offset + size) { /* Decode until function end */
        unsigned long recovered_insn;
        if (search_exception_tables(addr))
            /*
             * Since some fixup code will jumps into this function,
             * we can't optimize kprobe in this function.
             */
            return 0;
        recovered_insn = recover_probed_instruction(buf, addr);
        if (!recovered_insn)
            return 0;
        kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
        insn_get_length(&insn);
        /* Another subsystem puts a breakpoint */
        if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
            return 0;
        /* Recover address */
        insn.kaddr = (void *)addr;
        insn.next_byte = (void *)(addr + insn.length);
        /* Check any instructions don't jump into target */
        if (insn_is_indirect_jump(&insn) ||
            insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
                     DISP32_SIZE))
            return 0;
        addr += insn.length;
    }

One of the check tries to protect from the situation when another subsystem 
puts a breakpoint there as well:

        /* Another subsystem puts a breakpoint */
        if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
            return 0;

However, that's not the case here. INT3_INSN_OPCODE is placed at the end of 
the function as padding (and protect from NOP-padding problems).

I wonder, if optimizer should take this special case into account? Is it worth 
to still optimize such functions when they are padded with INT3?

> If it is OK, we should backport those to stable tree.

Agreed.

Thanks,
Adam

> Thank you,
> 
> On Wed, 9 Dec 2020 22:50:01 +0100
> Adam Zabrocki <pi3@pi3.com.pl> wrote:
> 
> > Hello,
> > 
> > Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8,
> > when #DB exception was raised, entry to the NMI was not fully performed. Among
> > others, the following logic was executed:
> > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589
> > 
> >     if (!user_mode(regs)) {
> >         rcu_nmi_enter();
> >         preempt_disable();
> >     }
> > 
> > In some older kernels function ist_enter() was called instead. Inside this
> > function we can see the following logic:
> > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91
> > 
> >     if (user_mode(regs)) {
> >         RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> >     } else {
> >         /*
> >          * We might have interrupted pretty much anything.  In
> >          * fact, if we're a machine check, we can even interrupt
> >          * NMI processing.  We don't want in_nmi() to return true,
> >          * but we need to notify RCU.
> >          */
> >         rcu_nmi_enter();
> >     }
> > 
> >     preempt_disable();
> > 
> > As the comment says "We don't want in_nmi() to return true, but we need to
> > notify RCU.". However, since kernel 5.8 the logic of how interrupts are handled
> > was modified and currently we have this (function "exc_int3"):
> > https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630
> > 
> >     /*
> >      * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
> >      * can trigger INT3, hence poke_int3_handler() must be done
> >      * before. If the entry came from kernel mode, then use nmi_enter()
> >      * because the INT3 could have been hit in any context including
> >      * NMI.
> >      */
> >     if (user_mode(regs)) {
> >         idtentry_enter_user(regs);
> >         instrumentation_begin();
> >         do_int3_user(regs);
> >         instrumentation_end();
> >         idtentry_exit_user(regs);
> >     } else {
> >         nmi_enter();
> >         instrumentation_begin();
> >         trace_hardirqs_off_finish();
> >         if (!do_int3(regs))
> >             die("int3", regs, 0);
> >         if (regs->flags & X86_EFLAGS_IF)
> >             trace_hardirqs_on_prepare();
> >         instrumentation_end();
> >         nmi_exit();
> >     }
> > 
> > The root of unlucky change comes from this commit:
> > 
> > https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5
> > 
> > which later was modified by this commit:
> > 
> > https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2
> > 
> > and this is what we currently have in all kernels since 5.8. Essentially,
> > KRETPROBES are not working since these commits. We have the following logic:
> > 
> > asm_exc_int3() -> exc_int3():
> >                     |
> >     ----------------|
> >     |
> >     v
> > ...
> > nmi_enter();
> > ...
> > if (!do_int3(regs))
> >        |
> >   -----|
> >   |
> >   v
> > do_int3() -> kprobe_int3_handler():
> >                     |
> >     ----------------|
> >     |
> >     v
> > ...
> > if (!p->pre_handler || !p->pre_handler(p, regs))
> >                              |
> >     -------------------------|
> >     |
> >     v
> > ...
> > pre_handler_kretprobe():
> > ...
> >     if (unlikely(in_nmi())) {
> >         rp->nmissed++;
> >         return 0;
> >     }
> > 
> > Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before
> > invoking any registered kprobe verifies if it is not in NMI via in_nmi() call.
> > 
> > This bug was masked by OPTIMIZER which was turning most of the KPROBES to be
> > FTRACE so essentially if someone registered KRETPROBE, buggy code was not
> > invoked (FTRACE code was executed instead). However, the optimizer was changed
> > and can't optimize as many functions anymore (probably another bug which might
> > be worth to investigate) and this bug is more visible.
> > 
> > Thanks,
> > Adam
> > 
> > -- 
> > pi3 (pi3ki31ny) - pi3 (at) itsec pl
> > http://pi3.com.pl
> > 
> 
> 
> -- 
> Masami Hiramatsu <mhiramat@kernel.org>

-- 
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl

next prev parent reply	other threads:[~2020-12-10  7:18 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-09 21:50 KRETPROBES are broken since kernel 5.8 Adam Zabrocki
2020-12-10  1:25 ` Masami Hiramatsu
2020-12-10  7:17   ` Adam Zabrocki [this message]
2020-12-10 13:09     ` Masami Hiramatsu
2020-12-10 17:14       ` Adam Zabrocki
2020-12-11  1:44         ` Masami Hiramatsu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201210071742.GA14484@pi3.com.pl \
    --to=pi3@pi3.com.pl \
    --cc=anil.s.keshavamurthy@intel.com \
    --cc=bp@alien8.de \
    --cc=davem@davemloft.net \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhiramat@kernel.org \
    --cc=mingo@redhat.com \
    --cc=naveen.n.rao@linux.ibm.com \
    --cc=solar@openwall.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.