public inbox for linux-kernel@vger.kernel.org
From: Masami Hiramatsu <mhiramat@kernel.org>
To: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: linux-kernel@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"Steven Rostedt (VMware)" <rostedt@goodmis.org>,
	Jiri Kosina <jkosina@suse.cz>,
	Josh Poimboeuf <jpoimboe@redhat.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Chris von Recklinghausen <crecklin@redhat.com>,
	Jason Baron <jbaron@akamai.com>, Scott Wood <swood@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Clark Williams <williams@redhat.com>,
	x86@kernel.org
Subject: Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations
Date: Mon, 28 Jan 2019 22:52:54 +0900
Message-ID: <20190128225254.6a30448ad13577c865f2bf69@kernel.org>
In-Reply-To: <82bd24f1-305d-fa0c-aa13-3eac4a31fb93@redhat.com>

On Sat, 26 Jan 2019 12:52:15 +0100
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> On 1/23/19 6:15 AM, Masami Hiramatsu wrote:
> > Hi Daniel,
> > 
> > On Fri, 21 Dec 2018 11:27:32 +0100
> > Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> > 
> >> Currently, the patch of an address is done in three steps:
> >>
> >> -- Pseudo-code #1 - Current implementation ---
> >>         1) add an int3 trap to the address that will be patched
> >>             sync cores (send IPI to all other CPUs)
> >>         2) update all but the first byte of the patched range
> >>             sync cores (send IPI to all other CPUs)
> >>         3) replace the first byte (int3) by the first byte of replacing opcode
> >>             sync cores (send IPI to all other CPUs)
> >> -- Pseudo-code #1 ---
> >>
> >> When a static key has more than one entry, these steps are called once for
> >> each entry. The number of IPIs then is linear with regard to the number 'n' of
> >> entries of a key: O(n*3), which is O(n).
> >>
> >> This algorithm works fine for the update of a single key. But we think
> >> it is possible to optimize the case in which a static key has more than
> >> one entry. For instance, the sched_schedstats jump label has 56 entries
> >> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> >> which the thread that is enabling the key is _not_ running.
> >>
> >> With this patch, rather than receiving a single patch to be processed, a vector
> >> of patches is passed, enabling the rewrite of the pseudo-code #1 in this
> >> way:
> >>
> >> -- Pseudo-code #2 - This patch  ---
> >> 1)  for each patch in the vector:
> >>         add an int3 trap to the address that will be patched
> >>
> >>     sync cores (send IPI to all other CPUs)
> >>
> >> 2)  for each patch in the vector:
> >>         update all but the first byte of the patched range
> >>
> >>     sync cores (send IPI to all other CPUs)
> >>
> >> 3)  for each patch in the vector:
> >>         replace the first byte (int3) by the first byte of replacing opcode
> >>
> >>     sync cores (send IPI to all other CPUs)
> >> -- Pseudo-code #2 - This patch  ---
> >>
> >> Doing the update in this way, the number of IPIs becomes O(3) with regard
> >> to the number of entries of a key, which is O(1).
> >>
> >> The batch mode is done with the function text_poke_bp_batch(), that receives
> >> two arguments: a vector of "struct text_to_poke", and the number of entries
> >> in the vector.
> >>
> >> The vector must be sorted by the addr field of the text_to_poke structure,
> >> enabling the binary search of a handler in the poke_int3_handler function
> >> (a fast path).
> >>
> >> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Borislav Petkov <bp@alien8.de>
> >> Cc: "H. Peter Anvin" <hpa@zytor.com>
> >> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> >> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> >> Cc: Jiri Kosina <jkosina@suse.cz>
> >> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> >> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> >> Cc: Chris von Recklinghausen <crecklin@redhat.com>
> >> Cc: Jason Baron <jbaron@akamai.com>
> >> Cc: Scott Wood <swood@redhat.com>
> >> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> >> Cc: Clark Williams <williams@redhat.com>
> >> Cc: x86@kernel.org
> >> Cc: linux-kernel@vger.kernel.org
> >> ---
> >>  arch/x86/include/asm/text-patching.h |  15 ++++
> >>  arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++--
> >>  2 files changed, 117 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> >> index e85ff65c43c3..42ea7846df33 100644
> >> --- a/arch/x86/include/asm/text-patching.h
> >> +++ b/arch/x86/include/asm/text-patching.h
> >> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
> >>  #define __parainstructions_end	NULL
> >>  #endif
> >>  
> >> +/*
> >> + * Currently, the max observed size in the kernel code is
> >> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
> >> + * Raise it if needed.
> >> + */
> >> +#define POKE_MAX_OPCODE_SIZE	5
> >> +
> >> +struct text_to_poke {
> >> +	void *handler;
> >> +	void *addr;
> >> +	size_t len;
> >> +	const char opcode[POKE_MAX_OPCODE_SIZE];
> >> +};
> >> +
> >>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> >>  
> >>  /*
> >> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> >>  extern void *text_poke(void *addr, const void *opcode, size_t len);
> >>  extern int poke_int3_handler(struct pt_regs *regs);
> >>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> >> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
> >>  extern int after_bootmem;
> >>  
> >>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> >> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> >> index 6f5ad8587de0..8fa47e5ec709 100644
> >> --- a/arch/x86/kernel/alternative.c
> >> +++ b/arch/x86/kernel/alternative.c
> >> @@ -21,6 +21,7 @@
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/io.h>
> >>  #include <asm/fixmap.h>
> >> +#include <linux/bsearch.h>
> >>  
> >>  int __read_mostly alternatives_patched;
> >>  
> >> @@ -738,10 +739,32 @@ static void do_sync_core(void *info)
> >>  }
> >>  
> >>  static bool bp_patching_in_progress;
> >> +/*
> >> + * Single poke.
> >> + */
> >>  static void *bp_int3_handler, *bp_int3_addr;
> >> +/*
> >> + * Batching poke.
> >> + */
> >> +static struct text_to_poke *bp_int3_tpv;
> >> +static unsigned int bp_int3_tpv_nr;
> >> +
> >> +static int text_bp_batch_bsearch(const void *key, const void *elt)
> >> +{
> >> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
> >> +
> >> +	if (key < tp->addr)
> >> +		return -1;
> >> +	if (key > tp->addr)
> >> +		return 1;
> >> +	return 0;
> >> +}
> >>  
> >>  int poke_int3_handler(struct pt_regs *regs)
> >>  {
> >> +	void *ip;
> >> +	struct text_to_poke *tp;
> >> +
> >>  	/*
> >>  	 * Having observed our INT3 instruction, we now must observe
> >>  	 * bp_patching_in_progress.
> >> @@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
> >>  	if (likely(!bp_patching_in_progress))
> >>  		return 0;
> >>  
> >> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> >> +	if (user_mode(regs))
> >>  		return 0;
> >>  
> >> -	/* set up the specified breakpoint handler */
> >> -	regs->ip = (unsigned long) bp_int3_handler;
> >> +	/*
> >> +	 * Single poke first.
> >> +	 */
> > 
> > I wonder why you would separate single poke and batch poke.
> > It seems a single poke is just the case where bp_int3_tpv_nr == 1.
> 
> Hi Masami!
> 
> The single poke is used only at the boot time, before the system is able to
> allocate memory. After that, the batch mode becomes the default.

Hmm, what's the difference from text_poke_early()?

> 
> I was thinking of making one function for each method, but then I would have to
> change do_int3() and manage how to switch between one and the other without
> further overhead. I was planning to do this in a second round of improvements.

I wasn't thinking of such a big change.
I just thought we could allocate a single-entry array on the stack, something like:

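void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
{
	struct text_to_poke tp = {.handler = handler, .addr = addr, .len = len};

	/* The on-stack entry can hold at most POKE_MAX_OPCODE_SIZE bytes. */
	if (len > POKE_MAX_OPCODE_SIZE)
		return NULL;	/* the return type is a pointer, so no -E2BIG here */
	/* opcode[] is declared const in struct text_to_poke, hence the cast */
	memcpy((void *)tp.opcode, opcode, len);
	text_poke_bp_batch(&tp, 1);	/* returns void in this patch */
	return addr;
}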
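
To make the "single poke is a batch of one" idea concrete, here is a minimal
userspace sketch, not kernel code. Everything prefixed with mock_ and the
sync_rounds counter are hypothetical names made up for the illustration; the
counter stands in for "sync cores (send IPI to all other CPUs)", and opcode[]
is left non-const here only to keep the mock simple.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POKE_MAX_OPCODE_SIZE 5

/* Simplified copy of the struct from the patch (opcode[] is not const here). */
struct text_to_poke {
	void *handler;
	void *addr;
	size_t len;
	unsigned char opcode[POKE_MAX_OPCODE_SIZE];
};

/* Hypothetical counter: one increment per "sync cores" round. */
unsigned int sync_rounds;

void mock_sync_cores(void)
{
	sync_rounds++;
}

/* Mock of the batched routine: three phases, one sync per phase. */
void mock_text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries)
{
	unsigned int i;

	assert(tp != NULL);

	/* 1) put an int3 (0xcc) on the first byte of every site */
	for (i = 0; i < nr_entries; i++)
		((unsigned char *)tp[i].addr)[0] = 0xcc;
	mock_sync_cores();

	/* 2) update all but the first byte of every site */
	for (i = 0; i < nr_entries; i++)
		if (tp[i].len > 1)
			memcpy((unsigned char *)tp[i].addr + 1,
			       tp[i].opcode + 1, tp[i].len - 1);
	mock_sync_cores();

	/* 3) replace the int3 by the first byte of the new opcode */
	for (i = 0; i < nr_entries; i++)
		((unsigned char *)tp[i].addr)[0] = tp[i].opcode[0];
	mock_sync_cores();
}

/* Single poke expressed as a batch of one; NULL if the length is unusable. */
void *mock_text_poke_bp(void *addr, const void *opcode, size_t len,
			void *handler)
{
	struct text_to_poke tp = { .handler = handler, .addr = addr, .len = len };

	if (len == 0 || len > POKE_MAX_OPCODE_SIZE)
		return NULL;
	memcpy(tp.opcode, opcode, len);
	mock_text_poke_bp_batch(&tp, 1);
	return addr;
}
```

In this mock, poking 56 sites one at a time through mock_text_poke_bp() costs
56 * 3 = 168 sync rounds, while handing the same 56 entries to
mock_text_poke_bp_batch() in one vector costs 3, which matches the arithmetic
in the patch description.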

> 
> > If so, you can remove bp_int3_addr and this block.
> > 
> >> +	if (bp_int3_addr) {
> >> +		if (regs->ip == (unsigned long) bp_int3_addr) {
> >> +			regs->ip = (unsigned long) bp_int3_handler;
> >> +			return 1;
> >> +		}
> >> +		return 0;
> >> +	}
> >>
> >> -	return 1;
> >> +	/*
> >> +	 * Batch mode.
> >> +	 */
> >> +	if (bp_int3_tpv_nr) {
> > 
> > if (unlikely(bp_int3_tpv_nr))
> > 
> > Sorry about interrupting, but this is a "hot-path" when we use kprobes.
> 
> No problem at all! :-)

Thanks! :-)

> 
> I will change this function to better deal with the hot-path (the default mode
> after the system boots up).
> 
> how about something like this:
> ------------------ %< ------------------
> int poke_int3_handler(struct pt_regs *regs)
> {
>         void *ip;
>         struct text_to_poke *tp;
> 
>         /*
>          * Having observed our INT3 instruction, we now must observe
>          * bp_patching_in_progress.
>          *
>          *      in_progress = TRUE              INT3
>          *      WMB                             RMB
>          *      write INT3                      if (in_progress)
>          *
>          * Idem for bp_int3_handler.
>          */
>         smp_rmb();
> 
>         if (likely(!bp_patching_in_progress))
>                 return 0;
> 
>         if (user_mode(regs))
>                 return 0;
> 
>         /*
>          * Single poke is only used at the boot.
>          */
>         if (unlikely(!bp_int3_tpv))
>                 goto single_poke;
> 
>         ip = (void *) regs->ip - sizeof(unsigned char);
>         tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
>                      sizeof(struct text_to_poke),
>                      text_bp_batch_bsearch);
>         if (tp) {
>                 /* set up the specified breakpoint handler */
>                 regs->ip = (unsigned long) tp->handler;
>                 return 1;
>         }
> 
>         return 0;
> 
> single_poke:
>         if (regs->ip == (unsigned long) bp_int3_addr) {
>                 regs->ip = (unsigned long) bp_int3_handler;
>                 return 1;
>         }
> 
>         return 0;
> }
> ------------- >% ----------
> 
> In this way the default code comes first, and the only 'if' I am using tests a
> variable of the batch mode (which will be used later anyway). If we are still at
> boot, we jump to the end of the function.
> 
> look better?

Yeah, it looks much better. But I still wonder why you don't consolidate both,
if only to reduce the amount of code.
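
As a rough illustration of what the consolidated lookup could look like, here
is a userspace sketch, not kernel code. mock_poke_lookup() is a hypothetical
stand-in for the address-matching part of poke_int3_handler(), the struct is
reduced to the two fields the lookup needs, and the comparison goes through
uintptr_t only to keep the sketch portable C. A single poke is then simply a
vector with one entry, so no separate bp_int3_addr path is needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reduced version of the struct from the patch: only what the lookup needs. */
struct text_to_poke {
	void *handler;
	void *addr;
};

/* Vector being patched; it must be sorted by addr for bsearch() to work. */
struct text_to_poke *bp_int3_tpv;
unsigned int bp_int3_tpv_nr;

/* bsearch() comparator: the key is an address, the element a vector entry. */
int text_bp_batch_cmp(const void *key, const void *elt)
{
	const struct text_to_poke *tp = elt;
	uintptr_t k = (uintptr_t)key;
	uintptr_t a = (uintptr_t)tp->addr;

	if (k < a)
		return -1;
	if (k > a)
		return 1;
	return 0;
}

/*
 * trap_ip is the instruction pointer reported after the one-byte int3 has
 * executed, so the patched address is trap_ip - 1. Returns the handler to
 * redirect to, or NULL if the trap does not belong to a site being patched.
 */
void *mock_poke_lookup(void *trap_ip)
{
	void *ip = (unsigned char *)trap_ip - 1;
	struct text_to_poke *tp;

	if (!bp_int3_tpv_nr)
		return NULL;

	tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
		     sizeof(struct text_to_poke), text_bp_batch_cmp);
	return tp ? tp->handler : NULL;
}
```

With one entry, bsearch() degenerates to a single comparison, so the
single-poke case should not pay much for sharing the batch path; whether that
holds in the real int3 handler would of course need measuring.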

> 
> > 
> > Also, could you add NOKPROBE_SYMBOL(); for all symbols involved in this
> > process?
> > Recently I found I missed it for poke_int3_handler and sent a fix.
> > ( https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898241.html )
> > If this lengthens the call chain from poke_int3_handler, the newly called
> > functions must be marked with NOKPROBE_SYMBOL() too.
> 
> Ack! Doing that!

Thank you!

> 
> Thanks!


-- 
Masami Hiramatsu <mhiramat@kernel.org>

Thread overview: 27+ messages
2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 1/9] jump_label: Add for_each_label_entry helper Daniel Bristot de Oliveira
2019-04-19 18:36   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 2/9] jump_label: Add the jump_label_can_update_check() helper Daniel Bristot de Oliveira
2019-04-19 18:37   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 3/9] x86/jump_label: Move checking code away from __jump_label_transform() Daniel Bristot de Oliveira
2019-04-19 18:38   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 4/9] x86/jump_label: Add __jump_label_set_jump_code() helper Daniel Bristot de Oliveira
2019-04-19 18:38   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 5/9] x86/alternative: Split text_poke_bp() into tree steps Daniel Bristot de Oliveira
2019-04-19 18:39   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 6/9] jump_label: Sort entries of the same key by the code Daniel Bristot de Oliveira
2019-04-19 18:40   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 7/9] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
2019-01-23  5:15   ` Masami Hiramatsu
2019-01-26 11:52     ` Daniel Bristot de Oliveira
2019-01-28 13:52       ` Masami Hiramatsu [this message]
2019-02-01 12:49         ` Daniel Bristot de Oliveira
2019-02-01 14:47           ` Masami Hiramatsu
2019-02-01 15:51             ` Daniel Bristot de Oliveira
2019-04-19 18:41   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 8/9] jump_label: Batch updates if arch supports it Daniel Bristot de Oliveira
2019-04-19 18:42   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 9/9] x86/jump_label: Batch jump label updates Daniel Bristot de Oliveira
2019-04-19 18:42   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2019-01-21 12:52 ` [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
2019-04-19 19:01 ` Ingo Molnar
